Here’s how OpenAI’s magical DALL·E image generator works

It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This month, it’s OpenAI’s new image-generating model, DALL·E.

This behemoth 12-billion-parameter neural network takes a text caption (e.g. “an armchair in the shape of an avocado”) and generates images to match it:

Generated images of avocado chairs

I think its pictures are pretty inspiring (I’d buy one of those avocado chairs), but what’s even more impressive is DALL·E’s ability to understand and render concepts of space, time, and even logic (more on that in a second).

In this post, I’ll give you a quick overview of what DALL·E can do, how it works, how it fits in with recent trends in ML, and why it’s significant. Away we go!

What is DALL·E and what can it do?

In July, DALL·E’s creator, the company OpenAI, released a similarly huge model called GPT-3 that wowed the world with its ability to generate human-like text, including op-eds, poems, sonnets, and even computer code. DALL·E is a natural extension of GPT-3 that parses text prompts and then responds not with words but with pictures. In one example from OpenAI’s blog, the model renders images from the prompt “a living room with two white armchairs and a painting of the colosseum. The painting is mounted above a modern fireplace”:

DALLE generated images

Pretty slick, right? You can probably already see how this might be useful for designers. Notice that DALL·E can generate a large set of images from a prompt. The pictures are then ranked by a second OpenAI model, called CLIP, that tries to determine which pictures match best.
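To make the generate-then-rank idea concrete, here is a minimal sketch in Python. It assumes the ranker turns the prompt and each candidate image into a vector and scores each pair by cosine similarity. The embeddings are made-up numbers and the function names are hypothetical; this is not OpenAI’s code, only an illustration of the ranking step.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(text_embedding, image_embeddings):
    """Return candidate names sorted from best to worst match with the text."""
    scores = {
        name: cosine_similarity(text_embedding, embedding)
        for name, embedding in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Made-up vectors standing in for the outputs of a text encoder and an
# image encoder (assumption: a real ranker would compute these itself).
prompt_embedding = [1.0, 0.0, 1.0]
candidates = {
    "image_a": [0.9, 0.1, 1.1],
    "image_b": [0.0, 1.0, 0.0],
    "image_c": [1.0, 1.0, 0.0],
}

ranking = rank_candidates(prompt_embedding, candidates)
print(ranking)  # ['image_a', 'image_c', 'image_b']
```

In a real pipeline, only the top few images in the ranking would be shown to the user.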

How was DALL·E built?

Unfortunately, we don’t have a ton of details on this yet because OpenAI has yet to publish a full paper. But at its core, DALL·E uses the same new neural network architecture that’s responsible for tons of recent advances in ML: the Transformer. Transformers, introduced in 2017, are an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets. They’ve been particularly revolutionary in natural language processing (they’re the basis of models like BERT, T5, GPT-3, and others), improving the quality of Google Search results, translation, and even predictions of protein structures.
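For readers who want a feel for why Transformers parallelize so easily, here is a rough sketch of their core operation, scaled dot-product attention, in plain Python. It is a toy under stated assumptions: it leaves out the learned projections, multiple heads, and masking that real models use, and it says nothing about DALL·E’s actual configuration. The point is that each position’s output depends only on the full set of keys and values, so every iteration of the outer loop could run at the same time.

```python
import math

def softmax(values):
    """Turn a list of scores into weights that sum to 1."""
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for query in queries:  # each query is independent of the others
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Three toy "token" vectors attending to each other (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_tokens = attention(tokens, tokens, tokens)
print(mixed_tokens)
```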

Most of these big language models are trained on enormous text datasets (like all of Wikipedia or crawls of the web). What makes DALL·E unique, though, is that it was trained on sequences that were a combination of words and pixels. We don’t yet know what the dataset was (it probably contained images and captions), but I can guarantee you it was probably massive.
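Here is one way to picture a training sequence that mixes words and pixels. This is a sketch under loud assumptions: the vocabulary sizes are tiny and hypothetical, the image is assumed to have already been turned into integer codes somehow, and the helper names are mine. The idea is simply that caption tokens and image tokens share one sequence, and the model learns to predict each next token from everything before it.

```python
TEXT_VOCAB_SIZE = 8   # hypothetical toy size, not DALL·E's real vocabulary
IMAGE_VOCAB_SIZE = 4  # hypothetical toy size

def build_sequence(text_tokens, image_tokens):
    """Concatenate caption and image tokens into one stream of ids.

    Image ids are shifted past the text vocabulary so the two kinds of
    token never collide.
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token out of range")
    if any(not 0 <= i < IMAGE_VOCAB_SIZE for i in image_tokens):
        raise ValueError("image token out of range")
    return list(text_tokens) + [TEXT_VOCAB_SIZE + i for i in image_tokens]

def next_token_pairs(sequence):
    """List (context, target) pairs for next-token prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

caption_tokens = [3, 5, 1]   # stand-in for a tokenized caption
image_tokens = [0, 2, 2, 1]  # stand-in for a tiny encoded image

sequence = build_sequence(caption_tokens, image_tokens)
pairs = next_token_pairs(sequence)
print(sequence)   # [3, 5, 1, 8, 10, 10, 9]
print(pairs[-1])  # ([3, 5, 1, 8, 10, 10], 9)
```

At generation time, the same setup runs in reverse: feed in only the caption tokens and let the model fill in the image tokens one at a time.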

How “smart” is DALL·E?

While these results are impressive, whenever we train a model on a huge dataset, the skeptical machine learning engineer is right to ask whether the results are high-quality merely because they’ve been copied or memorized from the source material.

To prove DALL·E isn’t just regurgitating images, the OpenAI authors forced it to render some pretty unusual prompts:

“A professional high quality illustration of a giraffe turtle chimera.”


“A snail made of a harp.”


It’s hard to imagine the model came across many giraffe-turtle hybrids in its training data set, making the results more impressive.

What’s more, these weird prompts hint at something even more fascinating about DALL·E: its ability to perform “zero-shot visual reasoning.”

Zero-Shot Visual Reasoning

Typically, in machine learning, we train models by giving them thousands or millions of examples of tasks we want them to perform and hope they pick up on the pattern.

To train a model that identifies dog breeds, for example, we might show a neural network thousands of pictures of dogs labeled by breed and then test its ability to tag new pictures of dogs. It’s a task with limited scope that seems almost quaint compared to OpenAI’s latest feats.

Zero-shot learning, on the other hand, is the ability of models to perform tasks that they weren’t specifically trained to do. For example, DALL·E was trained to generate images from captions. But with the right text prompt, it can also transform images into sketches:

Results from the prompt, “the exact same cat on the top as a sketch on the bottom”. From

DALL·E can also render custom text on street signs:

Results from the prompt “a store front that has the word ‘openai’ written on it”. From

In this way, DALL·E can act almost like a Photoshop filter, even though it wasn’t specifically designed to behave this way.

The model even shows an “understanding” of visual concepts (e.g. “macroscopic” or “cross-section” pictures), places (e.g. “a photo of the food of china”), and time (“a photo of alamo square, san francisco, from a street at night”; “a photo of a phone from the 20s”). For example, here’s what it spit out in response to the prompt “a photo of the food of china”:

“a photo of the food of china” from

In other words, DALL·E can do more than just paint a pretty picture for a caption; it can also, in a sense, answer questions visually.

To test DALL·E’s visual reasoning ability, the authors had it take a visual IQ test. In the examples below, the model had to complete the lower right corner of the grid, following the test’s hidden pattern.

A screenshot of the visual IQ test OpenAI used to test DALL·E from

“DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning,” write the authors, but it did better at some problems than others. When the puzzles’ colors were inverted, DALL·E did worse, “suggesting its capabilities may be brittle in unexpected ways.”

What does it mean?

What strikes me the most about DALL·E is its ability to perform surprisingly well on so many different tasks, ones the authors didn’t even anticipate:

“We find that DALL·E […] is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.”

It’s amazing, but not wholly unexpected; DALL·E and GPT-3 are two examples of a greater theme in deep learning: that extraordinarily big neural networks trained on unlabeled internet data (an example of “self-supervised learning”) can be highly versatile, able to do lots of things they weren’t specifically designed for.

Of course, don’t mistake this for general intelligence. It’s not hard to trick these types of models into looking pretty dumb. We’ll know more when they’re openly accessible and we can start playing around with them. But that doesn’t mean I can’t be excited in the meantime.

This article was written by Dale Markowitz, an Applied AI Engineer at Google based in Austin, Texas, where she works on applying machine learning to new fields and industries. She also likes solving her own life problems with AI, and talks about it on YouTube.

Published January 10, 2021 — 11:00 UTC
