Recently, new image generation models released by Google and OpenAI have attracted widespread attention, and their core technology differs fundamentally from that of earlier models. Ethan Mollick's article on One Useful Thing explores how these new models work and what they mean for human users. This article interprets Mollick's views.
The potential of multimodal image generation
Mollick points out that traditional image generation systems are the product of multiple models working together, rather than a single model completing every task.
"In the past, large language model (LLM) generated images were not done directly by LLM. AI would send text prompts to independent image generation tools and then display the results. AI was responsible for creating text prompts, while another system with weaker capabilities was responsible for generating images."
The diffusion model has become a thing of the past
Older models rely mainly on diffusion. A diffusion model is trained by adding noise to images and learning to reverse that process; at generation time it starts from pure noise and progressively removes it, guided by the prompt, until an image emerges that resembles the kinds of images it saw during training.
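As a rough illustration of what that looks like in code, the sketch below shows the general shape of a reverse-diffusion sampling loop. The denoiser here is a stub rather than a trained network, and the update rule is simplified; real models use learned noise schedules, but the structure of "start from noise, repeatedly subtract predicted noise under the guidance of the prompt" is the same.

```python
import numpy as np


def toy_denoiser(x, t, prompt_embedding):
    """Stand-in for the trained network that predicts the noise present in
    x at timestep t, conditioned on the prompt embedding. A real model is a
    large neural network; this stub just nudges values toward zero so the
    loop can run end to end."""
    return 0.1 * x


def sample(prompt_embedding, steps=50, shape=(64, 64, 3)):
    x = np.random.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        noise_estimate = toy_denoiser(x, t, prompt_embedding)
        x = x - noise_estimate  # each step strips away some predicted noise
    return x


image = sample(prompt_embedding=np.zeros(512))
print(image.shape, float(image.std()))
```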
The limitation of this approach is that the generated images involve no real reasoning or judgment from the model itself; they are essentially recombinations of visual patterns learned from the training data and struggle to convey genuinely useful information.
Advantages of multimodal generation
Today, the emergence of multimodal image generation has changed this situation completely.
Mollick gives an example: prompting the model to generate "a room with no elephant in it, annotated to explain why there is no elephant." Traditional models tend to produce images that do contain an elephant, because they cannot properly handle the negation in the prompt. Any text they render is often meaningless or made of invented glyphs, because the model's understanding of letters also comes purely from its training data.
A multimodal model, by contrast, can generate an image that actually meets the request and add annotations such as "the door is too small," explaining why there is no elephant in the room.
Prompt challenges with traditional models
A significant drawback of traditional models is that when asked to exclude an element, they often include it instead, because they cannot truly understand the instruction. In addition, every modification or adjustment regenerates the basic structure of the image: changing a character's hat, for example, can completely alter the character's appearance.
A multimodal image generation model, however, can make targeted adjustments while keeping the rest of the original result intact.
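The contrast can be illustrated with a deliberately simplified toy. Real systems operate on pixels or image tokens, not dictionaries of attributes, and both functions below are invented for illustration; the sketch only shows why re-rolling a whole generation changes unrelated details, while an edit applied in context can leave them untouched.

```python
import random


def diffusion_regenerate(prompt: str) -> dict:
    """Toy stand-in for the old approach: every call re-samples the whole
    image, so unrelated details drift along with the requested change."""
    return {
        "hat": "red" if "red hat" in prompt else "blue",
        "expression": random.choice(["smiling", "neutral"]),      # drifts each call
        "background": random.choice(["park", "street", "cafe"]),  # drifts each call
    }


def multimodal_edit(previous_image: dict, instruction: str) -> dict:
    """Toy stand-in for a multimodal edit: the previous image stays in the
    model's context, so only the requested attribute is changed."""
    edited = dict(previous_image)
    if "red hat" in instruction:
        edited["hat"] = "red"
    return edited


original = diffusion_regenerate("a character with a blue hat")
print(diffusion_regenerate("the same character with a red hat"))  # other details may change
print(multimodal_edit(original, "give them a red hat instead"))   # other details preserved
```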
Maintaining consistency across contexts
Mollick shows another example: an otter holding a specific item in one hand, which then reappears in different contexts and different styles while remaining recognizably the same. This demonstrates the fine-grained integration capabilities of multimodal image generators.
Complete presentation
Mollick also shows how to design a complete presentation using a multimodal model, such as a pitch recommending guacamole. Given only simple instructions, the model can search the web for relevant information, integrate it, and generate the final result.
As Mollick says, this will quickly displace a great deal of human work, and we need to think seriously about establishing a corresponding framework for it.