


Easily understand 4K HD images! This large multi-modal model automatically analyzes the content of web posters, making it very convenient for workers.
Apr 23, 2024, 08:04 AM
A large model that can automatically analyze the content of PDFs, web pages, posters, and Excel charts would be extremely convenient for office workers.
The InternLM-XComposer2-4KHD (abbreviated as IXC2-4KHD) model proposed by Shanghai AI Lab, the Chinese University of Hong Kong and other research institutions makes this a reality.
Whereas other multi-modal large models are limited to input resolutions of at most 1500x1500, this work raises the maximum input image size to beyond 4K (3840x1600) and supports arbitrary aspect ratios and dynamic resolutions ranging from 336 pixels to 4K.
Three days after its release, the model topped the Hugging Face popularity list for visual question answering models.
Easy 4K image understanding
Let's take a look at the results first.
The researchers fed the model a screenshot of the paper "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions" (resolution 2550x3300) and asked which model in the paper has the highest performance on MMBench.
It should be noted that this information is not mentioned in the text part of the input screenshot, but only appears in a rather complicated radar chart. Faced with such a tricky question, IXC2-4KHD successfully understood the information in the radar chart and answered the question correctly.
Faced with an image of even more extreme proportions (816x5133), IXC2-4KHD easily recognizes that the image consists of 7 parts and accurately describes the text content of each part.
The researchers then comprehensively tested IXC2-4KHD on 16 multi-modal large model benchmarks, 5 of which (DocVQA, ChartQA, InfographicVQA, TextVQA, OCRBench) focus on high-resolution image understanding.
With only 7B parameters, IXC2-4KHD matched or even surpassed GPT-4V and Gemini Pro on 10 of these benchmarks, showing that it is not limited to high-resolution image understanding but is also versatile across a variety of tasks and scenarios.
△With only 7B parameters, the performance of IXC2-4KHD is comparable to GPT-4V and Gemini-Pro.
How to achieve 4K dynamic resolution?
In order to achieve the goal of 4K dynamic resolution, IXC2-4KHD includes three main designs:
(1) Dynamic resolution training:
△4K resolution image processing strategy
In the IXC2-4KHD framework, the input image is randomly enlarged to an intermediate size between its original area and the maximum area (no more than 55x336x336 pixels, equivalent to a resolution of 3840x1617).
The image is then automatically cut into multiple 336x336 tiles, and visual features are extracted from each tile separately. This dynamic resolution training strategy allows the model to adapt to visual input of any resolution, while also compensating for the shortage of high-resolution training data.
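The tiling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the article only gives the tile size (336x336) and the tile budget (55), so the rule used here to pick the grid (the largest column count whose matching row count still fits the budget) is an assumption, and the names tile_grid and tile_boxes are hypothetical. The random enlargement used during training is left out.

```python
TILE = 336  # tile side length in pixels, as stated in the article


def tile_grid(width, height, max_tiles):
    """Pick (cols, rows) for an image, keeping cols * rows <= max_tiles.

    Assumed rule (not from the article): take the largest column count
    whose aspect-ratio-matching row count still fits the tile budget.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    # Fallback for very elongated images: one column, rows capped by the budget.
    best = (1, min(-(-height // width), max_tiles))
    cols = 1
    while True:
        rows = -(-cols * height // width)  # ceiling division
        if cols * rows > max_tiles:
            break
        best = (cols, rows)
        cols += 1
    return best


def tile_boxes(cols, rows):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = tile_grid(3840, 1617, 55)
print(cols, rows, len(tile_boxes(cols, rows)))
```

Under this assumed rule, a 3840x1617 input with a budget of 55 tiles yields an 11x5 grid, so the image would be resized to 3696x1680 before being cut into 55 tiles.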
Experiments show that as the upper limit of dynamic resolution increases, the model achieves steady performance improvements on high-resolution image understanding tasks (InfographicVQA, DocVQA, TextVQA), and performance has still not saturated at 4K resolution, which suggests potential for further gains at even higher resolutions.
(2) Add tile layout information:
To enable the model to adapt to widely varying dynamic resolutions, the researchers found that it is necessary to add tile layout information as an additional input. To achieve this, they adopted a simple strategy: a special "newline" ('\n') token is inserted after each row of tiles to inform the model of the tile layout. Experiments show that adding tile layout information has little impact on dynamic resolution training with relatively small variation (HD9, meaning that the number of tiles does not exceed 9), but brings significant performance improvements to dynamic 4K resolution training.
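As a rough sketch of the idea, the snippet below flattens a grid of per-tile tokens into a single sequence and appends a separator after each row of tiles. It uses plain Python lists and strings in place of real visual features; the function name and the "<nl>" placeholder are hypothetical, and how the actual model represents the newline token is not described in the article.

```python
NEWLINE = "<nl>"  # hypothetical stand-in for the special newline token


def flatten_with_newlines(tile_rows, newline=NEWLINE):
    """Flatten rows of tiles into one sequence, marking the end of each row.

    tile_rows is a list of rows; each row is a list of tiles; each tile is
    a list of tokens (strings here, feature vectors in a real model).
    """
    sequence = []
    for row in tile_rows:
        for tile_tokens in row:
            sequence.extend(tile_tokens)
        sequence.append(newline)  # tells the model where a tile row ends
    return sequence


# A 2x2 grid of tiles with two tokens per tile.
grid = [
    [["a1", "a2"], ["b1", "b2"]],
    [["c1", "c2"], ["d1", "d2"]],
]
print(flatten_with_newlines(grid))
```

Without the separators, a 2x8 grid and a 4x4 grid would produce sequences of the same length, and the model would have no direct signal for telling them apart.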
(3) Expanding the resolution during the inference phase
The researchers also found that a model trained with dynamic resolution can be scaled up directly at inference time by raising the maximum number of tiles, which brings additional performance gains. For example, evaluating a model trained with HD9 (up to 9 tiles) directly under the HD16 setting yields an improvement of up to 8% on InfographicVQA.
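To put the HD9 and HD16 settings in pixel terms, the small calculation below converts a tile budget into the largest image area it can cover, assuming every tile is 336x336 as described earlier. The function name pixel_budget is a hypothetical helper for illustration.

```python
TILE = 336  # tile side length in pixels, as stated in the article


def pixel_budget(max_tiles):
    """Largest image area (in pixels) covered by max_tiles tiles."""
    return max_tiles * TILE * TILE


for name, tiles in (("HD9", 9), ("HD16", 16), ("4K (55 tiles)", 55)):
    print(name, pixel_budget(tiles))
```

Moving from HD9 to HD16 therefore raises the pixel budget by a factor of 16/9 without any retraining, and the 55-tile budget matches the 3840x1617 figure quoted for the 4K setting.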
IXC2-4KHD raises the resolution supported by multi-modal large models to the 4K level. The researchers said that the current strategy of supporting larger image inputs by increasing the number of tiles runs into computational cost and GPU memory bottlenecks, so they plan to propose more efficient strategies to support higher resolutions in the future.
Paper link:
https://arxiv.org/pdf/2404.06512.pdf
Project link:
https://github.com/InternLM/InternLM-XComposer
