
Table of Contents
Some personal thoughts of the author
What is DrivingDiffusion?
How does DrivingDiffusion solve the above problems?
Consistency Module & Local Prompt
The significance and future work of DrivingDiffusion

The first multi-view autonomous driving scene video generation world model | DrivingDiffusion: New ideas for BEV data and simulation

Oct 23, 2023 am 11:13 AM

Some personal thoughts of the author

In autonomous driving, as BEV-based sub-tasks and end-to-end solutions develop, high-quality multi-view training data and the corresponding simulation scene construction are becoming increasingly important. Given the pain points of current tasks, "high quality" can be decoupled into three aspects:

  1. Long-tail scenarios in different dimensions: for example, close-range vehicles in obstacle data and precise heading angles during cut-ins, or lane-line data with varying curvature and hard-to-collect ramp and merge scenes. These usually depend on large-scale data collection and complex data mining strategies, which are costly.
  2. High consistency between 3D ground truth and images: current BEV data acquisition is often affected by errors in sensor installation and calibration, in high-precision maps, and in the reconstruction algorithm itself. This makes it difficult to ensure that each [3D ground truth, image, sensor parameters] tuple in the data is accurate and consistent.
  3. Temporal data that satisfies the above conditions: multi-view images of consecutive frames with the corresponding ground truth, which are indispensable for current perception, prediction, decision-making, end-to-end and other tasks.

For simulation, generating videos that meet the above conditions directly from a layout is undoubtedly the most direct way to construct multi-agent sensor input. DrivingDiffusion addresses these problems from a new perspective.

What is DrivingDiffusion?

  • DrivingDiffusion is a diffusion model framework for autonomous driving scene generation. It implements layout-controlled multi-view image generation and video generation, and achieves SOTA on each.
  • DrivingDiffusion-Future, as an autonomous driving world model, can predict future scene videos from a single frame and use language prompts to influence the motion planning of the ego vehicle and other vehicles.

What do DrivingDiffusion's generated results look like?

Interested readers can first take a look at the project homepage: https://drivingdiffusion.github.io

(1) DrivingDiffusion

Layout-controlled multi-view image generation

The figure shows the multi-view image generation effect with layout projection as input.

Adjust the layout: Precisely control the generated results

The upper part of the figure shows the diversity of the generated results and the importance of the module designs described below. The lower part shows the results of perturbing the vehicle directly behind, including moving it, turning it, making it collide, and even floating it in the air.

Layout controlled multi-view video generation

Top: Video generation results of DrivingDiffusion after training on nuScenes data. Bottom: Video generation results of DrivingDiffusion after training on a large amount of private real-world data.

(2) DrivingDiffusion-Future

Generate subsequent frames from the input frame and a text description

A single frame is used as input, and the subsequent driving-scene frames are constructed from text descriptions of the ego vehicle or other vehicles. The first three rows of the figure show the results of controlling the ego vehicle's behavior through text descriptions, and the fourth row shows the same for other vehicles. (The green box is the input; the blue box is the output.)

Generate subsequent frames directly based on the input frame

No other controls are needed: a single frame is used as input to predict the subsequent frames of the driving scene. (The green box is the input; the blue box is the output.)

How does DrivingDiffusion solve the above problems?

DrivingDiffusion first artificially constructs all 3D ground truth (obstacles and road structures) in the scene. The ground truth is then projected into layout images, which serve as the model input for producing realistic images/videos from multiple camera views. The reason 3D ground truth (BEV views or encoded instances) is not fed to the model directly, and is instead projected with the camera parameters first, is to eliminate systematic 3D-2D consistency errors. (In such a dataset, both the 3D ground truth and the vehicle parameters are constructed artificially according to actual needs. The former makes it possible to construct rare-scene data at will; the latter eliminates the geometric consistency errors of traditional data production.)
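
The projection step described above can be illustrated with a minimal sketch. It assumes a distortion-free pinhole camera, a 4x4 ego-to-camera transform, and a camera frame whose z axis points forward; the function names `box_corners` and `project_points` are hypothetical and are not taken from the DrivingDiffusion code.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners (8x3) of a 3D box; size = (length, width, height), yaw about z."""
    length, width, height = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * length / 2
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * width / 2
    z = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * height / 2
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (rot @ np.vstack([x, y, z])).T + np.asarray(center, dtype=float)

def project_points(points_ego, ego_to_cam, intrinsics):
    """Project Nx3 ego-frame points to pixels with an assumed pinhole model.

    Returns (pixels Nx2, valid N), where valid marks points in front of the camera.
    """
    n = points_ego.shape[0]
    homo = np.hstack([points_ego, np.ones((n, 1))])
    cam = (ego_to_cam @ homo.T).T[:, :3]          # points in the camera frame
    valid = cam[:, 2] > 1e-6                      # positive depth only
    depth = np.where(valid, cam[:, 2], 1.0)       # avoid dividing by non-positive depth
    uvw = (intrinsics @ cam.T).T
    return uvw[:, :2] / depth[:, None], valid
```

Drawing the projected box edges and road structure into one image per camera would then give the layout images. Because the same camera parameters are used for projection and are treated as the parameters of the generated images, the layout and the parameters agree by construction.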

One question remains: can the quality of the generated images/videos meet the usage requirements?

When it comes to constructing scenarios, a simulation engine is usually the first thing that comes to mind. However, there is a large domain gap between the data it generates and real data. The outputs of GAN-based methods often deviate from the distribution of real data. Diffusion models generate data by learning to remove noise along a Markov chain; their outputs have higher fidelity and are better suited as a substitute for real data.
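
As a reminder of what the Markov chain looks like, here is a minimal sketch of the standard DDPM forward (noising) process in its closed form. This is the textbook formulation, not code from DrivingDiffusion; the linear beta schedule and its default values are assumptions chosen for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise
```

During training, a denoising network is asked to predict `noise` from `xt` and `t`; generation then runs the chain in reverse, starting from pure noise.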

DrivingDiffusion directly generates sequential multi-view images from artificially constructed scenes and vehicle parameters. These can serve as training data for downstream autonomous driving tasks, and can also be used to build a simulation system that provides feedback to autonomous driving algorithms.

The "artificially constructed scene" here contains only obstacle and road structure information, but DrivingDiffusion's framework can easily incorporate layout information such as signboards, traffic lights, and construction areas, and even low-level control modes such as occupancy grids and depth maps.

Overview of DrivingDiffusion method

There are several difficulties when generating multi-view videos:

  • Compared with common image generation, multi-view video generation adds two new dimensions: view and time. How can a framework be designed to generate long videos? How can cross-view and cross-frame consistency be maintained?
  • From the perspective of autonomous driving tasks, instances in the scene are crucial. How can the quality of generated instances be ensured?

DrivingDiffusion mainly designs a general training framework. It uses the stable-diffusion-v1-4 model as the pre-trained image model and uses 3D pseudo-convolutions to inflate the original image inputs, yielding a 3D-Unet that handles the new view or time dimension. Having obtained diffusion models that handle the new dimensions, it extends the video through alternating iterations, ensuring both short-term and long-term overall consistency. In addition, DrivingDiffusion proposes the Consistency Module and Local Prompt, which address cross-view/cross-frame consistency and instance quality respectively.

DrivingDiffusion's long video generation process

  1. Single-frame multi-view model: generate the multi-view key frames.
  2. Single-view temporal model, shared across views, with the key frames as additional control: perform temporal expansion on each view in parallel.
  3. Single-frame multi-view model with the generated results as additional control: fine-tune the subsequent frames in parallel along the time axis.
  4. Determine new key frames and extend the video through a sliding window.
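
The four steps can be sketched as a control loop. This is only an illustrative skeleton under stated assumptions: the three models are passed in as hypothetical callables, the per-view parallelism of step 2 is hidden inside `temporal_model`, and choosing the last refined frame of each window as the next key frame is an assumption, not something the article specifies.

```python
from typing import Any, Callable, List, Sequence

def generate_long_video(
    layouts: Sequence[Any],
    multiview_model: Callable[[Any], Any],
    temporal_model: Callable[[Any, Sequence[Any]], List[Any]],
    refine_model: Callable[[Any, Any, Any], Any],
    window: int = 4,
) -> List[Any]:
    """Return one multi-view frame per layout, generated window by window."""
    if not layouts:
        raise ValueError("need at least one layout")
    if window < 2:
        raise ValueError("window must cover the key frame plus at least one new frame")
    key = multiview_model(layouts[0])              # step 1: multi-view key frame
    video = [key]
    start = 0
    while start < len(layouts) - 1:
        end = min(start + window, len(layouts))
        chunk = layouts[start + 1:end]
        frames = temporal_model(key, chunk)        # step 2: temporal expansion from the key frame
        frames = [refine_model(key, frame, layout)  # step 3: cross-view fine-tuning per frame
                  for frame, layout in zip(frames, chunk)]
        video.extend(frames)
        key = frames[-1]                           # step 4: new key frame (assumed: last frame)
        start = end - 1                            # slide the window forward
    return video
```

With 10 layouts and `window=4`, the loop runs three times and each window reuses the previous window's last frame as its key frame, so consecutive windows overlap by one frame.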

Training framework for cross-view models and temporal models

  • For the multi-view model and the temporal model, the extended dimension of the 3D-Unet is view and time respectively. Both share the same layout controller. The author believes that subsequent frames can obtain scene information from the multi-view key frames and implicitly learn the associations between different targets. The two models use different consistency attention modules and the same Local Prompt module.
  • Layout encoding: obstacle category/instance information and the road structure segmentation layout are encoded into RGB images with different fixed encoding values, and layout tokens are output after encoding.
  • Key frame control: every temporal expansion step uses the multi-view images of a key frame, based on the assumption that subsequent frames within a short sequence can obtain information from the key frame. Every fine-tuning step uses the key frame and the multi-view images of a subsequent frame generated from it as additional control, and outputs multi-view images of that frame with improved cross-view consistency.
  • Optical flow prior for a specific view: for the temporal model, only data from a single view is sampled at a time during training. In addition, a precomputed optical flow prior for each pixel position in that view's images is used, and is encoded as a camera ID token that interacts with the hidden layers during the diffusion process in a manner similar to the time embedding.
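
To make the layout encoding bullet concrete, here is a minimal sketch that maps a per-pixel class map to an RGB image using fixed colors. The class IDs and the `PALETTE` values are hypothetical, instance encoding is omitted, and the subsequent step that turns the RGB layout into layout tokens is not shown.

```python
import numpy as np

# Hypothetical fixed encoding values; the actual palette is not given in the article.
PALETTE = {
    0: (0, 0, 0),        # background
    1: (255, 0, 0),      # vehicle
    2: (0, 255, 0),      # pedestrian
    3: (0, 0, 255),      # road structure (e.g. lane or drivable area)
}

def encode_layout(class_map):
    """Convert an HxW integer class map into an HxWx3 uint8 RGB layout image."""
    rgb = np.zeros(class_map.shape + (3,), dtype=np.uint8)
    for class_id, color in PALETTE.items():
        rgb[class_map == class_id] = color
    return rgb
```

Fixed colors keep the encoding deterministic, so the same category always appears to the model as the same pixel value regardless of the scene.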

Consistency Module & Local Prompt

The Consistency Module has two parts: the consistency attention mechanism and the consistency correlation loss.

The consistency attention mechanism focuses on the interaction between adjacent views and between temporally related frames. Specifically, for cross-view consistency, it only considers information exchange between overlapping left and right adjacent views. For the temporal model, each frame only attends to the key frame and the previous frame. This avoids the heavy computational load of global interactions.
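
The sparsity pattern described here can be written down as boolean attention masks. The sketch below is an illustration only: it assumes the cameras form a ring (so the first and last views are neighbours) and that every view or frame also attends to itself, neither of which the article states explicitly.

```python
import numpy as np

def adjacent_view_mask(num_views):
    """mask[i, j] is True if view i may attend to view j: itself or a ring neighbour."""
    idx = np.arange(num_views)
    diff = (idx[:, None] - idx[None, :]) % num_views
    return (diff == 0) | (diff == 1) | (diff == num_views - 1)

def temporal_mask(num_frames, key_index=0):
    """mask[i, j] is True if frame i may attend to frame j: itself, the key frame, or frame i-1."""
    mask = np.eye(num_frames, dtype=bool)
    mask[:, key_index] = True
    rows = np.arange(1, num_frames)
    mask[rows, rows - 1] = True
    return mask
```

With six surround cameras, each view attends to 3 of the 6 views instead of all of them, and in the temporal case the number of attended frames per row stays at most 3 no matter how long the clip is, which is where the computational saving comes from.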

The consistency correlation loss adds geometric constraints through pixel-level correlation and pose regression, with gradients provided by a pre-trained pose regressor. The regressor adds a pose regression head on top of LoFTR and is trained with ground-truth poses on the real data of the corresponding dataset. For the multi-view model and the temporal model, this module supervises the relative camera pose and the ego vehicle's motion pose respectively.

Local Prompt works together with the global prompt, reusing the parameter semantics of CLIP and stable-diffusion-v1-4 to locally enhance the regions of instances of specific categories. As shown in the figure, on top of the cross-attention between image tokens and the global text description prompt, the author designs a local prompt for a given category and uses the image tokens within that category's mask region to query the local prompt. This makes maximal use of the open-domain, text-guided image generation concepts already present in the original model parameters.
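
A rough sketch of the idea follows: all image tokens attend to the global prompt, and only the tokens inside a category's mask additionally query that category's local prompt. Several details are assumptions made for illustration: the prompt embeddings are used directly as keys and values without learned projections, and the local result is simply added to the global one. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; queries (N, D), keys and values (T, D)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def local_prompt_attention(image_tokens, global_prompt, local_prompt, region_mask):
    """Global cross-attention everywhere, plus local-prompt attention inside region_mask."""
    out = cross_attention(image_tokens, global_prompt, global_prompt)
    if region_mask.any():
        local = cross_attention(image_tokens[region_mask], local_prompt, local_prompt)
        out[region_mask] = out[region_mask] + local   # assumed fusion: simple addition
    return out
```

Tokens outside the mask are untouched by the local prompt, so a prompt such as a description of a car only influences the pixels where the layout places a car.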

Overview of DrivingDiffusion-Future method

For future scene construction tasks, DrivingDiffusion-Future uses two approaches. The first predicts subsequent frame images directly from the first frame image (the visual branch), with inter-frame optical flow as an auxiliary loss. This approach is relatively simple, but generating subsequent frames from text descriptions works only moderately well. The second adds a new concept branch on top of the first, which predicts the BEV views of subsequent frames from the first-frame BEV view, because predicting the BEV view helps the model capture the core information of the driving scene and establish concepts. In this case the text description acts on both branches simultaneously, and the features of the concept branch act on the visual branch through a BEV2PV view conversion module. Some parameters of the view conversion module are pre-trained using ground-truth images in place of the noise input, and are frozen during subsequent training. It is worth noting that the controller for ego-vehicle control text descriptions is decoupled from the controller for other-vehicle control and environment text descriptions.

Experimental Analysis

To evaluate the performance of the model, DrivingDiffusion uses frame-level Fréchet Inception Distance (FID) to evaluate the quality of the generated images, and correspondingly uses FVD to evaluate the quality of the generated videos. All metrics are calculated on the nuScenes validation set. As shown in Table 1, compared with BEVGen (image generation) and DriveDreamer (video generation) in autonomous driving scenarios, DrivingDiffusion performs better on these metrics under different settings.
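
For readers unfamiliar with FID, the quantity being computed is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The sketch below implements only this formula with NumPy; extracting Inception features is not shown, and the helper names are hypothetical rather than taken from any evaluation toolkit.

```python
import numpy as np

def feature_stats(features):
    """Mean and covariance of an (N, D) feature matrix, N >= 2."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for positive semi-definite S1, S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

Lower is better: identical statistics give a distance of 0, and the value grows as the means or covariances of the two feature sets drift apart.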

Although metrics such as FID are commonly used to measure image synthesis quality, they do not fully reflect the design goals of the task, nor the synthesis quality of different semantic categories. Since the task is to generate multi-view images consistent with 3D layouts, DrivingDiffusion proposes using BEV perception model metrics to measure consistency. Using the official CVT and BEVFusion models as evaluators, images are generated conditioned on the same real 3D layouts as the nuScenes validation set, CVT and BEVFusion inference is run on each set of generated images, and the predictions are compared with the ground truth, reporting the mean intersection over union (mIoU) of the drivable area and the NDS over all object classes. The statistics are shown in Table 2. The results show that the perception metrics on the synthetic evaluation set are very close to those on the real evaluation set, which reflects the high consistency between the generated results and the 3D ground truth, and the high fidelity of the images.
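
The drivable-area score can be illustrated with a small IoU helper. This is a generic sketch, not the evaluators' code: it assumes binary BEV masks, averages IoU over samples (the exact averaging convention used in the paper is not stated here), and treats two empty masks as a perfect match.

```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0                      # both masks empty: treated as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Average IoU over paired prediction and ground-truth masks."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))
```

Running the same perception model on real and on generated images and comparing the two scores then indicates how faithfully the generated images follow the layout.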

In addition to the above experiments, DrivingDiffusion conducted experiments on training with added synthetic data, targeting the main problem it sets out to solve: improving the performance of downstream autonomous driving tasks. Table 3 shows the performance improvements achieved by synthetic data augmentation in BEV perception tasks. The original training data has long-tail distribution problems, especially for small targets, close-range vehicles, and vehicle orientation angles. DrivingDiffusion focuses on generating additional data for these classes with limited samples. After adding 2000 frames of data focused on improving the distribution of obstacle orientation angles, NDS improved slightly, while mAOE dropped from 0.5613 to 0.5295. After using 6000 frames of more comprehensive synthetic data focused on rare scenes to assist training, a clear improvement can be observed on the nuScenes validation set: NDS increased from 0.412 to 0.434, and mAOE decreased from 0.5613 to 0.5130. This demonstrates the improvement that synthetic data augmentation can bring to perception tasks. Users can compute statistics on the distribution of each dimension in their data according to actual needs, and then supplement it with targeted synthetic data.
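
The last point, computing distribution statistics before deciding what to synthesize, might look like the following for obstacle orientation angles. The bin count, the 5% threshold, and the function name `rare_bins` are all hypothetical choices for illustration; the article does not describe how the authors selected their 2000 or 6000 frames.

```python
import numpy as np

def rare_bins(yaws, num_bins=8, min_fraction=0.05):
    """Histogram yaw angles (radians) and return (edges, fractions, under-represented bin indices)."""
    wrapped = (np.asarray(yaws, dtype=float) + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    counts, edges = np.histogram(wrapped, bins=num_bins, range=(-np.pi, np.pi))
    fractions = counts / max(counts.sum(), 1)
    rare = np.flatnonzero(fractions < min_fraction)
    return edges, fractions, rare
```

The returned bin indices identify the orientation ranges where layouts could be constructed and synthetic frames generated to rebalance the training set.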

The significance and future work of DrivingDiffusion

DrivingDiffusion realizes both multi-view video generation of autonomous driving scenes and future prediction, which is of great significance to autonomous driving tasks. The layouts and parameters are all constructed artificially, and the 3D-2D conversion is done by projection rather than by learnable model parameters, which eliminates the geometric errors of earlier data acquisition processes and gives the method strong practical value. At the same time, DrivingDiffusion is highly scalable: it supports new scene content layouts and additional controllers, and its generation quality can be losslessly improved through super-resolution and video frame interpolation techniques.

In autonomous driving simulation, there are more and more attempts to use NeRF. However, for street view generation, separating dynamic and static content, reconstructing large-scale blocks, and decoupling appearance control along dimensions such as weather all involve a huge amount of work. In addition, NeRF usually has to be trained on a specific range of scenes before it can support novel view synthesis in subsequent simulation. DrivingDiffusion naturally contains a certain amount of general prior knowledge, including visual-text connections and conceptual understanding of visual content, and can quickly create a scene on demand just by constructing the layout. However, as mentioned above, the entire process is relatively complex, and generating long videos requires fine-tuning and extension by post-processing models. DrivingDiffusion will continue to explore compressing the view and time dimensions, combining NeRF for novel view generation and conversion, and improving generation quality and scalability.
