DeepSeek Janus Study Notes
Edited 2025-02-05 19:21:22
Paper: Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (https://arxiv.org/pdf/2411.07975)
multimodal large models: the two traditional development routes
multimodal understanding
Representative model: LLaVA
using a vision encoder as a bridge to enable large language models (LLMs) to understand images
visual generation (text-to-image)
diffusion-based
Representative model: Stable Diffusion
autoregressive
P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
Unified Multimodal Understanding and Generation: existing multimodal large models that unify understanding and generation
Connect multimodal understanding models with pretrained diffusion models
Representative model: Emu
uses the output of the LLM as a condition for a pretrained diffusion model, and then relies on the diffusion model to generate images.
Drawbacks
this approach cannot be considered a truly unified model, because the visual generation functionality is handled by the external diffusion model, while the multimodal LLM itself lacks the capability to directly generate images
Employ a single transformer to unify both multimodal understanding and generation tasks
Advantages
improves instruction-following for visual generation
unlocks potential emergent abilities
reduces model redundancy
Typical characteristic
Use a single vision encoder to process inputs for both tasks
Drawbacks (existing problems)
The representations required by multimodal understanding and generation tasks differ significantly
In multimodal understanding tasks, the purpose of the vision encoder is to extract high-level semantic information (e.g., object categories or visual attributes within an image). The output of understanding tasks not only involves extracting information from images but also complex semantic reasoning, so the granularity of the vision encoder's representation tends to focus on high-dimensional semantic representation.
In visual generation tasks, the main focus is on generating local details and maintaining global consistency in the image. The representation in this context necessitates a low-dimensional encoding capable of expressing fine-grained spatial structure and textural detail.
Using a single feature-extraction scheme for both understanding and generation creates conflicting requirements, because the two tasks need features at different granularities. Unifying the representations of these two tasks within the same space leads to conflicts and trade-offs; consequently, existing unified models for multimodal understanding and generation often compromise on multimodal understanding performance, falling markedly short of state-of-the-art multimodal understanding models.
Unified multimodal framework that decouples visual encoding for multimodal understanding and generation
Representative model: Janus
two independent visual encoding pathways: one for multimodal understanding and one for multimodal generation, unified by the same transformer architecture
优点
Janus alleviates the conflict stemming from the different granular needs of multimodal understanding and generation and eliminates the need to make trade-offs between two tasks when selecting visual encoders.
Janus is flexible and extensible. After decoupling, both the understanding and generation tasks can adopt state-of-the-art encoding techniques specific to their domain. Moreover, it is possible for Janus to accommodate additional input types in the future, such as point clouds, EEG signals, or audio data, where independent encoders can extract features and then use a unified transformer to process them.
Main contributions
The first to highlight the importance of decoupling visual encoding within the unified multimodal understanding and generation framework.
Janus model architecture
text understanding
built-in tokenizer of the LLM
converts text into discrete IDs
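A minimal sketch of what "convert text into discrete IDs" means. The real model uses the LLM's own subword tokenizer; the whitespace vocabulary and function names below are purely illustrative:

```python
# Toy stand-in for an LLM tokenizer: map whitespace tokens to integer IDs.
def build_vocab(corpus):
    """Assign a unique integer ID to each distinct whitespace token."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Replace each token with its ID; unknown tokens get -1."""
    return [vocab.get(w, -1) for w in text.split()]

vocab = build_vocab("a cat sits on a mat")
print(tokenize("a cat on a mat", vocab))  # -> [0, 1, 3, 0, 4]
```

The autoregressive transformer then operates on these integer sequences, exactly as it does on the discrete image IDs produced by the VQ tokenizer.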
multimodal understanding
SigLIP
encoder to extract high-dimensional semantic features from images
visual generation
VQ tokenizer
convert images into discrete IDs
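A toy sketch of the VQ step: each continuous image-patch feature is replaced by the ID of its nearest codebook vector. The real VQ tokenizer is learned end-to-end; all sizes and the random features here are made up for illustration:

```python
import numpy as np

# Illustrative VQ quantization: nearest-codebook lookup per patch.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 code vectors of dimension 4
patches = rng.normal(size=(9, 4))     # continuous features for 9 patches

# Squared L2 distance from every patch to every code vector, then argmin.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
ids = dists.argmin(axis=1)            # one discrete ID per patch
print(ids.shape)                      # (9,)
```

These IDs are what the transformer predicts autoregressively during generation; a decoder maps them back to pixels.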
The built-in prediction head of the LLM is utilized for text predictions in both the pure text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image predictions in the visual generation task. The entire model adheres to an autoregressive framework without the need for specially designed attention masks.
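The decoupled architecture can be sketched as follows, with made-up toy dimensions: two independent visual encoders feed one shared autoregressive transformer, whose output goes to the LLM's built-in head for text prediction and to a separately initialized head for image-token prediction. Every weight matrix here is a random stand-in, not the actual components:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8                                    # shared transformer hidden size
W_trunk = rng.normal(size=(D, D))        # stand-in for the shared LLM trunk

def shared_transformer(x):
    """Placeholder for the single transformer both tasks share."""
    return np.tanh(x @ W_trunk)

W_understand = rng.normal(size=(6, D))   # stand-in for SigLIP + adaptor
W_generate = rng.normal(size=(4, D))     # stand-in for VQ-code embeddings
text_head = rng.normal(size=(D, 100))    # built-in head over the text vocab
image_head = rng.normal(size=(D, 16))    # random-init head over image codes

sem_feats = rng.normal(size=(5, 6))      # understanding: continuous features
img_tokens = rng.integers(0, 4, size=7)  # generation: discrete VQ IDs

text_logits = shared_transformer(sem_feats @ W_understand) @ text_head
image_logits = shared_transformer(np.eye(4)[img_tokens] @ W_generate) @ image_head
print(text_logits.shape, image_logits.shape)  # (5, 100) (7, 16)
```

The key design point this illustrates: the encoders differ per task, but both paths pass through the same trunk, so understanding and generation share parameters without sharing a visual representation.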
Experimental results
Multimodal understanding: Janus achieves the highest scores among models of comparable parameter scale
Visual generation: compared with image-generation models of comparable parameter scale, Janus achieves the best overall scores
Ablation studies: experiments show that using a single shared encoder for both multimodal understanding and generation performs worse than decoupling into separate understanding and generation encoders
Possible directions for testing
Stronger multimodal understanding ability
Could be used in place of human raters for subjective scoring of text-to-image results
The 7B model's input and output image resolution is limited to 384 × 384, which constrains its multimodal understanding ability
In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.
Other information
https://github.com/deepseek-ai/Janus