DeepSeek Janus Study Notes
Edited 2025-02-05 19:21:22
Paper: Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (https://arxiv.org/pdf/2411.07975)
multimodal large models: the two traditional development routes
multimodal understanding
Representative model: LLaVA
using a vision encoder as a bridge to enable large language models (LLMs) to understand images
visual generation (text-to-image)
diffusion-based
Representative model: Stable Diffusion
autoregressive
P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
Unified Multimodal Understanding and Generation: existing multimodal large models that unify understanding and generation
Connect multimodal understanding models with pretrained diffusion models
Representative model: Emu
uses the output of the LLM as a condition for a pretrained diffusion model, and then relies on the diffusion model to generate images.
Drawbacks
this approach cannot be considered a truly unified model, because the visual generation functionality is handled by the external diffusion model, while the multimodal LLM itself lacks the capability to directly generate images
Employ a single transformer to unify both multimodal understanding and generation tasks
Advantages
improves instruction-following for visual generation
unlocks potential emergent abilities
reduces model redundancy
Typical characteristic
Use a single vision encoder to process inputs for both tasks
Drawbacks (existing problems)
The representations required by multimodal understanding and generation tasks differ significantly
In multimodal understanding tasks, the purpose of the vision encoder is to extract high-level semantic information (e.g., object categories or visual attributes within an image). The output of understanding tasks not only involves extracting information from images but also complex semantic reasoning, so the granularity of the vision encoder's representation tends to focus on high-dimensional semantic representation.
In visual generation tasks, the main focus is on generating local details and maintaining global consistency in the image. The representation in this context necessitates a low-dimensional encoding capable of expressing fine-grained spatial structure and textural detail.
Using a single feature-extraction scheme for both understanding and generation creates conflicting requirements, because the two tasks need features at different granularities. Unifying the representations of these two tasks within the same space leads to conflicts and trade-offs; consequently, existing unified models for multimodal understanding and generation often compromise on multimodal understanding performance, falling markedly short of state-of-the-art multimodal understanding models.
Unified multimodal framework that decouples visual encoding for multimodal understanding and generation
Representative model: Janus
two independent visual encoding pathways: one for multimodal understanding and one for multimodal generation, unified by the same transformer architecture
优点
Janus alleviates the conflict stemming from the different granular needs of multimodal understanding and generation and eliminates the need to make trade-offs between two tasks when selecting visual encoders.
Janus is flexible and extensible. After decoupling, both the understanding and generation tasks can adopt state-of-the-art encoding techniques specific to their domain. Moreover, it is possible for Janus to accommodate additional input types in the future, such as point clouds, EEG signals, or audio data, where independent encoders can extract features and then use a unified transformer to process them.
Main contributions
The first to highlight the importance of decoupling visual encoding within the unified multimodal understanding and generation framework.
Janus model architecture
text understanding
built-in tokenizer of the LLM
converts text into discrete IDs
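A minimal sketch of what "convert text into discrete IDs" means. The real model uses the LLM's own subword tokenizer; the whitespace vocabulary and function names below are purely illustrative:

```python
# Toy stand-in for an LLM tokenizer: map whitespace tokens to integer IDs.
def build_vocab(corpus):
    """Assign a unique integer ID to each distinct whitespace token."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Replace each token with its ID; unknown tokens get -1."""
    return [vocab.get(w, -1) for w in text.split()]

vocab = build_vocab("a cat sits on a mat")
print(tokenize("a cat on a mat", vocab))  # -> [0, 1, 3, 0, 4]
```

The autoregressive transformer then operates on these integer sequences, exactly as it does on the discrete image IDs produced by the VQ tokenizer.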
multimodal understanding
SigLIP
encoder to extract high-dimensional semantic features from images
visual generation
VQ tokenizer
convert images into discrete IDs
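A toy sketch of the VQ step: each continuous image-patch feature is replaced by the ID of its nearest codebook vector. The real VQ tokenizer is learned end-to-end; all sizes and the random features here are made up for illustration:

```python
import numpy as np

# Illustrative VQ quantization: nearest-codebook lookup per patch.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 code vectors of dimension 4
patches = rng.normal(size=(9, 4))     # continuous features for 9 patches

# Squared L2 distance from every patch to every code vector, then argmin.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
ids = dists.argmin(axis=1)            # one discrete ID per patch
print(ids.shape)                      # (9,)
```

These IDs are what the transformer predicts autoregressively during generation; a decoder maps them back to pixels.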
The built-in prediction head of the LLM is utilized for text predictions in both the pure text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image predictions in the visual generation task. The entire model adheres to an autoregressive framework without the need for specially designed attention masks.
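The decoupled architecture can be sketched as follows, with made-up toy dimensions: two independent visual encoders feed one shared autoregressive transformer, whose output goes to the LLM's built-in head for text prediction and to a separately initialized head for image-token prediction. Every weight matrix here is a random stand-in, not the actual components:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8                                    # shared transformer hidden size
W_trunk = rng.normal(size=(D, D))        # stand-in for the shared LLM trunk

def shared_transformer(x):
    """Placeholder for the single transformer both tasks share."""
    return np.tanh(x @ W_trunk)

W_understand = rng.normal(size=(6, D))   # stand-in for SigLIP + adaptor
W_generate = rng.normal(size=(4, D))     # stand-in for VQ-code embeddings
text_head = rng.normal(size=(D, 100))    # built-in head over the text vocab
image_head = rng.normal(size=(D, 16))    # random-init head over image codes

sem_feats = rng.normal(size=(5, 6))      # understanding: continuous features
img_tokens = rng.integers(0, 4, size=7)  # generation: discrete VQ IDs

text_logits = shared_transformer(sem_feats @ W_understand) @ text_head
image_logits = shared_transformer(np.eye(4)[img_tokens] @ W_generate) @ image_head
print(text_logits.shape, image_logits.shape)  # (5, 100) (7, 16)
```

The key design point this illustrates: the encoders differ per task, but both paths pass through the same trunk, so understanding and generation share parameters without sharing a visual representation.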
Experimental results
Multimodal understanding: Janus achieves the highest scores among models of comparable parameter scale
Visual generation: compared with image-generation models of comparable parameter scale, Janus achieves the best overall scores
Ablation studies: experiments show that using a single shared encoder for both multimodal understanding and generation performs worse than decoupling into separate understanding and generation encoders
Possible directions for testing
Stronger multimodal understanding ability
Could be used in place of human raters for subjective scoring of text-to-image results
The 7B model's input and output image resolution is limited to 384 × 384, which constrains its multimodal understanding ability
In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.
Other information
https://github.com/deepseek-ai/Janus