LLaVA-Phi: Related Work to Get You Caught Up
by @textmodels




Abstract and 1 Introduction

2. Related Work

3. LLaVA-Phi and 3.1. Training

3.2. Qualitative Results

4. Experiments

5. Conclusion, Limitation, and Future Works and References

Rapid advances in Large Language Models (LLMs) have significantly propelled the development of LLM-based vision-language models. These models mark a departure from the capabilities of the pre-LLM era and offer advanced question-answering and visual comprehension skills. This progress is enabled by using LLMs as language encoding modules. Notable research in this domain includes the LLaVA family [24, 25, 26, 32], the BLIP family [8, 20], MiniGPT-4 [37], and others. Each has demonstrated significant advancements in managing visual-centric dialogues. However, a common limitation of these open-source Vision-Language Models (VLMs) is their substantial computational demand: they typically have between 7B and 65B parameters. This size poses challenges for deployment on edge or mobile devices, especially in real-time applications. Gemini [33], a leader in this field, comes in three versions, including the compact Gemini-Nano with 1.8B/3.25B parameters, tailored for smartphones. However, its models and data are not open-sourced. Another initiative, MobileVLM [6], has developed MobileLLaMA with 2.7B parameters to facilitate smaller vision-language models. Our paper explores and demonstrates the effectiveness of integrating vision-language models with open-source, smaller language models, assessing their potential and efficiency in a variety of applications.
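To make the deployment concern concrete, the parameter counts quoted above can be turned into a rough weight-memory estimate. The sketch below is illustrative only and is not taken from the paper: it assumes memory is simply parameters times bytes per parameter, and it ignores activations, the KV cache, and the vision encoder, so real usage would be higher. The function names and the table of bytes per data type are hypothetical helpers introduced for this example.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: only the language model weights are counted; activations,
# KV cache, and the vision encoder are ignored, so real usage is higher.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Return the approximate memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Parameter counts quoted in the text; the labels are illustrative.
MODEL_SIZES = {
    "Gemini-Nano (1.8B)": 1.8e9,
    "MobileLLaMA (2.7B)": 2.7e9,
    "Gemini-Nano (3.25B)": 3.25e9,
    "7B VLM": 7e9,
    "65B VLM": 65e9,
}


def footprint_table(dtype: str = "fp16") -> dict:
    """Map each model label to its estimated weight memory, rounded to 0.1 GiB."""
    return {
        name: round(weight_memory_gib(n, dtype), 1)
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, gib in footprint_table().items():
        print(f"{name:>20}: ~{gib} GiB in fp16")
```

Under these assumptions, a 7B-parameter model needs roughly 13 GiB for its fp16 weights alone, while a 2.7B-parameter model needs about 5 GiB, which helps explain why smaller language models are attractive for phones and other edge devices.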

Figure 1. LLaVA-Phi is adept at identifying and responding to complex questions with empathetic reasoning.

Figure 2. LLaVA-Phi can generate useful code based on visual input and commands.


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Yichen Zhu, Midea Group;

(2) Minjie Zhu, Midea Group and East China Normal University;

(3) Ning Liu, Midea Group;

(4) Zhicai Ou, Midea Group;

(5) Xiaofeng Mou, Midea Group.