1. Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    The authors aim to align vision models with human aesthetic standards in a retrieval system. To achieve this, they use the reasoning ability of large language models (LLMs) to rephrase the search query and expand it with the aesthetic expectations it implies. Experiments demonstrate that this method significantly improves the aesthetic behavior of vision models under several metrics. The proposed algorithm can serve as a general practice for aligning vision models with human values.
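
    The core idea, rewriting the query with an LLM before retrieval and blending the results, is easy to picture in code. Below is a minimal sketch, not the authors' implementation: `llm_complete` and `retrieve_topk` are hypothetical stand-ins for an LLM call and a CLIP-style retrieval backend, and the score weighting is purely illustrative.

    ```python
    # Minimal sketch of LLM-based query expansion for aesthetics-aware retrieval.
    # `llm_complete` and `retrieve_topk` are hypothetical placeholders, not the paper's API.

    def llm_complete(prompt: str) -> str:
        """Stand-in for a call to an instruction-tuned LLM."""
        raise NotImplementedError

    def retrieve_topk(query: str, k: int) -> list[tuple[str, float]]:
        """Stand-in for a CLIP-style retriever returning (image_id, similarity) pairs."""
        raise NotImplementedError

    def aesthetic_search(user_query: str, k: int = 10) -> list[str]:
        # 1) Let the LLM rephrase the query and spell out the implicit aesthetic expectations.
        rewrite_prompt = (
            "Rewrite this image search query, adding the aesthetic qualities "
            f"(composition, lighting, style) a user would implicitly expect: '{user_query}'"
        )
        expanded_query = llm_complete(rewrite_prompt)

        # 2) Retrieve with both the original and the expanded query, then merge scores
        #    so relevance and aesthetics are balanced (the 0.5 weight is arbitrary here).
        scores: dict[str, float] = {}
        for query, weight in [(user_query, 1.0), (expanded_query, 0.5)]:
            for image_id, sim in retrieve_topk(query, k * 5):
                scores[image_id] = scores.get(image_id, 0.0) + weight * sim

        return sorted(scores, key=scores.get, reverse=True)[:k]
    ```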

    Why I find it interesting:

  1. VisMin: Visual Minimal-Change Understanding

    VisMin is a benchmark, developed by researchers at Mila, which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair and the caption pair each differ only minimally: a single aspect changes at a time, drawn from four possible types of change: object, attribute, count, and spatial relation. These four types of minimal change are specifically designed to test models' understanding of objects, object attributes (such as color, material, and shape), object counts, and spatial relationships between objects.
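
    The evaluation resembles Winoground-style image-text matching, which can be sketched with an off-the-shelf CLIP model from Hugging Face transformers. The file names and the convention that image i matches caption i are assumptions for illustration, not VisMin's actual data layout.

    ```python
    # Sketch of minimal-change image-text matching with CLIP (transformers).
    # File paths and the i-th-image-matches-i-th-caption convention are assumed for illustration.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    images = [Image.open("cat_on_chair.jpg"), Image.open("cat_under_chair.jpg")]
    captions = ["a cat sitting on a chair", "a cat sitting under a chair"]

    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (2 images, 2 captions)

    # The model passes this pair only if each image scores highest with its own caption
    # (image-to-text) and each caption scores highest with its own image (text-to-image).
    image_correct = bool((logits.argmax(dim=1) == torch.tensor([0, 1])).all())
    text_correct = bool((logits.argmax(dim=0) == torch.tensor([0, 1])).all())
    print("image-to-text correct:", image_correct, "| text-to-image correct:", text_correct)
    ```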

    Why I find it interesting:

  2. Multi-Object Hallucination in Vision Language Models

    This work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. The authors introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity.
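
    A rough sketch of the kind of multi-object probe that ROPE automates might look like the following; `query_vlm` is a hypothetical VLM call, and the prompt format is illustrative rather than ROPE's actual protocol.

    ```python
    # Illustrative multi-object probe in the spirit of ROPE (not the official protocol).
    # `query_vlm` is a hypothetical stand-in for any VLM inference call.

    def query_vlm(image_path: str, prompt: str) -> str:
        """Stand-in for a vision-language model answering a prompt about an image."""
        raise NotImplementedError

    def probe_multi_object(image_path: str, boxes: list[tuple[int, int, int, int]],
                           ground_truth: list[str]) -> float:
        # Visual referring prompt: ask about all marked regions in one query so the
        # model must attend to several objects simultaneously.
        regions = "; ".join(f"region {i} at {box}" for i, box in enumerate(boxes))
        prompt = (f"Name the object inside each marked region ({regions}). "
                  "Answer as a comma-separated list.")
        answers = [a.strip().lower() for a in query_vlm(image_path, prompt).split(",")]

        # Per-object accuracy: hallucinations show up as wrong or invented class names.
        correct = sum(a == gt.lower() for a, gt in zip(answers, ground_truth))
        return correct / len(ground_truth)
    ```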

    Why I find it interesting:

  3. Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

    Multimodal LLMs often struggle to process intricate visual details efficiently, unlike humans, who dynamically focus on specific image regions. The pipeline proposed in this work enhances visual CoT reasoning by identifying and attending to key image regions, providing step-by-step interpretability along the way.
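
    The idea of focusing on a key region before answering can be sketched as a two-step loop: ask the model for a bounding box, crop it, then answer from the crop. The `ask_mllm` call and prompt wording below are hypothetical, not the paper's pipeline.

    ```python
    # Two-step visual chain-of-thought sketch: locate a key region, then answer from the crop.
    # `ask_mllm` is a hypothetical multimodal-LLM call; prompts are illustrative only.
    from PIL import Image

    def ask_mllm(image: Image.Image, prompt: str) -> str:
        """Stand-in for a multimodal LLM that answers a prompt about an image."""
        raise NotImplementedError

    def visual_cot_answer(image_path: str, question: str) -> str:
        image = Image.open(image_path)

        # Step 1: ask for the image region most relevant to the question.
        box_reply = ask_mllm(image, f"Which region (x1,y1,x2,y2) is most relevant to: {question}?")
        x1, y1, x2, y2 = (int(v) for v in box_reply.strip("() ").split(","))

        # Step 2: zoom into that region and answer using the crop, keeping the
        # intermediate bounding box as an interpretable reasoning step.
        crop = image.crop((x1, y1, x2, y2))
        answer = ask_mllm(crop, question)
        return f"[focused on ({x1},{y1},{x2},{y2})] {answer}"
    ```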

    Why I find it interesting:

  4. VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

    VITRON is a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Built on top of an LLM backbone, VITRON incorporates encoders for images, videos, and pixel-level regional visuals in its frontend modules and employs state-of-the-art visual specialists as its backend. Through these specialists, VITRON supports a spectrum of vision end tasks, from visual comprehension to visual generation and from low level to high level. This work illuminates the great potential of developing a more unified multimodal generalist.
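
    The frontend-encoder / backend-specialist layout can be pictured as an LLM that reads encoded visuals, emits a task plan, and dispatches it to the matching specialist. The class below is a loose structural sketch under that reading, with hypothetical module names; it is not VITRON's actual code.

    ```python
    # Loose structural sketch of a frontend/backend vision generalist (not VITRON's code).
    # All module names and the plan format are hypothetical placeholders.

    class UnifiedVisionLLM:
        def __init__(self, image_encoder, video_encoder, region_encoder, llm, specialists):
            # Frontend: encoders project images, videos, and region-level visuals into LLM tokens.
            self.encoders = {"image": image_encoder, "video": video_encoder, "region": region_encoder}
            self.llm = llm
            # Backend: task-specific specialists, e.g. {"segment": ..., "edit": ..., "generate": ...}.
            self.specialists = specialists

        def run(self, modality: str, visual_input, instruction: str):
            # 1) Encode the visual input and let the LLM read it together with the instruction.
            tokens = self.encoders[modality](visual_input)
            plan = self.llm(tokens, instruction)  # assumed to return (task_name, task_arguments)

            # 2) Route the LLM's plan to the matching backend specialist for the final output
            #    (a mask, an edited image, a generated video, or a textual answer).
            task, args = plan
            return self.specialists[task](visual_input, **args)
    ```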

    Why I find it interesting:

  5. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

    OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect individual specialists, the authors aim for end-to-end training of one encoder, one decoder, and one LLM.
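
    The "one encoder, one decoder, one LLM" design contrasts with specialist routing: visual and object queries become tokens for the LLM, and the LLM's output in turn drives a single decoder for pixel-level outputs. The sketch below is my reading of that interface with hypothetical components, not the released implementation.

    ```python
    # Sketch of a single-encoder / single-LLM / single-decoder pipeline (a reading, not the release).
    # `encoder`, `llm`, and `decoder` are hypothetical callables.

    def unified_forward(encoder, llm, decoder, image, text_prompt):
        # One encoder produces both image-level tokens and object-centric (pixel-level) tokens.
        visual_tokens, object_tokens = encoder(image)

        # One LLM consumes the text plus all visual tokens and emits a response that may
        # include special tokens referring to objects it wants to ground.
        response_tokens = llm(text_prompt, visual_tokens, object_tokens)

        # One decoder turns those special tokens back into pixel-level masks, so image-,
        # object-, and pixel-level reasoning share a single end-to-end trainable model.
        masks = decoder(response_tokens, object_tokens)
        return response_tokens, masks
    ```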

    Why I find it interesting:

  6. Evaluating Multiview Object Consistency in Humans and Image Models

    The authors introduce a benchmark that directly evaluates the alignment between human observers and vision models on a 3D shape inference task: given a set of images, participants identify which ones contain the same or different objects, despite considerable viewpoint variation.
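
    This odd-one-out style comparison between humans and models can be sketched with any off-the-shelf image encoder: embed each view and predict that the image least similar to the others depicts a different object. The CLIP vision encoder, the file names, and the three-image framing are assumptions for illustration, not necessarily the benchmark's exact setup.

    ```python
    # Odd-one-out sketch: which of three images shows a different object, despite viewpoint changes?
    # CLIP's image encoder and the file names are illustrative; the benchmark's format may differ.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["chair_view1.jpg", "chair_view2.jpg", "stool_view1.jpg"]  # hypothetical files
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

    # The odd one out is the image with the lowest average similarity to the others.
    sim = feats @ feats.T
    avg_sim = (sim.sum(dim=1) - 1.0) / (len(paths) - 1)  # drop the self-similarity of 1
    model_choice = int(avg_sim.argmin())
    print("model picks image", model_choice, "as the different object")
    # Alignment is then measured by agreement between model_choice and human choices.
    ```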

    Why I find it interesting:

  7. FlexCap: Describe Anything in Images in Controllable Detail

    This paper introduces FlexCap, a versatile flexible-captioning VLM capable of generating region-specific descriptions of varying lengths. The authors, from DeepMind, demonstrate that a localize-then-describe approach with FlexCap can outperform the describe-then-localize approach used with other VLMs on open-ended object detection. FlexCap is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.
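
    The length-conditioned, box-conditioned interface is the key idea; a sketch of how a localize-then-describe loop might use it follows. `propose_boxes` and `flexcap_caption` are hypothetical stand-ins, and the word-count length conditioning is illustrative rather than the model's actual input format.

    ```python
    # Localize-then-describe sketch with a length-conditioned region captioner.
    # `propose_boxes` and `flexcap_caption` are hypothetical placeholders, not FlexCap's real API.

    def propose_boxes(image_path: str) -> list[tuple[int, int, int, int]]:
        """Stand-in for a class-agnostic region proposal step (the 'localize' stage)."""
        raise NotImplementedError

    def flexcap_caption(image_path: str, box: tuple[int, int, int, int], length: int) -> str:
        """Stand-in for a captioner conditioned on a box and a target caption length in words."""
        raise NotImplementedError

    def describe_everything(image_path: str) -> dict[tuple[int, int, int, int], dict[str, str]]:
        results = {}
        for box in propose_boxes(image_path):
            results[box] = {
                # Short length target -> roughly an open-vocabulary object label.
                "label": flexcap_caption(image_path, box, length=2),
                # Longer length target -> a detailed region description.
                "detail": flexcap_caption(image_path, box, length=20),
            }
        return results
    ```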

    Why I find it interesting:
