1. Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    The authors aim to align vision models with human aesthetic standards in a retrieval system. To achieve this, they use the reasoning ability of large language models (LLMs) to rephrase the search query and expand it with the aesthetic expectations it implies. Experiments demonstrate that this method significantly improves the aesthetic behavior of vision models under several metrics. The proposed algorithm can serve as a general practice for aligning vision models with human values.
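
    The core idea, rewriting the query with an LLM before retrieval and blending the results, is easy to picture in code. Below is a minimal sketch, not the authors' implementation: `llm_complete` and `retrieve_topk` are hypothetical stand-ins for an LLM call and a CLIP-style retrieval backend, and the score weighting is purely illustrative.

    ```python
    # Minimal sketch of LLM-based query expansion for aesthetics-aware retrieval.
    # `llm_complete` and `retrieve_topk` are hypothetical placeholders, not the paper's API.

    def llm_complete(prompt: str) -> str:
        """Stand-in for a call to an instruction-tuned LLM."""
        raise NotImplementedError

    def retrieve_topk(query: str, k: int) -> list[tuple[str, float]]:
        """Stand-in for a CLIP-style retriever returning (image_id, similarity) pairs."""
        raise NotImplementedError

    def aesthetic_search(user_query: str, k: int = 10) -> list[str]:
        # 1) Let the LLM rephrase the query and spell out the implicit aesthetic expectations.
        rewrite_prompt = (
            "Rewrite this image search query, adding the aesthetic qualities "
            f"(composition, lighting, style) a user would implicitly expect: '{user_query}'"
        )
        expanded_query = llm_complete(rewrite_prompt)

        # 2) Retrieve with both the original and the expanded query, then merge scores
        #    so relevance and aesthetics are balanced (the 0.5 weight is arbitrary here).
        scores: dict[str, float] = {}
        for query, weight in [(user_query, 1.0), (expanded_query, 0.5)]:
            for image_id, sim in retrieve_topk(query, k * 5):
                scores[image_id] = scores.get(image_id, 0.0) + weight * sim

        return sorted(scores, key=scores.get, reverse=True)[:k]
    ```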

    Why I find it interesting:

  1. VisMin: Visual Minimal-Change Understanding

    VisMin is a benchmark, developed by researchers at Mila, which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair and the caption pair each differ only minimally: a single aspect changes at a time, drawn from four possible types of change: object, attribute, count, and spatial relation. These four types of minimal change are specifically designed to test models' understanding of objects, object attributes (such as color, material, and shape), object counts, and spatial relationships between objects.
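
    The evaluation resembles Winoground-style image-text matching, which can be sketched with an off-the-shelf CLIP model from Hugging Face transformers. The file names and the convention that image i matches caption i are assumptions for illustration, not VisMin's actual data layout.

    ```python
    # Sketch of minimal-change image-text matching with CLIP (transformers).
    # File paths and the i-th-image-matches-i-th-caption convention are assumed for illustration.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    images = [Image.open("cat_on_chair.jpg"), Image.open("cat_under_chair.jpg")]
    captions = ["a cat sitting on a chair", "a cat sitting under a chair"]

    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (2 images, 2 captions)

    # The model passes this pair only if each image scores highest with its own caption
    # (image-to-text) and each caption scores highest with its own image (text-to-image).
    image_correct = bool((logits.argmax(dim=1) == torch.tensor([0, 1])).all())
    text_correct = bool((logits.argmax(dim=0) == torch.tensor([0, 1])).all())
    print("image-to-text correct:", image_correct, "| text-to-image correct:", text_correct)
    ```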

    Why I find it interesting:

  2. Multi-Object Hallucination in Vision Language Models

    This work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. The authors introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity.
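
    A rough sketch of the kind of multi-object probe that ROPE automates might look like the following; `query_vlm` is a hypothetical VLM call, and the prompt format is illustrative rather than ROPE's actual protocol.

    ```python
    # Illustrative multi-object probe in the spirit of ROPE (not the official protocol).
    # `query_vlm` is a hypothetical stand-in for any VLM inference call.

    def query_vlm(image_path: str, prompt: str) -> str:
        """Stand-in for a vision-language model answering a prompt about an image."""
        raise NotImplementedError

    def probe_multi_object(image_path: str, boxes: list[tuple[int, int, int, int]],
                           ground_truth: list[str]) -> float:
        # Visual referring prompt: ask about all marked regions in one query so the
        # model must attend to several objects simultaneously.
        regions = "; ".join(f"region {i} at {box}" for i, box in enumerate(boxes))
        prompt = (f"Name the object inside each marked region ({regions}). "
                  "Answer as a comma-separated list.")
        answers = [a.strip().lower() for a in query_vlm(image_path, prompt).split(",")]

        # Per-object accuracy: hallucinations show up as wrong or invented class names.
        correct = sum(a == gt.lower() for a, gt in zip(answers, ground_truth))
        return correct / len(ground_truth)
    ```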

    Why I find it interesting:

  3. Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

    Multimodal LLMs often struggle to process intricate visual details efficiently, unlike humans, who dynamically focus on specific image regions. The pipeline proposed in this work enhances visual CoT reasoning by identifying and attending to key image regions, providing step-by-step interpretability along the way.
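
    The idea of focusing on a key region before answering can be sketched as a two-step loop: ask the model for a bounding box, crop it, then answer from the crop. The `ask_mllm` call and prompt wording below are hypothetical, not the paper's pipeline.

    ```python
    # Two-step visual chain-of-thought sketch: locate a key region, then answer from the crop.
    # `ask_mllm` is a hypothetical multimodal-LLM call; prompts are illustrative only.
    from PIL import Image

    def ask_mllm(image: Image.Image, prompt: str) -> str:
        """Stand-in for a multimodal LLM that answers a prompt about an image."""
        raise NotImplementedError

    def visual_cot_answer(image_path: str, question: str) -> str:
        image = Image.open(image_path)

        # Step 1: ask for the image region most relevant to the question.
        box_reply = ask_mllm(image, f"Which region (x1,y1,x2,y2) is most relevant to: {question}?")
        x1, y1, x2, y2 = (int(v) for v in box_reply.strip("() ").split(","))

        # Step 2: zoom into that region and answer using the crop, keeping the
        # intermediate bounding box as an interpretable reasoning step.
        crop = image.crop((x1, y1, x2, y2))
        answer = ask_mllm(crop, question)
        return f"[focused on ({x1},{y1},{x2},{y2})] {answer}"
    ```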

    Why I find it interesting:

  4. VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

    VITRON is a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Built on top of an LLM backbone, VITRON incorporates encoders for images, videos, and pixel-level regional visuals in its frontend modules and employs state-of-the-art visual specialists as its backend. Through these specialists, VITRON supports a spectrum of vision end tasks, from visual comprehension to visual generation and from low level to high level. This work illuminates the great potential of developing a more unified multimodal generalist.
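
    The frontend-encoder / backend-specialist layout can be pictured as an LLM that reads encoded visuals, emits a task plan, and dispatches it to the matching specialist. The class below is a loose structural sketch under that reading, with hypothetical module names; it is not VITRON's actual code.

    ```python
    # Loose structural sketch of a frontend/backend vision generalist (not VITRON's code).
    # All module names and the plan format are hypothetical placeholders.

    class UnifiedVisionLLM:
        def __init__(self, image_encoder, video_encoder, region_encoder, llm, specialists):
            # Frontend: encoders project images, videos, and region-level visuals into LLM tokens.
            self.encoders = {"image": image_encoder, "video": video_encoder, "region": region_encoder}
            self.llm = llm
            # Backend: task-specific specialists, e.g. {"segment": ..., "edit": ..., "generate": ...}.
            self.specialists = specialists

        def run(self, modality: str, visual_input, instruction: str):
            # 1) Encode the visual input and let the LLM read it together with the instruction.
            tokens = self.encoders[modality](visual_input)
            plan = self.llm(tokens, instruction)  # assumed to return (task_name, task_arguments)

            # 2) Route the LLM's plan to the matching backend specialist for the final output
            #    (a mask, an edited image, a generated video, or a textual answer).
            task, args = plan
            return self.specialists[task](visual_input, **args)
    ```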

    Why I find it interesting:

  5. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

    OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect individual specialists, the authors aim for end-to-end training of one encoder, one decoder, and one LLM.
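
    The "one encoder, one decoder, one LLM" design contrasts with specialist routing: visual and object queries become tokens for the LLM, and the LLM's output in turn drives a single decoder for pixel-level outputs. The sketch below is my reading of that interface with hypothetical components, not the released implementation.

    ```python
    # Sketch of a single-encoder / single-LLM / single-decoder pipeline (a reading, not the release).
    # `encoder`, `llm`, and `decoder` are hypothetical callables.

    def unified_forward(encoder, llm, decoder, image, text_prompt):
        # One encoder produces both image-level tokens and object-centric (pixel-level) tokens.
        visual_tokens, object_tokens = encoder(image)

        # One LLM consumes the text plus all visual tokens and emits a response that may
        # include special tokens referring to objects it wants to ground.
        response_tokens = llm(text_prompt, visual_tokens, object_tokens)

        # One decoder turns those special tokens back into pixel-level masks, so image-,
        # object-, and pixel-level reasoning share a single end-to-end trainable model.
        masks = decoder(response_tokens, object_tokens)
        return response_tokens, masks
    ```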

    Why I find it interesting:

  6. Evaluating Multiview Object Consistency in Humans and Image Models

    The authors introduce a benchmark that directly evaluates the alignment between human observers and vision models on a 3D shape inference task: given a set of images, participants identify which ones contain the same or different objects, despite considerable viewpoint variation.
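
    This odd-one-out style comparison between humans and models can be sketched with any off-the-shelf image encoder: embed each view and predict that the image least similar to the others depicts a different object. The CLIP vision encoder, the file names, and the three-image framing are assumptions for illustration, not necessarily the benchmark's exact setup.

    ```python
    # Odd-one-out sketch: which of three images shows a different object, despite viewpoint changes?
    # CLIP's image encoder and the file names are illustrative; the benchmark's format may differ.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["chair_view1.jpg", "chair_view2.jpg", "stool_view1.jpg"]  # hypothetical files
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

    # The odd one out is the image with the lowest average similarity to the others.
    sim = feats @ feats.T
    avg_sim = (sim.sum(dim=1) - 1.0) / (len(paths) - 1)  # drop the self-similarity of 1
    model_choice = int(avg_sim.argmin())
    print("model picks image", model_choice, "as the different object")
    # Alignment is then measured by agreement between model_choice and human choices.
    ```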

    Why I find it interesting:

  7. FlexCap: Describe Anything in Images in Controllable Detail

    This paper introduces FlexCap, a versatile flexible-captioning VLM capable of generating region-specific descriptions of varying lengths. The authors, from DeepMind, demonstrate that a localize-then-describe approach with FlexCap can outperform the describe-then-localize approach used with other VLMs on open-ended object detection. FlexCap is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.
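
    The length-conditioned, box-conditioned interface is the key idea; a sketch of how a localize-then-describe loop might use it follows. `propose_boxes` and `flexcap_caption` are hypothetical stand-ins, and the word-count length conditioning is illustrative rather than the model's actual input format.

    ```python
    # Localize-then-describe sketch with a length-conditioned region captioner.
    # `propose_boxes` and `flexcap_caption` are hypothetical placeholders, not FlexCap's real API.

    def propose_boxes(image_path: str) -> list[tuple[int, int, int, int]]:
        """Stand-in for a class-agnostic region proposal step (the 'localize' stage)."""
        raise NotImplementedError

    def flexcap_caption(image_path: str, box: tuple[int, int, int, int], length: int) -> str:
        """Stand-in for a captioner conditioned on a box and a target caption length in words."""
        raise NotImplementedError

    def describe_everything(image_path: str) -> dict[tuple[int, int, int, int], dict[str, str]]:
        results = {}
        for box in propose_boxes(image_path):
            results[box] = {
                # Short length target -> roughly an open-vocabulary object label.
                "label": flexcap_caption(image_path, box, length=2),
                # Longer length target -> a detailed region description.
                "detail": flexcap_caption(image_path, box, length=20),
            }
        return results
    ```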

    Why I find it interesting:
