Apple's research team has taken a big step forward with its new "MM1" multimodal large language model. The work is detailed in a recent paper titled "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", and it showcases a model with impressive capabilities in both image understanding and natural language reasoning.

MM1 comes in three sizes: 3 billion, 7 billion, and 30 billion parameters. The researchers used these models to run ablation experiments, pinpointing the factors that most influence performance. Interestingly, image resolution and the number of image tokens have a greater impact than the design of the vision-language connector, and the composition of the pre-training data significantly affects the model's effectiveness.

The research team built MM1 using a mixture-of-experts (MoE) architecture with top-2 gating. This approach not only yielded excellent results on pre-training benchmarks, but also translated into strong performance on established multimodal benchmarks. Even after fine-tuning for specific tasks, the MM1 models remained competitive.
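
To give a rough idea of what top-2 gating means in practice, here is a minimal, hypothetical sketch of a mixture-of-experts feed-forward layer: a small router scores the experts for each token, and only the two highest-scoring experts process that token. The layer sizes, expert count, and class names below are illustrative assumptions, not Apple's actual MM1 configuration.

```python
# Minimal sketch of a feed-forward block with top-2 gated mixture-of-experts
# routing. Sizes and names are illustrative, not MM1's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, num_experts)
        weights, idx = logits.topk(2, dim=-1)           # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(2):                           # combine the two expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of 16 token embeddings through the sparse layer.
tokens = torch.randn(16, 512)
print(Top2MoE()(tokens).shape)   # torch.Size([16, 512])
```

The appeal of this kind of routing is that each token only pays the compute cost of two experts, so total parameter count can grow without a proportional increase in per-token FLOPs.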

Testing revealed that the MM1-3B-Chat and MM1-7B-Chat models outperform most similarly sized competitors. These models particularly shine in tasks like VQAv2 (question answering about an image), TextVQA (answering questions about text within an image), and ScienceQA (scientific question answering). However, MM1's overall performance doesn't yet surpass Google's Gemini or OpenAI's GPT-4V. While MM1 may not be the absolute leader, it is still a significant leap forward for Apple in artificial intelligence. The company also recently acquired DarwinAI.
