Apple's research team has taken a big step forward with its new "MM1" multimodal large language model. The work is detailed in a recent paper titled "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", and it showcases a model with impressive capabilities in both image understanding and natural language reasoning.

MM1 comes in three sizes: 3 billion, 7 billion, and 30 billion parameters. The researchers used these models to run ablation experiments, pinpointing the factors that most influence performance. Interestingly, image resolution and the number of image tokens have a greater impact than the design of the vision-language connector, and the composition of the pre-training data significantly affects the model's effectiveness.

The research team built MM1 using a mixture-of-experts (MoE) architecture with top-2 gating. This approach not only yielded excellent results on pre-training benchmarks, but also translated into strong performance on established multimodal benchmarks. Even after fine-tuning for specific tasks, the MM1 models remained competitive.
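
To give a rough idea of what top-2 gating means in practice, here is a minimal, hypothetical sketch of a mixture-of-experts feed-forward layer: a small router scores the experts for each token, and only the two highest-scoring experts process that token. The layer sizes, expert count, and class names below are illustrative assumptions, not Apple's actual MM1 configuration.

```python
# Minimal sketch of a feed-forward block with top-2 gated mixture-of-experts
# routing. Sizes and names are illustrative, not MM1's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, num_experts)
        weights, idx = logits.topk(2, dim=-1)           # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(2):                           # combine the two expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of 16 token embeddings through the sparse layer.
tokens = torch.randn(16, 512)
print(Top2MoE()(tokens).shape)   # torch.Size([16, 512])
```

The appeal of this kind of routing is that each token only pays the compute cost of two experts, so total parameter count can grow without a proportional increase in per-token FLOPs.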

Testing revealed that the MM1-3B-Chat and MM1-7B-Chat models outperform most similarly sized competitors. These models particularly shine in tasks like VQAv2 (question answering about an image), TextVQA (answering questions about text within an image), and ScienceQA (scientific question answering). However, MM1's overall performance doesn't yet surpass Google's Gemini or OpenAI's GPT-4V. While MM1 may not be the absolute leader, it is still a significant leap forward for Apple in artificial intelligence. The company also recently acquired DarwinAI.
