ReALM: Apple’s New AI Model Outperforms GPT-4 in Contextual Data Parsing

In the realm of Artificial Intelligence (AI), Apple continues to make significant strides. The tech giant’s AI research has recently unveiled a model that promises to revolutionize the way we interact with Siri, Apple’s intelligent personal assistant. This new model, known as ReALM (Reference Resolution As Language Modeling), is designed to convert any given context into text, making it easier for Large Language Models (LLMs) to parse.

The Advent of ReALM

Apple has been publishing AI research at a steady pace as the company gears up for a public launch of its AI initiatives at the Worldwide Developers Conference (WWDC) in June. The research spans a wide range of topics, including an image animation tool. The latest paper, first reported by VentureBeat, focuses on the development of ReALM.

Reference Resolution: A Complex Issue

Reference resolution is the task of having a computer program act on vague language inputs, such as a user saying “this” or “that.” It’s a difficult problem because computers can’t interpret what’s on a screen the way humans can. However, Apple appears to have found a streamlined approach using LLMs.

ReALM: A Game Changer for Siri

You often provide context-dependent information when you interact with smart assistants like Siri. For example, you might say “Play the song I was listening to yesterday” or “Call the restaurant I visited last week”. Here, “the song I was listening to yesterday” and “the restaurant I visited last week” are pieces of contextual information. They refer to specific entities based on your past actions or the current state of your device.


Traditionally, to understand and act upon these kinds of requests, smart assistants have to rely on large models that can process a wide range of inputs. These models often need to reference various types of data, including images. For instance, if you say “Email the photo I just took to John”, the assistant needs to understand which photo you’re referring to and who John is.

However, these traditional methods have their limitations. They require extensive computational resources due to the large size of the models and the complexity of the data they need to process. Moreover, they might not always be accurate, especially when dealing with ambiguous references.

This is where Apple’s new model, ReALM, comes into play. Instead of relying on large models and diverse reference materials, ReALM converts all contextual information into text. This approach simplifies the task for the language model as it now only needs to process textual data.

For example, instead of processing an image to understand the request “Email this photo to John”, ReALM would convert the image into a textual description, such as “a photo of a sunset over the ocean”. The request then becomes “Email the photo of a sunset over the ocean to John”, which is easier for the language model to understand and act upon.
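To make the idea concrete, here is a minimal sketch of the “convert context to text” approach in Python. It is not Apple’s implementation: the `describe_image` captioner, the `Entity` structure, and the prompt format are all illustrative assumptions, but they show how contextual items can be flattened into plain text for a language model to parse.

```python
# Minimal sketch of converting contextual entities into plain text (illustrative only).
# describe_image() is a hypothetical captioning step, not a real API.

from dataclasses import dataclass


@dataclass
class Entity:
    kind: str   # e.g. "photo", "contact", "phone_number"
    text: str   # textual representation of the entity


def describe_image(image_path: str) -> str:
    """Hypothetical captioner: returns a short textual description of an image."""
    return "a photo of a sunset over the ocean"  # placeholder caption


def build_prompt(request: str, entities: list[Entity]) -> str:
    """Flatten all contextual entities into numbered lines of text for the LLM."""
    lines = [f"[{i}] {e.kind}: {e.text}" for i, e in enumerate(entities)]
    return (
        "Context:\n" + "\n".join(lines)
        + f"\n\nUser request: {request}\nWhich entity does the request refer to?"
    )


entities = [
    Entity("photo", describe_image("IMG_0042.jpg")),
    Entity("contact", "John (mobile: 555-0100)"),
]
print(build_prompt("Email this photo to John", entities))
```

Once everything is text, the reference resolution step reduces to picking the right numbered entity, which is a much lighter task than reasoning over raw images.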

This approach makes the process of parsing contextual data more efficient and less resource-intensive, potentially leading to faster and more accurate responses from Siri. It’s a significant step forward in the field of AI and has the potential to greatly improve the user experience with smart assistants.

ReALM vs. GPT-4: A Comparative Analysis

Apple found that its smallest ReALM models performed comparably to GPT-4 while using far fewer parameters, making them better suited for on-device use. Increasing the parameter count allowed ReALM to outperform GPT-4 by a substantial margin.

The performance gap can be attributed in part to GPT-4’s reliance on image parsing to comprehend on-screen information. Much of its image training data consists of natural imagery rather than artificial, code-based web pages filled with text, which makes direct Optical Character Recognition (OCR) less effective.

The Efficiency of ReALM

By converting an image into text, ReALM can bypass the need for these advanced image-recognition parameters, making it smaller and more efficient. Apple also mitigates hallucination by constraining decoding or applying simple post-processing.
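The following is a hedged sketch of one way such a constraint could work: the model’s answer is only accepted if it names an entity that actually exists in the provided context. This is a simple post-processing check for illustration, not necessarily Apple’s exact mechanism.

```python
# Illustrative post-processing constraint: reject answers that point to
# entities outside the known candidate list (one way to curb hallucination).

import re
from typing import Optional


def resolve_with_constraints(model_output: str, num_entities: int) -> Optional[int]:
    """Accept the model's answer only if it names an entity that actually exists."""
    match = re.search(r"\d+", model_output)
    if match is None:
        return None                      # model produced no entity index
    index = int(match.group())
    if 0 <= index < num_entities:
        return index                     # valid entity from the context
    return None                          # hallucinated reference is rejected


print(resolve_with_constraints("The user means entity [1].", num_entities=2))  # -> 1
print(resolve_with_constraints("Entity [7]", num_entities=2))                  # -> None
```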

For instance, if a user is scrolling through a website and decides to call the business, simply saying “call the business” requires Siri to work out what the user means from the context. Siri would be able to “see” that the page contains a phone number labeled as the business number and call it without further prompting.
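As a toy illustration of that scenario, the screen can be pre-parsed into labeled elements, serialized as text, and the spoken reference resolved against that listing. The element names and the naive phone-number heuristic below are assumptions standing in for the LLM step, not Apple’s API.

```python
# Toy illustration of the on-screen "call the business" case (assumed structure).

screen_elements = [
    {"label": "Business name", "value": "Blue Fin Sushi"},
    {"label": "Business phone", "value": "555-0123"},
    {"label": "Opening hours", "value": "11am-10pm"},
]


def screen_as_text(elements):
    """Serialize the parsed screen into numbered lines of plain text."""
    return "\n".join(f"[{i}] {e['label']}: {e['value']}" for i, e in enumerate(elements))


def resolve_call_target(elements):
    """Naive stand-in for the LLM step: pick the element that looks like a phone number."""
    for e in elements:
        if "phone" in e["label"].lower():
            return e["value"]
    return None


print(screen_as_text(screen_elements))
print("Calling:", resolve_call_target(screen_elements))
```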

Apple’s Future AI Strategy

Apple is planning to unveil a comprehensive AI strategy at WWDC 2024. Rumors suggest the company will rely on smaller, privacy-preserving on-device models while licensing other companies’ LLMs for the more ethically fraught off-device processing. This strategy, coupled with the development of models like ReALM, underscores Apple’s commitment to advancing AI research and improving the user experience.