Apple hasn’t given up on spatial computing, judging by its studies.
Apple’s interest in AI models and their applications in spatial computing shows no signs of slowing down, even though some claim the Apple Vision Pro is dead.
In April 2026, it was argued that the Apple Vision Pro was a complete failure and that we would therefore never see a successor product. That rumor, which always seemed unreasonable, has since been called into question.
Although the company’s Vision Products group has seen some changes, there is still hope for a new generation of Apple Vision Pro. Apple’s AI research suggests the company hasn’t abandoned its spatial computing plans.
On the contrary, new studies published on the Apple Machine Learning blog explore the use of LLMs in sign language annotation, 3D head modeling, and more. Apple researchers also developed a new benchmarking system to assess the spatial-functional intelligence of LLMs.
Comparative analysis of spatial-functional intelligence for multimodal LLMs
The article titled “From Where Things Are to What They Are For: Comparative Analysis of Spatial and Functional Intelligence for Multimodal LLMs” presents a new testing and scoring system for MLLMs.
Apple researchers developed a benchmarking framework that tests the spatial reasoning capabilities of MLLMs. Image credit: Apple
As the study explains, to mimic human understanding of a space and its objects, AI models rely on two distinct structures. These include “a representation that captures object arrangements and relational structure, and a functional representation that encodes possibilities, purposes, and use based on context.”
In other words, a multimodal LLM must understand the geometry of a particular space, as well as the purpose and location of objects within it. Apple researchers say existing benchmarking methods, such as VSI-Bench, only test the first aspect, largely ignoring the second.
To address this, they developed the Spatial-Functional Intelligence Benchmark, abbreviated as SFI-Bench. It is described as a video-based benchmark with 1,555 expert-annotated questions derived from analyses of 134 indoor videos.
As for what SFI-Bench specifically tests, the study explains it quite simply:
“Beyond spatial cognition, SFI-Bench incorporates functional and knowledge-based reasoning, checking whether models understand what objects in the scene are used for, how they are operated, and how faults can be diagnosed.”
In other words, the benchmark tests whether AI models understand what an object is, where it is located, how it is used, what it is for and how it can be repaired.
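The article doesn’t publish the benchmark’s data format, but a question entry in a video-based QA benchmark of this kind could be represented roughly as follows. The field names, categories, and example values here are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass

# Hypothetical sketch of an SFI-Bench-style question record.
# Field names and example values are illustrative; the actual schema
# is not described in the article.
@dataclass
class SFIBenchItem:
    video_id: str        # one of the 134 indoor walkthrough videos
    question: str        # expert-annotated question about the scene
    category: str        # e.g. "conditional_counting", "object_function",
                         # "operation", "fault_diagnosis"
    options: list[str]   # multiple-choice answers, if applicable
    answer: str          # ground-truth answer

example = SFIBenchItem(
    video_id="kitchen_017",
    question="What is the largest number of bottles of the same brand on the cabinet?",
    category="conditional_counting",
    options=["2", "3", "4", "5"],
    answer="3",
)
```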
Apple AI researchers tested the ability of LLMs to understand the world around them. Image credit: Apple.
If this sounds familiar, that’s because Google has had tools with this kind of spatial awareness since at least 2024. At its I/O conference that same year, Google’s AI model correctly identified an object in front of it as a record player and even suggested how to repair the device.
In practice, SFI-Bench would be used to test similar and more advanced AI models. The tests mentioned include asking an LLM to identify the largest subset of same-brand bottles on a cabinet, asking how to cancel the current program on a washing machine, or asking what a TV remote control is used for, among many others.
Apple researchers tested several open-source and proprietary AI models with their SFI-Bench framework. Unsurprisingly, Google’s Gemini 3.1 Pro had the best overall result, OpenAI’s GPT-5.4-High took second place, and Gemini 3.1 Flash-Lite came in third.
However, the study notes that “across all models, overall conditional counting emerges as a key bottleneck, revealing persistent limitations in compositional and logical reasoning.”
In other words, most current MLLMs “struggle with spatial memory, the integration of functional knowledge, and the link between perception and external knowledge.” Nonetheless, the study noted that models with internet access performed better than offline-only models.
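To make the “conditional counting” bottleneck concrete, the bottle question mentioned earlier reduces to a filter-then-count operation. The snippet below is a minimal, illustrative sketch (the detections are made up, not taken from the benchmark) of what a model ultimately has to compute from the video:

```python
from collections import Counter

# Made-up detections for the bottle example: (object, attribute) pairs a
# model would have to extract from the video before it can count anything.
detections = [
    ("bottle", "BrandA"), ("bottle", "BrandB"), ("bottle", "BrandA"),
    ("bottle", "BrandA"), ("cup", None), ("bottle", "BrandB"),
]

# Conditional counting: count only bottles, group by brand, take the maximum.
brand_counts = Counter(brand for obj, brand in detections if obj == "bottle")
largest_same_brand_subset = max(brand_counts.values())
print(largest_same_brand_subset)  # -> 3
```

The operation is trivial in code, but it requires a model to combine recognition, grouping, and arithmetic in a single pass, which is where the study says current MLLMs fall down.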
As for potential applications in iOS, we could see Apple unveil a version of Siri with both spatial and contextual awareness. This would make sense, given that the company has partnered with Google for Apple Intelligence features.
It remains to be seen, however, if and when this will debut, or to what extent the AI might work.
Using AI models for sign language annotation
In a separate study, titled “Priming Sign Language Annotations with Sign Language Models,” Apple researchers explored how AI could be used to annotate sign language videos.
Apple researchers have been exploring the use of AI for ASL annotation. Image credit: Apple
The company’s research team claims to have developed a “pseudo-annotation pipeline that takes signed video and English as input and generates a ranked set of likely annotations, including time intervals, for glosses, finger-spelled words, and sign classifiers.”
In doing so, they seek to reduce the time and cost of manually annotating hundreds of hours of sign language. This approach involved creating “simple but effective basic fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and ASL Citizen datasets (74% top-1 accuracy).”
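For context on the 6.7% figure, character error rate (CER) is the edit distance between a predicted fingerspelled string and the reference string, divided by the reference length. A minimal implementation of the metric (our own sketch, not Apple’s code) looks like this:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the strings, normalized by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table of edit distances between prefixes.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# A 6.7% CER corresponds to roughly one character error per 15 reference characters.
print(character_error_rate("FINGERSPELL", "FINGERSPELL"))  # 0.0
print(character_error_rate("APPLE", "APPLF"))              # 0.2
```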
Apple researchers also created nearly 500 manual gloss-to-English annotations, validated through back-translation, alongside manual and pseudo-annotations covering over 300 hours of ASL STEM Wiki and 7.5 hours of FLOWERS-ASL.
For testing, Claude Sonnet 4.5 was given a variation of a gloss-to-English prompt and had to translate the manual ASL STEM Wiki annotations into the English reference text interpreted by the signers.
The study notes that “the errors mainly concerned cases where a sentence did not include any fingerspelling.” Although more work remains to be done, the researchers say their approach to fingerspelling recognition and isolated sign recognition “can be trained with modest GPU resources and could also be used for further iterations on pseudo-annotation pipelines.”
As for why Apple is pursuing this, it might have something to do with the long-rumored camera-equipped AirPods. The company may be considering expanding its live translation feature to include sign language.
3D Gaussian head reconstruction from multi-view captures
Another study titled “High-quality, large-scale 3D Gaussian head reconstruction from multi-view captures” explores how head models can be created from images using AI.
Apple researchers have explored how AI can be used to create 3D head models from multi-view captures. Image credit: Apple.
Apple researchers developed “HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups.”
Essentially, the study explores how different head views can be converted into Gaussian blobs and then 3D models via a series of encoders and decoders.
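The article doesn’t describe the representation in detail, but 3D Gaussian methods in general model a surface as a large set of anisotropic Gaussian primitives, each with a position, scale, rotation, opacity, and color; a feed-forward reconstructor along these lines would predict those parameters from the encoded multi-view images. A minimal, illustrative sketch of such a primitive (not HeadsUp’s actual data structure) is:

```python
from dataclasses import dataclass

# Illustrative parameters of a single 3D Gaussian primitive, as typically used
# in Gaussian-splatting representations. HeadsUp's actual parameterization is
# not described in the article.
@dataclass
class Gaussian3D:
    position: tuple[float, float, float]          # center of the blob in 3D space
    scale: tuple[float, float, float]             # per-axis extent (anisotropic)
    rotation: tuple[float, float, float, float]   # orientation as a quaternion
    opacity: float                                # how strongly the blob occludes
    color: tuple[float, float, float]             # RGB color of the blob

# A reconstructed head is then just a large collection of these primitives,
# predicted in a single pass from the encoded multi-view images.
head_model: list[Gaussian3D] = [
    Gaussian3D((0.0, 0.1, 0.05), (0.01, 0.01, 0.008), (1.0, 0.0, 0.0, 0.0), 0.9, (0.8, 0.6, 0.5)),
]
```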
To test their image-to-3D-model method, the study’s authors used “an in-house dataset with over 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets.” The 3D head models were also animated using expression blend shapes.
Overall, the study explains that “HeadsUp achieves state-of-the-art reconstruction quality and generalizes to new identities without test time optimization.”
In terms of practical applications, the study could be linked to the Apple Vision Pro and its Persona functionality. Apple may be looking for ways to improve how expressions are rendered or how faces themselves are captured and rendered in visionOS.
There may also be applications related to hardware or comfort. During development of the headset, woozad learned that the company included different types of 3D heads alongside the Apple Vision Pro models.
Time will tell what Apple does with its researchers’ findings. Even if we have to wait to see what its next product will be, one thing is certain: the company is not backing down when it comes to AI and spatial computing.
Apple is set to announce iOS 27 and corresponding operating system updates at WWDC 2026, which begins on June 8.