Meta Unveils Generative Ads Recommendation Model (GEM), Claims 5% Increase In Ad Conversions

Generative AI isn’t just for producing text and images; the techniques behind it are finding use cases in other fields as well.

Meta has announced its Generative Ads Recommendation Model (GEM), a large-scale foundation model designed to optimize ad delivery across Facebook and Instagram. The company reports that GEM has delivered a 5% increase in ad conversions on Instagram and a 3% increase on Facebook Feed since its launch in Q2 of this year. In Q3, architectural improvements doubled the performance gains achievable from the same amount of data and compute, positioning the model for continued scaling.

The Challenge: Billions of Interactions, Sparse Signals

GEM addresses several fundamental challenges in Meta’s advertising ecosystem. Every day, billions of user-ad interactions occur across Meta’s platforms, but meaningful signals like clicks and conversions are remarkably sparse. The model must process diverse data types including advertiser goals, creative formats, measurement signals, and user behaviors across multiple delivery channels. Additionally, training at this scale requires efficiently coordinating thousands of GPUs using advanced parallelism and system-level optimizations.

Building GEM: Architecture and Training

GEM represents what Meta describes as the largest foundation model for recommendation systems in the industry, trained at a scale comparable to large language models. The model is trained on both ad content and user engagement data from advertisements and organic interactions across Meta’s platforms.

The architecture organizes input features into two categories: sequence features (such as user activity history) and non-sequence features (such as user demographics, location, ad format, and creative elements). Custom attention mechanisms process each group independently while enabling cross-feature learning.
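
To make the split concrete, here is a minimal sketch in plain Python: a handful of non-sequence features beside a short event sequence, with one dot-product attention head summarizing the sequence. All feature names, dimensions, and values here are illustrative assumptions, not Meta’s actual schema.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention over a sequence of events."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Non-sequence features: fixed per-request values (hypothetical names).
non_sequence = {"age_bucket": 3, "country": 7, "ad_format": 1}

# Sequence features: each past event embedded as a small vector.
event_embeddings = [
    [0.2, 0.1, 0.0],   # viewed ad
    [0.9, 0.4, 0.3],   # clicked ad
    [0.8, 0.5, 0.9],   # converted
]

query = [1.0, 0.5, 0.5]  # e.g. derived from the candidate ad
summary = attend(query, event_embeddings, event_embeddings)
print([round(x, 3) for x in summary])
```

The two groups are processed by different machinery (the sequence through attention, the fixed features through interaction layers), then combined downstream.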

Several architectural innovations enable GEM’s scaling efficiency. For non-sequence features, the model builds on Meta’s Wukong architecture using stackable factorization machines with cross-layer attention, allowing it to identify which feature combinations are most predictive. For sequence features, GEM employs a pyramid-parallel structure that can process thousands of user interaction events, capturing long-term behavior patterns that reveal purchase intent.
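
The mention of factorization machines can be grounded in the textbook FM second-order term, which Wukong-style stacked layers generalize. The sketch below is illustrative only (Wukong’s actual cross-layer attention is more involved) and shows the standard linear-time identity for scoring all pairwise feature crosses:

```python
# Classic second-order factorization machine (FM) interaction term.
# Wukong-style stacked FMs build on this idea; the code below is the
# textbook FM, shown only to illustrate how feature crosses are scored.

def fm_pairwise(x, V):
    """Sum over all feature pairs (i, j) of <V[i], V[j]> * x[i] * x[j].

    Uses the O(n*k) identity:
      0.5 * sum_f [ (sum_i V[i][f]*x[i])^2 - sum_i (V[i][f]*x[i])^2 ]
    """
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total

def fm_pairwise_naive(x, V):
    """Same quantity by explicit O(n^2) pairwise enumeration."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(len(V[0])))
            total += dot * x[i] * x[j]
    return total

x = [1.0, 0.0, 2.0, 1.0]            # feature values (made up)
V = [[0.1, 0.2], [0.3, 0.1],        # one latent vector per feature
     [0.2, 0.4], [0.5, 0.1]]
assert abs(fm_pairwise(x, V) - fm_pairwise_naive(x, V)) < 1e-9
```

The linear-time form is what makes it feasible to stack such layers and learn which feature combinations are predictive at scale.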

A key innovation called InterFormer enables cross-feature learning without compressing user behavior sequences into simplified vectors. This preserves the full richness of user interaction data while maintaining computational efficiency. The model also implements multi-domain learning that allows it to leverage insights across Facebook, Instagram, and Business Messaging while tailoring predictions to each platform’s unique characteristics.
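
The point about not compressing sequences can be illustrated with a toy contrast, using made-up vectors: mean-pooling two different histories can produce identical summaries, while attending over the full sequence keeps them distinguishable. This is a simplified analogy for what InterFormer preserves, not its actual mechanism.

```python
import math

def mean_pool(seq):
    """Compress a sequence into one vector (the lossy shortcut)."""
    d = len(seq[0])
    return [sum(v[i] for v in seq) / len(seq) for i in range(d)]

def cross_attend(query, seq):
    """Let a non-sequence-derived query attend over the full sequence."""
    scores = [sum(q * k for q, k in zip(query, v)) / math.sqrt(len(query))
              for v in seq]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    d = len(seq[0])
    return [sum(wi * v[i] for wi, v in zip(w, seq)) for i in range(d)]

# Two different user histories that pool to the same mean vector...
seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.5, 0.5], [0.5, 0.5]]
assert mean_pool(seq_a) == mean_pool(seq_b)

# ...but cross-attention with an ad-side query still tells them apart.
query = [1.0, 0.0]
print(cross_attend(query, seq_a))  # leans toward [1.0, 0.0]
print(cross_attend(query, seq_b))  # stays at [0.5, 0.5]
```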

According to Meta, these architectural advances make GEM four times more efficient at driving ad performance gains than the company’s original ads recommendation ranking models.

Transferring Knowledge Across the Ad Stack

GEM functions as a central “teacher” model that transfers its knowledge to hundreds of specialized vertical models (VMs) that serve ads to users. Meta employs both direct and hierarchical transfer strategies, using techniques including knowledge distillation, representation learning, and parameter sharing.
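
Of these techniques, knowledge distillation is the most standard, and its core idea fits in a few lines. The sketch below assumes a single binary conversion label and hypothetical logit values; Meta’s actual transfer recipe combines this with representation learning and parameter sharing.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def distill_loss(student_logit, teacher_logit, label, alpha=0.5):
    """Blend of (a) cross-entropy against the hard conversion label and
    (b) cross-entropy against the teacher's soft prediction."""
    p = sigmoid(student_logit)
    q = sigmoid(teacher_logit)
    eps = 1e-12
    hard = -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
    soft = -(q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    return alpha * soft + (1 - alpha) * hard

# A student that agrees with both the teacher and the label scores lower loss.
good = distill_loss(student_logit=2.0, teacher_logit=2.2, label=1)
bad = distill_loss(student_logit=-2.0, teacher_logit=2.2, label=1)
assert good < bad
```

The soft term is what carries the large model’s knowledge: the teacher’s graded confidence is a richer training signal than the sparse 0/1 label alone.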

A notable innovation is the Student Adapter, which addresses the problem of stale supervision caused by delays in model training. This lightweight component refines GEM’s outputs using the most recent ground-truth data, ensuring that downstream models receive current and relevant signals. Meta reports that these post-training techniques achieve twice the effectiveness of standard knowledge distillation methods.
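
Meta has not published the Student Adapter’s internals, but its described role (a frozen teacher whose outputs are corrected by a small component fit on fresh labels) resembles Platt-style calibration. The sketch below is that stand-in, with invented logits and labels:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_adapter(teacher_logits, labels, steps=500, lr=0.1):
    """Fit a tiny affine correction p = sigmoid(a*z + b) on recent labels.

    A Platt-scaling-style stand-in for the Student Adapter: the teacher
    itself stays frozen; only two scalars are learned from fresh data.
    """
    a, b = 1.0, 0.0
    n = len(labels)
    for _ in range(steps):
        ga = gb = 0.0
        for z, y in zip(teacher_logits, labels):
            err = sigmoid(a * z + b) - y   # dLoss/dlogit for log-loss
            ga += err * z / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# A stale teacher that has drifted: it predicts too high across the board.
teacher_logits = [2.0, 1.5, 1.8, 2.2, 1.9, 2.1]
recent_labels  = [1,   0,   0,   1,   0,   1  ]

a, b = fit_adapter(teacher_logits, recent_labels)
raw = sum(sigmoid(z) for z in teacher_logits) / len(teacher_logits)
adj = sum(sigmoid(a * z + b) for z in teacher_logits) / len(teacher_logits)
base_rate = sum(recent_labels) / len(recent_labels)
# The adapted predictions track the recent base rate more closely.
assert abs(adj - base_rate) < abs(raw - base_rate)
```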

Training at LLM Scale

Training GEM required what Meta describes as a complete overhaul of its training infrastructure. The model operates at a scale typically associated with modern large language models, utilizing thousands of GPUs through multi-dimensional parallelism.

Meta implemented several system-level optimizations including custom GPU kernels for variable-length user sequences, graph-level compilation in PyTorch 2.0, and memory compression techniques such as FP8 quantization. The company also developed specialized GPU communication methods that avoid resource contention between communication and compute workloads.
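
The FP8 detail can be illustrated with the generic quantize/dequantize cycle that all low-precision compression shares. The sketch below uses a uniform int8-style grid for simplicity; real FP8 formats such as E4M3 use a non-uniform floating-point grid, but the shared scale factor works the same way.

```python
def quantize_dequantize(values, levels=127):
    """Symmetric per-tensor quantization: scale into [-levels, levels],
    round to integers, then map back with the same scale factor."""
    amax = max(abs(v) for v in values) or 1.0
    scale = levels / amax
    quantized = [round(v * scale) for v in values]   # what gets stored
    restored = [q / scale for q in quantized]        # what compute sees
    return quantized, restored

weights = [0.82, -0.31, 0.07, -1.24, 0.55]
q, restored = quantize_dequantize(weights)

# Storage drops from 32 bits to 8 bits per value at a small accuracy cost:
# each restored value is within half a quantization step of the original.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= 0.5 / (127 / 1.24) + 1e-12
```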

The results were substantial: a 23-fold increase in effective training FLOPs from 16 times more GPUs, with the remaining gain coming from a 1.43x improvement in Model FLOPS Utilization (MFU); the two multiply out, since 16 × 1.43 ≈ 23. Job startup time was cut by a factor of 5, and PyTorch compilation time by a factor of 7 through caching strategies.

Looking Ahead

Meta plans to extend GEM’s learning to encompass the entire ecosystem of user interactions across all content modalities including text, images, audio, and video. The company envisions evolving GEM toward a unified engagement model capable of ranking both organic content and advertisements.

Future developments include scaling to even larger GPU clusters, implementing inference-time reasoning to optimize compute allocation, and enabling what Meta calls “agentic, insight-driven advertiser automation” to improve return on ad spend.

The announcement positions GEM as a paradigm shift in Meta’s ads recommendation system, applying foundation model techniques refined in the generative AI boom to the distinct challenges of advertising optimization at massive scale.

Posted in AI