OpenAI keeps getting more serious about India, its second-largest market.
OpenAI has launched IndQA, a comprehensive new benchmark designed to evaluate how well AI models understand and reason about Indian culture, languages, and everyday life. The benchmark spans 2,278 questions across 12 languages and 10 cultural domains, created in partnership with 261 domain experts from across India.

Beyond Translation: Testing Cultural Understanding
Unlike existing multilingual benchmarks, which focus primarily on translation or multiple-choice tasks, IndQA is designed, OpenAI says, to probe culturally nuanced, reasoning-heavy questions that those evaluations struggle to capture. The benchmark's ten domains are Architecture & Design, Arts & Culture, Everyday Life, Food & Cuisine, History, Law & Ethics, Literature & Linguistics, Media & Entertainment, Religion & Spirituality, and Sports & Recreation.
The 12 languages covered are Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi, and Tamil. OpenAI specifically included Hinglish given the prevalence of code-switching in Indian conversations.
Expert-Driven, Adversarially Filtered
IndQA’s development involved a rigorous multi-step process. Domain experts—including journalists, linguists, scholars, artists, and industry practitioners—drafted difficult, reasoning-focused prompts tied to their regions and specialties. The contributors ranged from a Nandi Award-winning Telugu actor with over 750 films to an International Chess Grandmaster and a professor specializing in Odishan temple architecture.
Each question then underwent adversarial filtering: it was tested against OpenAI’s strongest models, including GPT-4o, o3, GPT-4.5, and GPT-5, and retained only if a majority of them failed to produce an acceptable answer. This ensures the benchmark maintains headroom for measuring future progress.
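The retention rule described above can be sketched roughly as follows. This is a toy illustration, not OpenAI's actual pipeline: the model names are taken from the article, but the stubbed pass/fail judge and the strict-majority threshold are assumptions.

```python
def adversarially_filter(questions, models, judge):
    """Keep only questions that a strict majority of the models fail.

    judge(model, question) -> True if the model's answer is acceptable.
    In the real pipeline this judgment would come from grading actual
    model outputs; here it is stubbed for illustration.
    """
    retained = []
    for q in questions:
        failures = sum(1 for m in models if not judge(m, q))
        if failures > len(models) / 2:  # strict majority failed -> keep
            retained.append(q)
    return retained


# Toy demonstration: "q2" is failed by 3 of 4 models, so only it survives.
models = ["gpt-4o", "o3", "gpt-4.5", "gpt-5"]
outcomes = {
    ("gpt-4o", "q1"): True,  ("o3", "q1"): True,
    ("gpt-4.5", "q1"): True, ("gpt-5", "q1"): False,
    ("gpt-4o", "q2"): False, ("o3", "q2"): False,
    ("gpt-4.5", "q2"): True, ("gpt-5", "q2"): False,
}
judge = lambda m, q: outcomes[(m, q)]
print(adversarially_filter(["q1", "q2"], models, judge))  # ['q2']
```

The strict-majority cutoff is one plausible reading of "a majority of these models failed"; the published benchmark may use a different threshold or grading procedure.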
The benchmark uses a rubric-based grading approach. Each response is evaluated against criteria written by domain experts for that specific question, with weighted point values based on importance. A model-based grader checks whether each criterion is met, and the final score represents the sum of points achieved out of the total possible.
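A minimal sketch of this weighted-rubric scoring, assuming hypothetical criteria and weights and a stubbed grader (in the actual benchmark, a model-based grader decides whether each criterion is met):

```python
def rubric_score(criteria, is_met):
    """criteria: list of (criterion_text, weight) pairs written by an expert.
    is_met(criterion_text) -> True if the grader judges the criterion satisfied.
    Returns (points_earned, points_possible)."""
    earned = sum(w for c, w in criteria if is_met(c))
    total = sum(w for _, w in criteria)
    return earned, total


# Illustrative rubric for one question; weights reflect importance.
criteria = [
    ("Identifies the regional origin", 3.0),
    ("Explains the cultural context", 2.0),
    ("Notes the relevant historical period", 1.0),
]

# Stub grader: pretend the response satisfied the first and third criteria.
met = {"Identifies the regional origin", "Notes the relevant historical period"}
earned, total = rubric_score(criteria, lambda c: c in met)
print(f"{earned}/{total} = {earned / total:.0%}")  # 4.0/6.0 = 67%
```

The final benchmark score is then an aggregate of these per-question fractions across the dataset.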
Current Performance Reveals Room for Growth
OpenAI’s latest results show significant improvement over time, but substantial gaps remain. GPT-5 Thinking High, the top performer, achieved just 34.9% on the benchmark. Gemini 2.5 Pro Thinking scored 34.3%, while earlier models like GPT-4 Turbo managed only 12.1%.

Performance varied significantly across languages, with Hindi showing stronger results than Telugu or Bengali. Similarly, domains like History and Arts & Culture saw better performance than Law & Ethics or Architecture & Design.
OpenAI has acknowledged important caveats: because questions differ across languages, IndQA shouldn’t be interpreted as a language leaderboard. Additionally, the adversarial filtering against OpenAI’s own models could potentially disadvantage them compared to non-OpenAI models in relative performance comparisons.
Strategic Implications for India’s AI Landscape
Doubling Down on the Second-Largest Market
OpenAI’s investment in IndQA signals its commitment to India, which represents ChatGPT’s second-largest market globally. OpenAI has already made its Go plan free in India and is setting up an office in the country. With approximately one billion Indians who don’t use English as their primary language, and 22 official languages—at least seven of which have over 50 million speakers each—the market opportunity is substantial. IndQA represents a strategic bet on deepening engagement with Indian users and improving product-market fit in a critical geography.
Challenge to Sarvam and India’s Indigenous AI Ambitions
The launch of IndQA puts OpenAI in direct competition with Sarvam AI, the government-backed Indian startup building India-specific language models. While Sarvam has positioned itself as understanding the nuances of Indian languages and contexts better than foreign competitors, OpenAI’s comprehensive benchmark—built with 261 Indian experts—could undermine that positioning.
For Sarvam and other Indian AI companies, this raises the stakes considerably. If OpenAI establishes IndQA as the standard for evaluating Indic language capabilities, it effectively controls the metrics by which success is measured in this domain. This could make it harder for Indian startups to differentiate themselves, even if their models perform well on locally developed benchmarks.
The Problem of Expert Bias and Foreign Ownership
While OpenAI’s reliance on Indian domain experts lends credibility to IndQA, it also introduces potential concerns about bias and representation. The selection of 261 experts, however diverse, represents a curated perspective on what matters in Indian culture and languages. Questions about foreign control of India’s narratives—always a touchy issue in the country—remain relevant.
More fundamentally, having a US-based company create what could become the de facto standard for evaluating AI performance on Indian languages raises questions about technological sovereignty. An Indian institution or consortium creating an Indic benchmark that achieved widespread adoption would have offered several advantages: greater legitimacy in representing diverse Indian perspectives, alignment with national interests in AI development, and the ability to iterate based on local feedback and priorities.
What’s Next
OpenAI has indicated that while India was the obvious starting point given its scale and linguistic diversity, the company plans to create similar benchmarks for other languages and regions. The release of IndQA is intended to inform and inspire new benchmark creation from the research community, particularly for languages and cultural domains poorly covered by existing evaluations.
For India’s AI ecosystem, the key question is whether local players will rise to the challenge—not just by performing well on IndQA, but by creating alternative benchmarks that better represent Indian priorities and perspectives in the global AI conversation.