The AI revolution has produced plenty of ‘wow’ moments for the tech world, and this one ranks right up there.
Toronto-based AI hardware startup Taalas emerged from stealth this week with a bold claim: it has built custom silicon that runs large language models nearly 10 times faster than the current state of the art, at a fraction of the cost. The company is backing that claim with a public product — a hard-wired implementation of Meta’s Llama 3.1 8B model, available now as both a chatbot demo called ChatJimmy and an inference API.
The debut product achieves 17,000 tokens per second per user, compared with roughly 1,800 tokens per second on leading GPU-based systems, while costing a twentieth as much to build and drawing a tenth of the power of comparable GPU infrastructure. For context, that kind of speed pushes AI responses well past the threshold of human perception: interactions feel, for all practical purposes, instantaneous. Where users have grown accustomed to waiting a few seconds for a response from AI models, the speed of ChatJimmy's replies is genuinely startling. Much as Google displayed its search times in its early days, ChatJimmy shows how long each result took to generate. For some test queries we threw at it, we got answers in 0.044 seconds, at 15,800 tokens per second.
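To put those figures together: at 15,800 tokens per second, a 0.044-second answer works out to roughly 700 tokens of output, and the headline 17,000-token rate implies under a tenth of a millisecond per token. A quick sketch of that arithmetic, assuming the reported rate is the pure generation rate:

```python
# Sanity-checking the reported figures. Assumption: the quoted tokens/s is
# the raw generation rate, so answer length ≈ rate × elapsed time.
rate = 15_800                 # tokens/s observed on our test queries
elapsed = 0.044               # seconds for a complete answer
print(f"~{rate * elapsed:.0f} tokens in the answer")       # ~695 tokens
print(f"~{1000 / 17_000:.3f} ms per token at 17k tok/s")   # ~0.059 ms
```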


Taalas: Rethinking the Hardware Stack From Scratch
Founded in 2023 by CEO Ljubisa Bajic — previously a founder at chip startup Tenstorrent — alongside co-founders Drago Ignjatovic and Lejla Bajic, Taalas has quietly spent the better part of three years building what it describes as a fundamentally different approach to AI inference hardware. The team of roughly two dozen engineers, many of whom have worked together for over two decades, developed its first product for just $30 million of the more than $219 million raised to date, which includes a reported $169 million round earlier this year with Fidelity among the investors.
The company’s core insight is deceptively simple: rather than running an AI model as software on general-purpose hardware, make the model itself the computer. Taalas calls these “Hardcore Models” — AI architectures etched directly into custom silicon chips, with computation and storage unified on a single chip rather than separated across the memory-compute divide that has long constrained traditional inference hardware. Each chip is thus specialized to one particular model, and that specialization is what generates the tremendous efficiency gains.

That divide, Bajic explains in the company’s launch post, is at the root of much of the complexity and cost in modern AI infrastructure — the high-bandwidth memory stacks, advanced packaging, liquid cooling systems, and massive power consumption that define today’s GPU data centers. By eliminating it entirely, Taalas was able to redesign the hardware stack from first principles, removing the need for exotic or expensive components and dramatically cutting system cost.
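A rough back-of-envelope shows why that divide matters so much. On a GPU serving a single user, every generated token requires streaming the full set of model weights across the memory bus, so bandwidth alone caps the token rate. The numbers below are illustrative assumptions (FP16 weights, H100-class HBM bandwidth), not figures from Taalas:

```python
# Why the memory-compute divide caps GPU inference speed, per user stream.
# Illustrative assumptions: FP16 weights, ~3.35 TB/s of HBM bandwidth
# (roughly an H100), batch size 1 so all weights are read for every token.
PARAMS = 8.0e9                # Llama 3.1 8B parameter count
BYTES_PER_PARAM = 2           # FP16
HBM_BANDWIDTH = 3.35e12       # bytes/s (assumed)

weight_bytes = PARAMS * BYTES_PER_PARAM           # ~16 GB read per token
seconds_per_token = weight_bytes / HBM_BANDWIDTH  # bandwidth-bound latency
print(f"~{1 / seconds_per_token:.0f} tokens/s ceiling for one stream")
# ~209 tokens/s: the weights can't cross the memory bus any faster. Storing
# the weights on the die itself removes this bottleneck entirely.
```

GPU systems climb past that ceiling with batching, speculative decoding, and multi-chip parallelism, but the per-user bound explains why hard-wiring the weights changes the equation.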
A Platform, Not Just a Chip
What makes Taalas particularly interesting to watch is that the Llama 3.1 8B chip is not the end goal; it's a proof of concept for a broader platform. The company says it can take any previously unseen AI model and realize it in hardware within two months. The resulting chips support fine-tuning via low-rank adapters, and applications for them are written in natural human language rather than traditional code. Taalas frames it starkly on its homepage: “The Model is The Computer.”
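Taalas hasn't published its API specification in this announcement, but many inference providers expose OpenAI-compatible chat endpoints, so a first call might look something like the sketch below. The URL, model name, and request shape are all placeholder assumptions, not confirmed details:

```python
# Hypothetical sketch of calling a hard-wired model over an inference API.
# The endpoint, model identifier, and OpenAI-compatible request shape are
# assumptions; consult Taalas's actual API documentation.
import time
import requests

API_URL = "https://api.taalas.example/v1/chat/completions"  # placeholder
payload = {
    "model": "llama-3.1-8b-hardcore",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain HBM in one sentence."}],
}

start = time.perf_counter()
resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": "Bearer <your-key>"})
elapsed = time.perf_counter() - start

print(f"{elapsed:.3f}s wall clock")
tokens = resp.json().get("usage", {}).get("completion_tokens")
if tokens:
    print(f"~{tokens / elapsed:.0f} tokens/s observed end to end")
```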
A second product — a mid-sized reasoning LLM still based on the company’s first-generation silicon platform — is expected in labs this spring, with integration into the inference API to follow. A frontier-scale LLM built on Taalas’ second-generation silicon, which promises even higher density and faster execution, is planned for later this year.
The Bigger Picture
Taalas draws an explicit parallel to the arc of general-purpose computing — from ENIAC’s room-filling vacuum tubes to the transistor revolution that eventually put a computer in every pocket. The argument is that AI is following the same trajectory, and that today’s data center behemoths are the ENIAC moment: impressive, transformative, but ultimately impractical at the scale the world needs.
Whether Taalas can deliver on that vision at frontier model scale remains to be seen. The current Llama implementation uses aggressive quantization — combining 3-bit and 6-bit parameters — which the company acknowledges introduces some quality degradation relative to GPU benchmarks. The second-generation silicon adopts standard 4-bit floating-point formats to address this.
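The quantization arithmetic also hints at why on-chip storage becomes feasible at all. The exact 3-bit/6-bit mix isn't disclosed, so the even split below is an assumption for illustration:

```python
# Weight-storage footprint of an 8B-parameter model at various precisions.
# Assumption: a 50/50 mix of 3-bit and 6-bit weights (the real ratio is
# not public), ignoring quantization metadata such as scales.
PARAMS = 8.0e9

def footprint_gb(bits_per_param: float) -> float:
    """Storage needed for the weights at a given average precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP16 baseline:            {footprint_gb(16):.1f} GB")   # 16.0 GB
print(f"Mixed 3/6-bit (assumed):  {footprint_gb(4.5):.1f} GB")  #  4.5 GB
print(f"FP4, second-gen silicon:  {footprint_gb(4):.1f} GB")    #  4.0 GB
# Shrinking the weights ~4x is what makes it plausible to keep an entire
# 8B model in on-die memory instead of external HBM stacks.
```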
But the early numbers are striking enough that developers would do well to pay attention. Taalas is offering API access now, and the implications of sub-millisecond per-token inference, from real-time agents to embedded applications to human-AI collaboration at the speed of thought, are worth exploring sooner rather than later.