#601 The AI Bottleneck Is No Longer GPUs. It’s Energy and Memory…

In this episode of The CTO Show with Mehmet, Mehmet sits down with Eugene Cheah, CEO of Featherless AI. The AI bottleneck is no longer just GPU access. Power, memory, inference cost, and model reliability are becoming the real constraints.

Eugene reframes the AI infrastructure debate away from a simple race for bigger models and more chips. The conversation connects energy capacity, HBM shortages, open source model adoption, linear attention architectures, and the enterprise need for predictable AI systems. It also challenges the assumption that the best AI strategy is always to use the largest available model.

If you are building, investing in, or operating AI infrastructure, this conversation gives a clearer view of where AI economics, hardware constraints, and production reliability are heading.

About the Guest

Eugene Cheah is the CEO of Featherless AI, an AI startup making open source AI models accessible through a single platform.

Featherless AI grew out of AI research and optimization work on the RWKV architecture, with a focus on reducing inference cost and making AI models more accessible. Eugene’s work sits at the intersection of open source AI, model efficiency, GPU infrastructure, HBM constraints, and inference optimization.

He is well positioned to frame this shift because Featherless AI works directly on the infrastructure layer between developers, open models, and production inference.

LinkedIn: https://www.linkedin.com/in/eugene-cheah-a47791126/

Website: https://featherless.ai

Key Takeaways

• AI infrastructure constraints are shifting from GPU access to power, memory, and inference efficiency.
• HBM scarcity becomes more serious as models and context windows continue to grow.
• Bigger models do not solve the enterprise problem of reliable execution.
• Open source models are becoming strong enough to replace many closed model use cases.
• Fine-tuned smaller models can outperform frontier models on narrow enterprise tasks.
• Nvidia’s moat weakens when developers can move workloads across more hardware choices.
• Linear attention architectures matter because the quadratic scaling of standard transformer attention with context length is economically unsustainable.
• Enterprises value model control when closed providers change, deprecate, or restrict models too often.

What You Will Learn

• The real infrastructure bottlenecks behind AI deployment beyond GPU availability.
• How HBM pressure affects model size, context length, and inference economics.
• Why energy capacity can delay AI infrastructure even when chips are already available.
• How open source models are changing enterprise AI adoption and deployment control.
• Why smaller fine-tuned models can beat larger models on specific production tasks.
• When linear attention architectures reduce memory demand compared with transformer attention.
• What hardware choice, model portability, and local inference mean for AI infrastructure strategy.
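
The memory points above can be made concrete with back-of-the-envelope arithmetic. The sketch below is not from the episode. It assumes a hypothetical transformer configuration (32 layers, 8 key/value heads, head dimension 128, 2-byte values) and a hypothetical fixed-size recurrent state of one head_dim x head_dim matrix per head per layer, which is roughly the shape of state that linear attention designs keep. All function names and numbers are illustrative assumptions, not measurements of any real model.

```python
# Rough memory arithmetic for one sequence. All sizes are hypothetical.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading 2 accounts for storing both keys and values.
    This grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_entries(seq_len):
    """Query-key pairs scored per head per layer over a full sequence.

    This grows quadratically with seq_len (ignoring causal-mask savings).
    """
    return seq_len * seq_len


def fixed_state_bytes(n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes for an assumed recurrent state that does not depend on seq_len."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


MIB = 1024 * 1024

for seq_len in (4_096, 32_768, 262_144):
    print(
        f"context={seq_len:>7} tokens | "
        f"KV cache={kv_cache_bytes(seq_len) / MIB:>8.0f} MiB | "
        f"score pairs={attention_score_entries(seq_len):>14,} | "
        f"fixed state={fixed_state_bytes() / MIB:.0f} MiB"
    )
```

Under these assumptions, the cached keys and values grow in step with context length, the number of attention score pairs grows with its square, and the recurrent state stays the same size however long the context gets. Real deployments differ in many details (batching, quantization, attention kernels), so treat this as an illustration of the scaling shapes only.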

Episode Highlights

00:00 — AI infrastructure moves beyond the GPU race

03:30 — Nvidia, AMD, and Huawei follow different hardware strategies

07:30 — Power becomes the first AI infrastructure bottleneck

08:30 — HBM pressure exposes the memory constraint

12:00 — AI follows the same pluralism as databases

15:00 — Developers start with big models, then specialize

18:30 — Transformer memory scaling becomes an economic problem

23:30 — Hardware choice starts weakening platform lock-in

29:30 — Reliability matters more than raw intelligence

36:00 — Open source gives enterprises model control

41:30 — Small models can now build real applications

Resources Mentioned

• Featherless AI: https://featherless.ai
• RWKV architecture: the model architecture Eugene references as part of Featherless AI’s research background

Listen Now

Available on all major podcast platforms and YouTube.

Connect with the Show

Follow The CTO Show with Mehmet for more conversations at the intersection of technology, startups, and venture capital.