May 24, 2026

#601 The AI Bottleneck Is No Longer GPUs. It’s Energy and Memory | Eugene Cheah


In this episode of The CTO Show with Mehmet, Mehmet sits down with Eugene Cheah, CEO of Featherless AI. The AI bottleneck is no longer just GPU access. Power, memory, inference cost, and model reliability are becoming the real constraints.

Eugene reframes the AI infrastructure debate away from a simple race for bigger models and more chips. The conversation connects energy capacity, HBM shortages, open source model adoption, linear attention architectures, and the enterprise need for predictable AI systems. It also challenges the assumption that the best AI strategy is always to use the largest available model.

If you are building, investing in, or operating AI infrastructure, this conversation gives a clearer view of where AI economics, hardware constraints, and production reliability are heading.

About the Guest

Eugene Cheah is the CEO of Featherless AI, an AI startup making open source AI models accessible through a single platform.

Featherless AI started from AI research and optimization work around RWKV architecture, with a focus on reducing inference cost and making AI models more accessible. Eugene’s work sits at the intersection of open source AI, model efficiency, GPU infrastructure, HBM constraints, and inference optimization.

He is well positioned to frame this shift because Featherless AI works directly on the infrastructure layer between developers, open models, and production inference.

LinkedIn: https://www.linkedin.com/in/eugene-cheah-a47791126/

Website: https://featherless.ai

Key Takeaways

AI infrastructure constraints are shifting from GPU access to power, memory, and inference efficiency.

HBM scarcity becomes more serious as models and context windows continue to grow.

Bigger models do not solve the enterprise problem of reliable execution.

Open source models are becoming strong enough to replace many closed model use cases.

Fine-tuned smaller models can outperform frontier models on narrow enterprise tasks.

Nvidia’s moat weakens when developers can move workloads across more hardware choices.

Linear attention architectures matter because quadratic memory scaling is economically unsustainable.

Enterprises value model control when closed providers change, deprecate, or restrict models too often.

What You Will Learn

The real infrastructure bottlenecks behind AI deployment beyond GPU availability.

How HBM pressure affects model size, context length, and inference economics.

Why energy capacity can delay AI infrastructure even when chips are already available.

How open source models are changing enterprise AI adoption and deployment control.

Why smaller fine-tuned models can beat larger models on specific production tasks.

When linear attention architectures reduce memory demand compared with transformer attention.

What hardware choice, model portability, and local inference mean for AI infrastructure strategy.

Episode Highlights

00:00 — AI infrastructure moves beyond the GPU race

03:30 — Nvidia, AMD, and Huawei follow different hardware strategies

07:30 — Power becomes the first AI infrastructure bottleneck

08:30 — HBM pressure exposes the memory constraint

12:00 — AI follows the same pluralism as databases

15:00 — Developers start with big models, then specialize

18:30 — Transformer memory scaling becomes an economic problem

23:30 — Hardware choice starts weakening platform lock-in

29:30 — Reliability matters more than raw intelligence

36:00 — Open source gives enterprises model control

41:30 — Small models can now build real applications

Resources Mentioned

Featherless AI: https://featherless.ai

RWKV architecture: AI architecture referenced by Eugene as part of Featherless AI’s research background

Listen Now

Available on all major podcast platforms and YouTube.

Connect with the Show

Follow The CTO Show with Mehmet for more conversations at the intersection of technology, startups, and venture capital.

[00:00:00]

Mehmet: Hello and welcome back to a new episode of The CTO Show With Mehmet. Today, I'm very pleased to have joining me a guest who originally settled in San Francisco on the West Coast, but today he's in Singapore and it's late, uh, in his night, so thank you for that. I want to welcome Eugene Cheah. He's the CEO of Featherless AI.

Um, as you can guess from the word AI, we're gonna discuss as in all the recent episodes about AI, but today from a different perspective, like a topic which is personally I'm passionate about it and, you know, we've seen a lot of changes in that domain. So basically we're gonna talk about, you know, how we can utilize the AI in a much better way.

We're gonna talk about GPUs, we're gonna talk about, you know, the APIs around the AI, and I think, you know, I don't want to steal the, the time from you, Eugene. Um, just for people who might not know you, tell us a little more about you, your background, your journey, and what, you know, brought you to start Featherless? [00:01:00]

Eugene: Yeah. So I'm Eugene. Uh, a little bit of a background, uh, uh, about Featherless AI is that we are an AI startup tr- that is trying to make all of the open source AI accessible. What brought us here is that we actually originally started as an AI research company, and we still are, where, where we were trying to work on next generation AI architectures, uh, some- that is like 100 or 1,000X cheaper in inference based on the RWKV architecture.

So we, we were one of the first AI architectures and models under the Linux Foundation, and we were iterating on that next generation architecture. Uh, as we brought it into market, uh, we quickly realized that, hey, most people are actually excited about the current generation, and the next generation may require more research to, uh, reach maturity.

But we decided to like, uh, start applying what we knew about the next generation into the current generation models to optimize AI inference, and... [00:02:00] But more importantly, we wanted to make AI accessible. Uh, one thing that I, I say many times is like, my grandma speaks seven languages, but she doesn't speak English or Chinese.

And when I see the world of AI, uh, that is being presented, uh, when I was a kid, I like to see, like imagine the world of Jetsons, where everyone has their own AI robot helping their family. Not a world of AI where it's like centralized to a handful of models, and not particularly a world where the AI is dominated only by the English-speaking models or the Chinese-speaking models, but multiple models around the world that can represent each of us uniquely.

And that is the world that I'm trying to build with Featherless by trying to make all of the AI, uh, models instantly accessible. And so where we started from AI optimization and research, trying to make AI, uh, much cheaper and more efficient, we quickly went on to not just making [00:03:00] it cheaper to run, but more accessible, so that everyone around the world in this future economy can have access to AI.

Mehmet: Great.

And thank you again, Eugene, for being here with me today. Now, of course, w- before I come and record with my guest, I try to do some research and see, you know, c- a couple of things that you might have, you know, said here, um, uh, on, on maybe other podcasts or maybe in, in other, uh, medias. So I caught something interesting, and it's related to maybe, you know, more the hardware.

So you described the AI hardware race as a StarCraft battle between Nvidia's Protoss, AMD's Terran, and Huawei's Zerg. Like, what does each strategy optimize for, and which one do you think wins the inference era?

Eugene: Well, you really dig hard on that one. Actually, I, I can't even remember. Yeah, I was planning to write an article on that, but I haven't wrote an article on that.

Um, the... Yeah. So the... [00:04:00] I, I think it's very easy to portray the three, especially for those who are familiar with the StarCraft Terran-Zerg analogy. Uh, for those who are not familiar with this video game analogy, the Protoss is the high tech. Think of them like the elf-like or the extremely futuristic tech race, and they are the most expensive units in the game.

So you... So they, they are extremely expensive, they're extremely powerful, and they are the best at what they do. Uh, the... And so that is reflective of Nvidia's, uh, pricing actually. Uh, Nvidia is the market leader. They have some of the best hardware in the market, and they are priced at a premium. Terrans are the human race in the game, and they, they tend to be more, uh...

They are still f- uh, fa- uh, fairly advanced in tech and, and they [00:05:00] tend to, tend to be more practical, more nitty-gritty. Uh, they are-- You can build them in larger numbers because they are much more cost-effective than, than the Protoss, but they get the job done, and they, they are extremely effective units.

Zergs take the most extreme approach where they are the insect race, and where, where they are the cheaper units, and they are... None of the Zerg units are, like, as strong as the Terran units or the Protoss units, but they make it up in numbers. And this is actually quite a useful reflection between, between the three because, um, the Terran, the, uh, the Terran or AMD, uh, their hardware is good, in some cases on par, uh, with some of the best of NVIDIA's, uh, hardware.

Uh, and they are doing that catch-up respectively. They may not be as polished or as refined as the best hardware, but they get the job done. The [00:06:00] Huawei chips that right now are, uh, being developed mostly within the Chinese market, in part because of the current trade restrictions and all that, may be inferior in specs.

Um, um, the, the latest Huawei chips are the equivalent or even still inferior to the NVIDIA A100. But a lot of analysts may end up dismissing, "Oh, it's behind." But they forget that if there's one thing that China does well, it's mass manufacturing.

Mehmet: Mm-hmm.

Eugene: And, and because they are not dependent on the supply chain that the rest of the world is, they can manufacture in much larger numbers.

Um, the other major thing is, if you keep track of the AI race, a lot of the bottleneck is not just the hardware. It's in energy capacity. And within China, uh, uh, they... It's building up the largest energy rollout right now of any country. [00:07:00] Um, and they are able to tap into this much cheaper energy as they roll this out.

So it's really the Zerg of the AI inference centers. While that hardware may not be as good for training, it is perfectly fine for inference, and for every one high-end Nvidia chip, you can probably buy 10 Huawei chips. And if the energy cost is cheap enough, you can outnumber it.

Mehmet: Right.

Eugene, there's-- y- I, I'm happy you, you brought up, you know, the, the, the bottleneck of the energy, but there's another bottleneck we started to hear recently about, um, because, you know, people think it's only about the GPUs and, you know, who's, you know, have the access to GPUs. Uh, there's a lot of, um, I would say, um, you know, causes for, for what's happening, which is, uh, and I think some people might have heard about it, which is the high bandwidth memory.

It's, uh, like, it's abbreviated [00:08:00] as HBM. Uh, do you think, like, this is another, um, you know, hidden constraint for AI infrastructure? And what happens, you know, if, if, um, you know, AI economics is, is, you know, from compute perspective available, but in terms of memory it's not?

Eugene: Right now... Okay, so right now the main bottleneck is still energy, to be clear. Um, like from what I understood for a lot of data center providers, they literally have the high-end GPU chips on standby waiting for energy to come online. So this includes the HBM and everything. But I understand where the, the concerns about HBM come from because, uh, in particular, it's the one thing that even consumers feel right now.

Like your co- your laptop RAM and, and desktop RAM has tripled in price, i- in part because of the demands of AI for HBM and, and in particular for, for high bandwidth memory because it [00:09:00] shares the same, um, uh, same manufacturing line as standard, uh, uh, uh, computer memory, um, or memory in general. This is led in part by the demand for larger and larger AI models, um, and, and running at larger and larger context length.

So you may have heard ... Some people may have heard like, oh, uh, some of the best open AI models, no, open source AI models, are like 1 to 2 trillion parameters. Uh, then you will also hear, allegedly, because these are not public figures, that the major closed source models are like 5 or 6 trillion parameters.

When we say trillions of parameters, you can convert that to terabytes, give or take, of GPU memory, where there are some techniques like FP4 quantization that may halve the amount. We're still talking about terabytes of GPU memory. And GPU memory, VRAM, is much higher speed, [00:10:00] higher bandwidth than your PC memory.
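A back-of-the-envelope version of that parameter-to-memory conversion might look like the sketch below. It is an illustration only: it assumes weights alone (no KV cache or activations), decimal terabytes, and one uniform bit width per parameter, and the function name is made up for this example.

```python
def weight_memory_tb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal terabytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e12


# Parameter counts loosely based on the figures mentioned in the conversation.
for params in (1e12, 2e12, 5e12):
    for bits in (16, 8, 4):
        print(f"{params / 1e12:.0f}T params at {bits}-bit: "
              f"{weight_memory_tb(params, bits):.1f} TB")
```

Under these assumptions, one trillion parameters at 8 bits each is about one terabyte, and a 4-bit format such as FP4 halves that.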

How many of us own a computer with a terabyte of memory, let alone GPU memory? And so that is the second tier bottleneck. Um, and in particular, as we are trying to push the models bigger and bigger with longer context, there is that heavy demand for it. However, research is already actively being done to reduce this demand.

Um, so you may have heard recently, like Google's TurboQuant paper. Uh, alternatively, most of the open source models being developed right now run on what is called the MLA attention architecture, which kind of reduces the memory consumption of the traditional attention architecture to about 1/10th for each request.

Uh, likewise, on our side, the research that we do is on linear attention architecture, some of which has already been applied in the latest, uh, Qwen [00:11:00] 3.5 or 3.6 open source models, where those layers are essentially a 100x reduction in, um, memory usage. But because it only applies to three out of four of the layers, you can think of it as only one quarter of the memory usage.
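One way to read the "one quarter" figure is as a weighted average over layers. The sketch below assumes three out of every four layers are linear-attention layers whose per-request cache is 100 times smaller, and that every layer would otherwise cost the same. Both numbers are rough figures taken from the conversation, not measurements, and the function name is hypothetical.

```python
def relative_cache_memory(linear_fraction: float, reduction: float) -> float:
    """Cache memory of a hybrid model relative to an all-attention model (1.0)."""
    full_attention_part = 1.0 - linear_fraction   # layers left unchanged
    linear_part = linear_fraction / reduction     # layers with a smaller cache
    return full_attention_part + linear_part


# Assumed: 75% of layers are linear attention, each with a 100x smaller cache.
print(relative_cache_memory(0.75, 100.0))
```

With these inputs the remaining full-attention layers dominate, so the total lands just above one quarter of the original.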

And we will keep researching in this direction, uh, so that we can, like, manage the memory demands for the AI models respectively as we, uh, improve on it generation by generation. But until we go through the inflection point of rolling out the hardware, that's where the HBM is in short supply.

Mehmet: Right. Uh, a lot of points, you know, that I want to follow up, but I will take them one by one.

Now, um, so again, when in the introduction you mentioned about like, you know, why you built, uh, Featherless, you know, and it's about, you know, the, uh, what you call it, the model pluralism, right? Why do you believe the market eventually shifts from the few giant models to thousands of specialized [00:12:00] ones?

Like, what's the philosophy behind it?

Eugene: The... It's the same rhythm that we actually see happening within the o- open source space, um, that happened in the past. Uh, I have... Whenever anything becomes too valuable, we as developers cannot help but try to replicate it to get a share of the pie. And whe- and whether, whether is it backed by, uh, mil- uh, billionaires or millionaires, we will try to ensure that happens.

And one major example that, where history has played out is, uh, within this space is, uh, itself, let's say computer server technologies, uh, operating systems, uh, co- uh, or even, even more specifically like database technology. Like take for example, when the IBM DB2 first came out before the dot-com bubble.

Mehmet: Mm-hmm.

Eugene: You can hear analysts and experts saying the same thing. Oh, they spent millions at that time, well, billions now, to [00:13:00] develop this fundamental piece of database technology. No one can compete with them. And then as the cost to develop these things go down, subsequently, uh, you will, uh, you, you hear, uh, MSSQL or, or, uh, Oracle SQL or...

That, that these are like competing database technologies from other major players that come in. And then you start hearing people say, "Oh, you can't compete. These three giants will dominate the market." Which is the stage that AI is at. So for databases that took five years, for AI it took less than two years.

And then the next wave beyond that, it'll be, "Hey, there's the MySQL," uh, which is an o- the fir- one of the first major open source database. And then, uh, then subsequently you hear, "Oh, let's try something different." NoSQL. And then subsequently there's now like over 100 different versions of SQL with different open source packages.

And the end result was not a case of, [00:14:00] hey, uh, one database technology went on to rule them all, uh, as some people would have expected. It was a case of like, if you look today, every enterprise uses an entire collection of several databases, all with their strengths and weaknesses, and applies them for different use cases.

And that is going to happen with AI. We are already starting to see the early signs of that happening, right? For example, AI model, even for the same architecture, right? We are starting to realize that, hey, image and video generations, uh, prefer, let's say, the diffusion-based architecture. And, uh, there's already a lot of open source image and video generation models, and likewise text models may prefer a different set of architecture.

And we start to realize these things, and as the cost to make these AI models goes down, and it has been going down, we start to realize more and more players will create these open source models.

Mehmet: Right. Now, uh, for the people who [00:15:00] doesn't know, uh, you know, what you're doing is something that can simplify a lot of things for developers and, you know, people who want to build application over AI.

Because what you do, you host one of the largest collections of AI models behind just one single API. So just instead of putting hundreds of API keys, so it just like, like one, and then, you know, you have access to all these models. Are you seeing any special pat- patterns like, um, how developers and enterprise actually they are choosing models in production?

And the reason I'm asking you this, Eugene, you know, like even myself, uh, uh, you know, I, I don't consider myself too much into the technical side, but these AI agents, you know, like they let you scratch your head and say, "Hey, let me try this. Let me try this model." So are you seeing a- any s- any patterns on, on preferences like what really developers and enterprises want, especially with the, you know, rise of agentic AI now?

Eugene: Yeah. So it really changes de- uh, depending on the stage of the company, uh, I would say. So what we see very commonly is [00:16:00] like when first... when people first come u- uh, to try open source models, um, probably because they are more familiar with, with the closed source models, they'll, they'll start with the biggest and the best.

So we're talking about the, like the latest generation of the Qwen model, the Kimi model, or DeepSeek, like all these big names. And then subsequently, as they try and play around with these models, um, they s- they start to, they start to realize, "Hey, for certain specific use cases, I can use one of the open models that are specifically fine-tuned for these tasks that are much more reliable for, for those specific tasks."

Uh, and, or, the other path respectively could be, "Hey, my company has rather custom requirements, and I want to fine-tune a model based on one of the open models to be able to more efficiently tackle this." So, for example, we have seen some financial institutions where they specifically use some of the fine-tuned models to actually support [00:17:00] coding and development in their specific style of SQL and, and internal languages.

Um, for example, also recently Shopify, one of the major, uh, e-commerce giants, they internally went to fine-tune a small 27B model specifically on Shopify's internal, uh, DSL programming language on how to customize the Shopify platform, which they are now exposing to all their customers through an AI agent respectively. And they found that because they were able to tune it specifically to what they want, it was able to outperform even the best frontier model today in terms of the ability to modify the Shopify platform.

At a fraction of the cost. And so more and more companies are starting to realize as they scale up AI solutions, uh, that they can start to optimize this thing, customize these things, and to make things [00:18:00] more tailor-made to their platform at a much more efficient level. Um, it is s- still not something that is instantly done on day one because, like, as with any AI startup, I even advise, your first step is to build the prototype and to get users to use it.

Then your second step is to optimize it. If you are fine-tuning your model at the very start, you might be fine-tuning a model where there's only one customer, and that will not work.

Mehmet: Right. Now, back to what you mentioned before about, you know, the, this large language model and, you know, the b- bottle-bottlenecks.

Do you think that the architecture of these transformer, uh, models itself becomes unsustainable at scale because of, you know, all the constraints we, we talked about? Like, and i- if yeah, what would be, you know, the alternative? And I know, [00:19:00] like, you've helped developing, you know, and sorry if I'm, you know, not pronouncing it the right name.

I'm gonna say the abbreviations, the RWKV, which, or some people call it the R- Ro- RoKV, right? So, um, so h- h- how this is the whole thing is also shaping the way we, we architect the model itself, Eugene?

Eugene: Yeah. So one of the things that I find, um, weird in current AI architecture is that its memory consumption, back to the HBM, um, scales quadratically with context length.

So context length is the amount of tokens you put in. So the more ... As you chat longer and longer with ChatGPT, unless tricks were being done to compress it, the amount of memory it requires scales quadratically. This is unnatural. If I, as a human, for every second I'm alive, my brain needs to expand quadratically, by the time I'm 10, my brain would have exploded. [00:20:00]
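To make the growth rates concrete, the sketch below counts stored values for three things as context length grows: a naively materialized attention score matrix (quadratic), a key/value cache (linear), and a fixed-size recurrent state of the kind linear-attention designs keep (constant). The head dimension is hypothetical, the counts cover a single head in a single layer, and optimized kernels avoid storing the full score matrix, so this illustrates scaling behavior rather than modeling any real system.

```python
def naive_score_entries(context_len: int) -> int:
    """Entries in a fully materialized context_len x context_len score matrix."""
    return context_len * context_len


def kv_cache_entries(context_len: int, head_dim: int) -> int:
    """Keys plus values cached for every token seen so far."""
    return 2 * context_len * head_dim


def recurrent_state_entries(head_dim: int) -> int:
    """A head_dim x head_dim state that does not grow with context length."""
    return head_dim * head_dim


HEAD_DIM = 64  # hypothetical value, chosen only for illustration

for n in (1_000, 10_000, 100_000):
    print(n,
          naive_score_entries(n),
          kv_cache_entries(n, HEAD_DIM),
          recurrent_state_entries(HEAD_DIM))
```

Multiplying the context length by ten multiplies the score matrix by a hundred and the cache by ten, while the recurrent state stays the same size.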

It's an unnatural property of the AI model and architecture that we have developed in a way that works with the GPU. And, and that is part of our scaling limitations. That being said, uh, there is research that we have done where the AI model is not just made of the attention layer, which scales quadratically, but also the MLP and MOE layer, which we have validated in some, uh, papers where it's able to hold a large percentage of the intelligence of the model.

So this would also imply that there can be alternatives to attention. And that is part of our work where we experiment with more efficient attention alternatives that scale linearly in cost. And that's more like a human because for humans, uh, we just burn the same amount of energy, give or take, for every second that we live.

Our energy consumption doesn't increase [00:21:00] quadratically. Um, and, and part of the ongoing research is how do we apply these alternatives? Um, the largest application of linear architecture would be currently the Qwen line of models. This is one of the strongest performing series of AI models that is on par with pretty much all the other open source models.

But three quarters of the AI architecture doesn't scale quadratically and scales linearly instead. What does that translate to? Cheaper inference costs, uh, and less HBM demand. And this will be an essential part of making AI accessible because, uh, what ... Uh, well, I talk about the pra- uh, like immediate practical sides of AI, uh, accessibility, like commercial side.

Like, more importantly, if we want AI to be the future driver of all the world economy, which a lot of people predict, we need to make it accessible to the majority of the world that doesn't [00:22:00] speak English and who will need to be able to afford AI at a much lower cost.

Mehmet: Mm-hmm.

Eugene: And if the minimum requirement for an AI model is a million dollar server because of its high bandwidth memory requirements, we are missing out on a large percentage of the population.

How do we reduce that into the memory requirements that can run on a consumer-grade GPU, for example? And that, that would make it accessible to anyone if they have a gaming laptop.

Mehmet: Very interesting points. I will try to make them in order, you know, as, as, um, a follow-up to, to what you just mentioned. Let me start with this.

So we know that because you talked about the GPU, so we know like, you know, NVIDIA and the others, mainly NVIDIA, uh, what we call it- their moat. It's not just the chips, it's the CUDA system, you know, like the, the ecosystem that they have created, you know. Um, they, they did fantastic in, in, in developer gravity.

Uh, what I [00:23:00] noticed recently, whenever they see, they see something cool, they go and they implement it. For example, OpenCL was one, uh, was one of the examples. So they, they, they, you know, adopted it somehow also as well. But realistically, what could weaken their dominance? And I know like once you weaken the dominance, so that means the pricing are going down.

So what could be an event that can cause such a result?

Eugene: So the fundamental shift will actually happen when users have a lot more choice that they can switch in between. And these elements are slowly falling into place. Uh, in some cases, um, these are things even beyond any major player's control. So take for example, um, today, AI models are getting better and better at, uh, writing GPU code.

And one, one of the things that I like to [00:24:00] demystify about AI is that one thing that is nice about AI models, uh, from a coding perspective or even a math perspective, is that if you try to re-implement the AI model code as a reference code from scratch, the average AI model is not even 1,000 lines of code.

This is the case for when you typically implement it in PyTorch. It's really not that complicated, um, once you know the fundamental theories. It's like, when we talk about AIs in layers, each layer is exactly the same code in a for loop. Where it gets complicated is that the simplified code is extremely inefficient to run.
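The "same code in a for loop" idea can be sketched in a few lines. The example below is a toy residual block in plain Python, not a transformer and not PyTorch, and every name in it is made up for illustration. It only shows the shape of a reference implementation: one layer function, applied once per set of weights.

```python
def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer(x, w_in, w_out):
    """One toy residual block: x + w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [a + b for a, b in zip(x, update)]


def forward(x, layer_params):
    """Run the same layer code once for each layer's weights."""
    for w_in, w_out in layer_params:
        x = layer(x, w_in, w_out)
    return x


identity = [[1.0, 0.0], [0.0, 1.0]]
print(forward([1.0, -2.0], [(identity, identity)] * 2))
```

A readable loop like this works as a mathematical reference, while the efficient GPU version of each step is where the code size grows.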

Uh-

Mehmet: Mm-hmm ...

Eugene: it's good as a mathematical reference, but it's inefficient, and you want it to run efficiently on GPU. So that's where each line of the code in that code base becomes another thousand lines of code. But AIs are actually getting better and better at writing that. [00:25:00] So having the mathematical reference of the original.

So this is whether it is to write it in CUDA, HIP, HIP being for AMD GPUs, or even Metal for Apple's, uh, line of laptops or any other form of compute. So what... As this increases, and it's already starting to happen, as people are trying to, like, optimize their AI models to run on their laptops or any mix of hardware, even on the CPU, the choices start to open up.

And that fundamental shift unlocks new hardware and new possibilities. The other fundamental shift is the intelligence of AI is collapsing down towards smaller and smaller models, for example. Uh, one way I jokingly do this benchmark is take the highest-end MacBook Pro for this year. I wouldn't call it cheap or consumer, but it's

Let's just use that as a [00:26:00] reference. What's the level of intelligence of the best AI model you can run on that laptop? Yesterday, uh, or last year, uh, it was, haha, ChatGPT-3. Now, it's at Sonnet levels of performance, like, like not the very best AI model, but one step down. Next year, potentially it will be the best AI model that you see today running on a laptop, a high-end laptop, uh, given day.

And, and as that keeps happening, what you ... The best AI models you see today could be running on any hardware you can choose. That choice helps you drive the, the market changes respectively. The third category, which is the wildcard category, new hardware players are hungry to jump in. We know about Cerebras, we know about Groq, which got acquired by NVIDIA.

We know about a [00:27:00] dozen startups that are trying to enter the market with a new chip that is highly specialized for AI. That could change the market. I would say even Apple, which if I'm, if I can put, put a mouth to their ear and be like, "Hey, you guys have one of the best AI chips in the market. But if you package it as a GPU, a PCIe card, and sell that, you will just ship volumes."

But Apple won't do that because they're a consumer company.

Mehmet: Right.

Eugene: But they can. So these things can fundamentally change the market as it provides users more options.

Mehmet: Uh, just to, you know, uh, give, uh, context for, uh, for the audience, like I, I did a test and it was on an old laptop, but it had a GP- I mean, it's not very old.

It's like the, I think the, uh, the M2, uh, chip, the Apple Mac M2. It, it has a GPU built in. And yeah, I was able to run a, one of the open source [00:28:00] models. It's not like as fast as when you interact with maybe ChatGPT or any of the other tools, but I mean, still it worked because, you know, I proved the point, yeah, like you can, you can run it.

Of course, like it was like, uh, kind of scrolling, you know, like, uh, very slow, but it worked at the end. You know, like, uh, I, I even did something more, more crazy and I found an even older machine that it happened to have a, um, uh, I think a, a, um, you know, one of the oldest GPUs, and still I managed to, to get it, although, like the CPU is like very old.

It's a Celeron, believe it or not. But, but I- Yeah. But, but to your point, you know, what I'm trying to, to, to, you know, emphasize on what you said, Eugene, is that, you know, like things are getting to, to the edge, as we call it, and I think this is, you know, something very interesting. Now, from your point of view, um, when, when-- Because you are at intersection between, you know, hardware optimization and, and model research, so what kind of, you [00:29:00] know, breakthroughs you think that they are currently underrated?

Because we know the market, the VCs, everyone are looking, you know, at the top guys. But maybe you are seeing a few things that are really interesting and you think like this will be major breakthroughs that are coming very soon, and we're gonna see something coming out of them.

Eugene: So I have two versions of this.

One is the pragmatic breakthrough, which is not really a breakthrough. It's just more of like, as we understand the technology, um, and we improve on it. The bottleneck in AI application today is not the technology itself, it's AI reliability.

Mehmet: Mm-hmm.

Eugene: And in my opinion, all the major labs are blindly chasing smarter and smarter models which are [00:30:00] not fixing reliability.

We probably know a friend or two that is really intelligent, but is not able to consistently, uh, uh, deliver their work on time. And that is essentially the problem with today's AI model. Right. The best AI model out of the box, unless it's specialized and fine-tuned, is not able to do, let's say, a DoorDash order or Uber order with more than 80% reliability.

And during that 20, 30% failure rate, it may order 10 pizzas instead of one. Right. Will you trust your credit card to someone who, who does that? And we're not... And these models, even the best one in open source, can do amazing feats like orbital physics calculation from Earth to Mars i- in, in seconds. I don't know how many of us can do that.

So there is that disconnect. It can do these major feats of intelligence, uh, [00:31:00] with, let's say, 10, 20% chance of success, but it's able to do that, but it's unable to do simple tasks with consistency. And when you ask commercial companies what do they want the AI model to do, most of the time they just want it to be consistent in a handful of tasks. And that is why more and more, we see more and more companies customizing their AI models, fine-tuning, RLing the models.

It- and these can be even small models like 27B models to achieve that 99% consistency, which is required to put things into production. And that has been one of the more silent pushes. Like we've seen, like I mentioned, like we've seen Shopify, Cursor, uh, and, and various major companies slowly do this behind the back.
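The gap between roughly 80% and 99% consistency can be made concrete by running the same task many times and counting successes. The sketch below is a minimal illustration under stated assumptions: `make_simulated_agent` is a hypothetical stand-in for a real model call, and the 0.80 and 0.995 success probabilities are illustrative values, not measurements from the conversation.

```python
import random
from typing import Callable


def success_rate(task: Callable[[], bool], trials: int) -> float:
    """Run `task` `trials` times and return the fraction of runs that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if task())
    return successes / trials


def make_simulated_agent(p_success: float, rng: random.Random) -> Callable[[], bool]:
    """Hypothetical stand-in for a real agent call: succeeds with probability p_success."""
    return lambda: rng.random() < p_success


def meets_production_bar(rate: float, bar: float = 0.99) -> bool:
    """Compare a measured success rate against the 99% bar mentioned in the episode."""
    return rate >= bar


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for label, p in [("general model", 0.80), ("fine-tuned model", 0.995)]:
    rate = success_rate(make_simulated_agent(p, rng), trials=10_000)
    print(f"{label}: {rate:.1%} success, meets bar: {meets_production_bar(rate)}")
```

In practice the simulated agent would be replaced by a function that calls the model on a fixed task and checks the result, for example whether exactly one pizza was ordered.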

Uh, another recent one is AT&T also fine-tune their own set of models [00:32:00] for internal processes. And that is one major shift because it breaks free the dependency for extremely large models. The other one, which is from a more research side-

Mehmet: Mm-hmm ...

Eugene: is that I feel that we are fundamentally inefficient in the way we train models.

Mehmet: Mm.

Eugene: And for good reason. You hear people talk about AI models being trained on trillions of tokens to gain a certain level of intelligence. We have proof, very strong proof that you do not need that level of tokens to train that level of intelligence. They're called human beings. We only, give or take, survive with, let's say, 100 years or two billion heartbeats.

So if I say we, for every second we live, we, it's one token, that's two billion tokens. Some people say we have [00:33:00] visual, audio and all that. Fine. Let's, let me be generous, 10 billion tokens. That is many orders of magnitude less tokens required to train an AI model. So very fundamentally speaking, we are doing something inefficient in the AI architecture, which we are resolving via brute force.
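The back-of-the-envelope comparison above can be checked in a few lines. This is a rough sketch: the 10-trillion-token training corpus is an assumption chosen only to stand in for "trillions", not a figure from the conversation. For reference, 100 years is about 3.2 billion seconds, the same order of magnitude as the two billion quoted.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 31.6 million seconds


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return how many powers of ten separate two positive quantities."""
    return math.log10(larger / smaller)


lifetime_seconds = 100 * SECONDS_PER_YEAR  # one "token" per second for 100 years

human_tokens_quoted = 2e9     # figure quoted in the episode
human_tokens_generous = 10e9  # the "generous" figure from the episode
corpus_tokens_assumed = 10e12  # ASSUMPTION: illustrative stand-in for "trillions"

for label, human in [
    ("quoted (2 billion)", human_tokens_quoted),
    ("generous (10 billion)", human_tokens_generous),
]:
    gap = orders_of_magnitude(corpus_tokens_assumed, human)
    print(f"{label}: about {gap:.1f} orders of magnitude fewer tokens")
```

Under these assumptions the gap comes out at roughly three to four orders of magnitude; a larger assumed corpus widens it accordingly.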

And to be fair, we really have brute force with the GPUs. So as innovations into how we do the data efficiency improve, there is those fundamental drops in efficiencies to train the models at scale, whether it be the amount of data required or even the size of the model. A good example that I point to that potentially highlights the inefficiency of text models.

If you have seen image and video models, they are based on a different architecture line than the text model, and they're typically 10 to 20 billion parameter model, uh, par- parameters, and yet [00:34:00] they can illustrate entire worlds. And yet at the same time, we sometimes struggle to get text models of the same size to be consistent in its text.

It's fr- fundamentally on how the architecture and how we train them differently. That is also part of our research to, like, how do we make AI models more efficient- Mm-hmm ... and reliable at that scale.

Mehmet: Right. Eugene, there's one angle I think, uh, you know, we didn't discuss. Now, we, we're talking about open source models, right?

And we know they are improving. We discussed this, uh, and you said, like, how they would be better by time. Now, similar to the era when the open source software appeared and, you know, there was this debate, uh, in the enterprise, should we adopt, you know, open source or should we go, like, at a, you know, traditional, uh, proprietary software to deploy the [00:35:00] enterprise?

Of course, the debate, one of them is that it's off the shelf. If something happens, we talk, we talk to the developers or the company who developed that software. They have to solve it out. In case of open source, so either you have to have someone who's, you know, paying kind of support, quote, unquote, like I'm talking even about, like, uh, people who might remember when we start to adopt in, in the enterprise, like things like Red Hat and Ubuntu and, and all these things.

So yeah, we need support. The, the other way people looked at it, "Ah, I can get people who understand and, you know, because it's open source, I can modify it myself." So where do you see this debate? Like, uh, or, or what you're noticing, um, enterprise are doing when, when it comes to, to, to, to open source, how they think about it, like the trust, the, the ability to, you know...

Do, do, do they have to train also people internally so they can go and train [00:36:00] these models or modify them?

Eugene: So, uh, for those who have done B2B deals, uh, at these enterprises, uh, some of us may be familiar with the term the neck, uh, the throat to choke. Yeah. Yeah, and that's essentially the role that we sometimes play, uh, for these companies.

Uh, we will be there helping support them directly, helping them to, hey, uh, ensure that the systems are all up and running for the open source model and meet their requirements, testing, et cetera. So that is something that Featherless plays as we do provide inference at scale, but it's something that pretty much any company that builds on top of open source models will play for the enterprises.

So we see that happening, that's for sure. Um, because new technology, people are scared of the unknown. They want someone they can talk to. But w-what I find rather surprising as well [00:37:00] is more and more companies are, uh, turning over for consistency and reliability. And I don't mean like the G- the model reliability, I mean consistency from their provider, um, from, uh, from the closed source models.

Mm-hmm. And I say this mostly because, um, for example, from the existing closed source providers, the models are constantly being updated and deprecated on a monthly or even yearly basis. Uh, I have some enterprise customers who got very fed up that, "Hey, I just supported your model and you changed your model next month, and now I have to update my entire code base and platform and readjust it."

And this happens monthly. And they'll be like, "Can we stop that cycle?" Because if you are familiar with enterprises, some of them prefer yearly cycles or even two yearly cycles, uh, how they treat their software development. And [00:38:00] the good thing about open models is if you do not want to upgrade, and you just download and want to use that one set of weights, it's forever the same.
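One way to make "forever the same" checkable is to record a checksum of the downloaded weights once and verify it before each deployment. The sketch below is a minimal illustration using Python's standard `hashlib`; the helper names and the idea of storing a pinned digest alongside the deployment are assumptions for illustration, not a description of how Featherless or any particular enterprise does it.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_unchanged(path: Path, pinned_digest: str) -> bool:
    """Return True if the file on disk still matches the digest recorded at download time."""
    return sha256_of_file(path) == pinned_digest.lower()
```

A deployment script could call `weights_unchanged` on each weights file and refuse to start if any digest differs from the pinned value.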

And the other aspect is like we've also seen enterprises now jumping on precisely because the current generation of open models have essentially caught up with Sonnet. Um, OpenAI themselves, in one of their recent reports, admitted that 60% of their customers do not use the best model. That means- Mm ... for 60% of their customers, they can be fully replaced by the current generation of open models.

And for all those who are very concerned about their code base, agentic code being used to train the next generation of OpenAI models or, uh, uh, et cetera, suddenly these open source models allow them to overcome their privacy fears, because I can deploy this on their premises [00:39:00] and they can air gap it if they want, and they never need to worry about their data being leaked or trained into future models.

Mehmet: Mm-hmm.

Eugene: The third one is within the consumer space. Some is because the API keeps changing. We have seen incidents where, where like, uh, where some of the closed model, the latest models degrade in intelligence when under heavy load or because they didn't like a certain way you use the model, like in through OpenClaw and stuff like that.

They specifically detect certain patterns and block your usage or bill you extra. And these things are frustrating to users because it, it's like, why are you dictating how we use the model at this level? I think the funniest one that was recently was because a certain line of agents was popular, the, uh, the Hermes agent, that they, that users were finding themselves when it, it was detected there was a Hermes, they were being billed extra.

[00:40:00] And a- and it was detected by just the text Hermes.md. And a part of me was laughing at it because it was like, are you telling me potentially some developer in that Hermès bag company is suddenly being billed extra because someone decided that Hermes.md is not something they want to support?

It's, it's just that kind of eccentricity that's like, "Hey, we just want something that's more stable and controllable," that I actually get customers coming in when it happens. So, uh, yeah. I, I don't know why it happens. Version control is a solved thing. Uh, it's not- Yeah ... an AI problem. Th- these billion-dollar to trillion-dollar companies should be able to do basic version control, yet they are not.

That is an issue on its own.

Mehmet: Yeah. Uh, two things I just want to quickly follow up. The agent Hermes is, is amazing, at least for me. You know, I've been playing with it for now a [00:41:00] couple of days. Um, I like it, that's a personal opinion, like, more than OpenClaw. Um, and the main thing is, you know, the ability to the, the, the better memory there.

The second thing, you know, which you kept repeating, and actually if people goes to the previous episode to this one, I had would be Helen Yu, and she talked a lot about that. Actually, she has a, a startup and all what she does, you know, with her team is about reliability of the AI, so I advise people to go and watch that.

Um, Eugene, as we almost come to an end, final maybe words you want to share with the audience, and of course, where people can get in touch.

Eugene: Yeah. Uh, feel free to reach out to me, uh, Eugene@Featherless.ai. Um, uh, also my socials are PicoCreator on Twitter. Um, I know what's this... Open source models are pretty much here.

Like, if you are dreaming of building [00:42:00] any application, you can already start doing so and start iterating from it. Yeah. And that is the big shift for the year. Like, it shockingly surprised me that even, let's say, the 27B or 31B, those small models that can run on laptop, laptops like this- Yeah ... like it's 128 gigs, can do entire applications just by prompting it.

We're not even talking about the top-of-the-line models like DeepSeek and et cetera. You can just build things. It's ex- it's an exciting world. Like, you don't need to know how to code. The... You can just start building things. And I explore-- I want to encourage people to explore that because that is the exciting future that we are in.

Mehmet: Right. Um, quick question for you, Eugene. If I want to connect my, uh, Hermes [00:43:00] agent with your API, will it ch- will it, um, choose the best one for me?

Eugene: So yeah, uh, we already have, uh, on our platform, uh, plans specifically for people who want to run Hermes agents or OpenClaw. You can even, uh, host your Hermes agent on our platform.

We provide one-click support for hosting these models. We also provide access to some of the latest models, uh, be it DeepSeek, GLM, uh, Qwen, that, uh, do a really good job with these agentic tasks. And yeah, we, we encourage people to, to use it.

Mehmet: And it's a flat pricing- I, I gonna try. Yeah. I gonna try it. I g- and I encourage people to try.

So for people who, um, doesn't know what we're talking about, guys, if you... It's not a hype, I'm telling you. Like some people they sh- they say, "Yeah, this agentic thing is a hype." It's not a hype. I'm thinking to do a video just for this, and it's not like just to spread the knowledge because the amount of, you know, [00:44:00] time that I was able to save just, you know, in the past month between...

Of course, now I'm shifting from OpenClaw to, to, to Hermes. It's mind-blowing, you know? And I, I advise people to go. And of course you would need something like Featherless because you probably will not get the best speed if you run it locally unless, uh, you know, you can afford these expensive GPUs.

Um, but, but yeah, like, uh, I advise everyone to do, to, to, to, to give it a try. Uh, Eugene, thank you very much. Really, you know, I like these discussions and I like, you know, anything which is on the frontier of, of, you know, anything AI. And I think, you know, um, everyone should, should, should learn these things because this is the future.

And this is how I end my episodes. This is for the audience. First of all, all the links to Eugene's social and the website, of course, they will be in the show notes if you're listening on your favorite podcasting app. If you're watching this on YouTube, they will be in the description. And if you just discovered us by luck, thank you for [00:45:00] passing by.

I hope you enjoyed. If you did so, give me a favor, just share it with as many people as you can and subscribe. And if you are one of the people who keeps coming again and again, thank you very much for your support, and thank you very much for what you're doing to the show. Des- despite, like, we changed some, uh, metadata in the show, still you...

we managed to stay in the Apple Top 200 podcast chart across multiple countries, but now in the technologies section after we shifted from entrepreneurship. So thank you very much for the support, and as I say always, stay tuned for a new episode very soon. Thank you. Bye-bye.
