I really appreciate this type of article. I feel like a lot of knowledge in LLM training and inference is locked inside the heads of practitioners. Similar to compiler engineers before.
To work in LLM training/inference you’re expected to know this stuff but to know this stuff you need to be working in the space.
radq, 2 hours ago
Thank you for the kind words. We will write and share more of these.
rjzzleep, 2 hours ago
Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices as witnessed in this post. But, if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size, or the model size. So then you have two things to worry about:
First, how do you know exactly what the optimal VRAM assignment per model and per context size is, which currently seems to be based purely on experience? Second, how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io
This is only tangentially related to the blog post here, but the title is picked in such a way that I couldn't help but put the shameless plug here. When he wrote popping the bubble, I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.
Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.
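For the "optimal VRAM per model, per context size" question, a back-of-the-envelope estimate is a common starting point before tuning by experience. The sketch below assumes a standard transformer whose footprint is dominated by weights plus KV cache, and it ignores activations, fragmentation, and framework overhead. The model shape in the example is hypothetical, and the function names are made up for illustration. In vLLM, the fraction of VRAM the engine may claim is capped with the `gpu_memory_utilization` setting, so an estimate like this could inform that value.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """KV cache size for one sequence: a key and a value vector
    per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


def estimate_vram_gib(param_count, num_layers, num_kv_heads, head_dim,
                      context_tokens, concurrent_seqs=1, bytes_per_value=2):
    """Rough lower bound: weights plus KV cache, nothing else."""
    weights = param_count * bytes_per_value
    cache = concurrent_seqs * kv_cache_bytes(
        num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value
    )
    return (weights + cache) / 2**30


# Hypothetical 8B-parameter model in fp16: 32 layers, 8 KV heads, head_dim 128.
estimate = estimate_vram_gib(8e9, 32, 8, 128, context_tokens=8192)
print(f"{estimate:.1f} GiB for one 8192-token sequence")
```

Dividing the estimate by the card's total memory gives a first guess at the fraction to grant one model instance. Real deployments need headroom on top of it.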
gardnr, 2 hours ago
Different bubble than the one I was hoping for.
This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852
fragmede, 5 minutes ago
That's a terrible name for that and I can't say that Hanlon's razor applies. The bubble that everyone's knowingly referring to is the stock market collapsing like in 2001. To choose a headline that can be mistaken for that just to get clicks is shit. You could've called it a GPU-CPU pipeline stall, but no, you intentionally chose a name that would be confused for something else just to get clicks?
nl, 2 hours ago
> you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.
This is true, but I've never heard anyone refer to this as a GPU bubble before.
I think most people hear "GPU bubble" and think of a financial bubble of some kind.
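The quoted definition can be made concrete with a toy timeline. This is a minimal sketch, not a profiler: it assumes a single in-order GPU queue in which each kernel starts as soon as the GPU is free and the CPU has launched it. The function name and the timings are hypothetical.

```python
def gpu_idle_fraction(kernels):
    """kernels: list of (cpu_launch_time, gpu_duration) in launch order.

    A kernel starts once the GPU is free AND the CPU has launched it.
    Returns the fraction of the total timeline the GPU spent idle.
    """
    gpu_free_at = 0.0
    busy = 0.0
    for launch_time, duration in kernels:
        start = max(gpu_free_at, launch_time)  # gap before `start` is a bubble
        gpu_free_at = start + duration
        busy += duration
    return 1.0 - busy / gpu_free_at if gpu_free_at else 0.0


# CPU needs 3 ms to prepare each launch; each kernel runs for 1 ms.
cpu_bound = [(3.0, 1.0), (6.0, 1.0), (9.0, 1.0)]
# Same kernels, but the CPU has queued all of them up front.
queued_ahead = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

print(gpu_idle_fraction(cpu_bound), gpu_idle_fraction(queued_ahead))
```

In the first case the GPU has work available in principle but waits on the CPU between kernels. In the second the same work runs back to back.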
SCdF, 2 hours ago
It appears to be a real term? https://docs.vulkan.org/tutorial/latest/Synchronization/Asyn...
Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:
kibibu, 2 hours ago
"bubble" used to be used a lot more when talking about very deep pipelines, eg Pentium 4 depth.
tux3, 16 minutes ago
Or in the case of my poor Verilog, even very short pipelines :(
unknown, 2 hours ago
[deleted]
spaqin, 44 minutes ago
Pretty sure that would be "[GPU performance] bottlenecked [by the CPU]" in most common terms.
_zoltan_, 1 hour ago
while the title is misleading, when reading GPU profiling data, we do call these bubbles - where the GPU _could_ do something, but it's idle.
any time your GPU is idle = you are losing $$$ = your TCO is going up. you don't want that.
vkazanov, 2 hours ago
I saw it in literature on cpu pipelines in quotes, never without.
IshKebab, 1 hour ago
I've never seen it in quotes, but yeah it is a very common term in pipelined CPUs.
cma, 2 hours ago
It's very common to call it a GPU bubble in gamedev, though not strictly for CPU induced bubbles.
rusk, 2 hours ago
The term I would use would be “underutilised”
barries11, 2 hours ago
"stall" is the best term I can think of as in "pipeline stall".
Better term, anyone?
_zoltan_, 1 hour ago
it's not stalled, as that would imply that it waits for something, which is not necessarily the case with bubbles. most often it shows lack of proper pipelining or wrong pipeline dependencies (pipeline A waits for pipeline B, pipeline C waits for pipeline B, while pipeline B waits for an event X, now you've just made all three pipelines stalled on event X - not good).
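The dependency example in parentheses can be sketched as a small graph walk. The pipeline names A, B, C and the event X come from the comment. The `waits_on` mapping and the `blocked_on` helper are hypothetical, and real schedulers track far more state than this.

```python
def blocked_on(event, waits_on):
    """Return every pipeline that directly or transitively waits on `event`.

    waits_on maps a pipeline name to the set of things it waits for.
    """
    blocked = set()
    frontier = [event]
    while frontier:
        target = frontier.pop()
        for pipeline, deps in waits_on.items():
            if target in deps and pipeline not in blocked:
                blocked.add(pipeline)
                frontier.append(pipeline)  # its waiters are blocked too
    return blocked


# A waits for B, C waits for B, B waits for event X.
waits_on = {"A": {"B"}, "C": {"B"}, "B": {"X"}}
stalled = blocked_on("X", waits_on)
print(sorted(stalled))
```

Only B names X directly, yet all three pipelines end up waiting on it, which is the situation the comment warns about.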
rusk, 1 hour ago
When an engine stalls, the implication is that the chain reaction that drives it is failing - I don’t think that is the case with a GPU as it will quite happily sit there drawing watts til you give it things. In systems nomenclature the inverse term for bubble is utilisation. This or that link or node is using x% of its capacity. Indeed, if you monitor your GPU with nvidia-smi you will see that very term in the instrumentation.
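A small sketch of reading that utilisation figure from a script, using nvidia-smi's `--query-gpu=utilization.gpu` query with CSV output. It assumes every GPU reports a plain integer percentage (some devices may print a non-numeric placeholder, which this parser does not handle), and the sample string is made up. Note that the figure is the share of the sampling period in which at least one kernel was running, so short bubbles between kernels can hide behind a high reading.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output):
    """Parse one integer percentage per line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_utilization():
    """Run nvidia-smi; needs an NVIDIA driver installed on this machine."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


# Made-up output for a two-GPU machine, so the parser can be tried without a GPU.
sample = "37\n0\n"
print(parse_utilization(sample))
```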
nnevatie, 2 hours ago
Yes, the title seems off - I also thought I was going to be reading about the AI/pricing bubble.
I love the brand name, Moondream