Notes · Dissecting Real Systems

The Widest Box Is the Bug

A flamegraph turns thousands of stack samples into one picture where the slow code is, almost literally, the biggest thing on the screen.

Yesudeep Jose Mangalapilly

Published 7 days ago · Updated 2 days ago · 9 min read

profiling, performance, flamegraphs, observability, dissecting-systems

Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.

— Rob Pike, Notes on Programming in C (1989), Rule 2

Cite this

APA

Mangalapilly, Y. J. (2026, June). The Widest Box Is the Bug. Saṃhitā Notes. https://yesudeep.com/blog/the-widest-box-is-the-bug/

BibTeX

@online{mangalapilly2026the,
  author  = {Yesudeep Jose Mangalapilly},
  title   = {The Widest Box Is the Bug},
  journal = {Sa\d{m}hit\=a Notes},
  year    = {2026},
  month   = {June},
  url     = {https://yesudeep.com/blog/the-widest-box-is-the-bug/},
  urldate = {2026-07-01},
}

Plain

Yesudeep Jose Mangalapilly. “The Widest Box Is the Bug.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/the-widest-box-is-the-bug/.

RIS

TY  - ELEC
AU  - Mangalapilly, Yesudeep Jose
TI  - The Widest Box Is the Bug
T2  - Saṃhitā Notes
PY  - 2026
UR  - https://yesudeep.com/blog/the-widest-box-is-the-bug/
Y2  - 2026-07-01
ER  -

How to read the one visualization that makes a profile legible. By the end you'll know what a flamegraph's axes actually mean (one of them is not what everyone assumes), why width is the only thing you need to look at, and how to find the slow path in seconds — with an interactive one on this page you can click into.

You have a program that's too slow, and you have a profiler's output: tens of thousands of stack samples, each a snapshot of "here's the call stack at the instant I looked." Buried in that pile is the answer to where the time went, but as a list of numbers it's unreadable. The flamegraph, which Brendan Gregg released in December 2011, is the picture that makes the pile legible — and its central virtue is almost comically simple.

In a flamegraph, the slow code is the widest box. Finding the bottleneck becomes finding the biggest rectangle.

What a flamegraph is

A profiler that samples doesn't trace every call; it interrupts the program many times a second and records the current call stack. After a while you have a huge bag of stacks. Some call paths show up in thousands of samples (the program spent a lot of time there); some show up in a handful (it barely visited). A flamegraph aggregates that bag into one shape.

Each box is a stack frame — a function. A box sits on top of the function that called it, so height is call depth: main at the bottom, the deepest helper at the top. And the width of a box is the thing that matters: it's proportional to how often that frame appeared in the samples. Gregg's own description is exact:

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is[,] the more often it was present in the stacks.

Imagine taking thousands of candid photos of a kitchen during dinner service, at random moments. Later you sort the photos by who's in them. If the pastry chef appears in most of the pictures, they were busy most of the night — not because you timed them, but because they kept showing up. A flamegraph is that stack of photos: the wider a cook's column, the more often they were caught working.
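To make the aggregation concrete, here is a minimal Python sketch of turning a bag of stacks into one tree of counted frames. The `Frame` class, the `aggregate` function, and the four sample stacks are all invented for illustration; real profilers emit stacks in their own formats, and this is not how any particular tool stores them.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One box: a function reached through one specific chain of callers."""
    name: str
    samples: int = 0  # samples in which this frame appeared (its width)
    children: dict[str, "Frame"] = field(default_factory=dict)


def aggregate(stacks: list[list[str]]) -> Frame:
    """Merge root-first stacks into a tree, adding one per frame per sample."""
    root = Frame("all")
    for stack in stacks:
        root.samples += 1
        node = root
        for name in stack:
            # Same name under the same parent means the same box.
            node = node.children.setdefault(name, Frame(name))
            node.samples += 1
    return root


# Hypothetical samples, each listed root-first (caller before callee).
stacks = [
    ["main", "parse"],
    ["main", "compile", "codegen"],
    ["main", "compile", "codegen"],
    ["main", "compile", "typecheck"],
]
tree = aggregate(stacks)
main = tree.children["main"]
compile_ = main.children["compile"]
print(main.samples, compile_.samples, compile_.children["codegen"].samples)
# prints: 4 3 2
```

Every sample passes through main, three of the four pass through compile, and two of those end in codegen, so the boxes would be drawn at 100%, 75%, and 50% of the full width.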

The axis everyone misreads

Here is the single most common misunderstanding, and clearing it up is half of learning to read these. The x-axis is not time. It looks like a timeline — a wide bar stretching left to right — but it isn't one. The horizontal direction is just the sample population sorted alphabetically by function name, so that identical call paths sit adjacent and merge into one wide box.

Sampling profiler: a profiler that interrupts a program at a fixed frequency (say 99 times a second) and records the call stack, rather than instrumenting every function entry and exit. Cheap enough to run in production; statistical rather than exact.

Why it matters: you cannot read left-to-right as "first this happened, then that." A frame on the far left isn't earlier than one on the right; it's just earlier in the alphabet. What you can read is width as a proportion of all samples (and so, in a CPU profile, of CPU time), and the top edge as what was actually running. As Gregg puts it, "the top edge shows what is on-CPU." Everything below a top-edge frame is its ancestry: the chain of callers that led there.

One disambiguation before the rule hardens: some tools draw a sibling picture called a flame chart — Chrome DevTools' performance panel and speedscope's time-order view among them — where the x-axis is time, one stack per instant, nothing merged. Same flames, different geometry. The tell is merging: if identical call paths collapse into one wide box, width is share of samples and the axis is not time; if the same function appears again and again along the row, you're looking at a timeline.
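A small Python sketch shows why the merged picture cannot carry time: once identical paths are counted and sorted by name, the order the samples arrived in is gone. The `collapse` helper and the samples are hypothetical. The one-line-per-path output only resembles the semicolon-separated folded-stack text that the FlameGraph toolkit consumes; it is not a substitute for that toolkit's own scripts.

```python
from collections import Counter


def collapse(stacks):
    """Merge identical root-first stacks into 'a;b;c count' lines, sorted by path."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{path} {n}" for path, n in sorted(counts.items())]


# Hypothetical samples in the order they were taken.
in_time_order = [
    ["main", "read"],
    ["main", "parse"],
    ["main", "read"],
    ["main", "parse"],
    ["main", "read"],
]
shuffled = list(reversed(in_time_order))  # the same samples, different order

print(collapse(in_time_order))
# prints: ['main;parse 2', 'main;read 3']
print(collapse(in_time_order) == collapse(shuffled))
# prints: True
```

parse lands to the left of read because of spelling, not because it ran first. A flame chart would keep the five samples in their original order instead of merging them.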

Reading one: find the plateau

So the technique is almost embarrassingly direct. Scan for the widest boxes, especially wide boxes near the top — a wide plateau at the top edge is a function that was on-CPU in a large fraction of samples, which is to say, your hot spot. Narrow spikes are cheap, however deep they go. You are hunting for breadth, not depth.

Here is a flamegraph of a build — analysis versus execution versus linking, the structure the build system series has been about. The widths are illustrative, not a measurement of a specific run, but the shape is true to where build time generally goes. Click a box to zoom into its subtree; click a breadcrumb to zoom back out.

A build profile as a flamegraph. Width is share of total time; height is call depth. The widest top-edge box — codegen inside compile — is where the time goes. Click to zoom; illustrative widths.

Look at it the way the technique prescribes and the answer jumps out: codegen, nested under compile under execution, is the widest box on the top edge. It doesn't matter that analysis sits to its left or that link is a sibling — width is the whole story, and codegen is the widest. If you had an afternoon to make this build faster, you'd spend it there, and the picture told you so in one glance.
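The widest-first scan is simple enough to write down. The Python sketch below assumes a profile already merged into paths with counts, where each count is the number of samples in which that path's last frame was the one running. The numbers are made up to echo the illustrative build above; they are not measurements, and `top_edge` is a hypothetical helper.

```python
# Hypothetical folded profile: root-first path -> samples with exactly that stack.
profile = {
    "build;analysis;load": 6,
    "build;analysis;resolve": 9,
    "build;execution;compile;parse": 10,
    "build;execution;compile;codegen": 45,
    "build;execution;test": 12,
    "build;link": 18,
}


def top_edge(profile):
    """Rank paths by on-CPU share: samples where the path's last frame was running."""
    total = sum(profile.values())
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return [(path, count / total) for path, count in ranked]


hottest_path, share = top_edge(profile)[0]
print(hottest_path.split(";")[-1], f"{share:.0%}")
# prints: codegen 45%
```

The path is four frames deep, but depth never enters the ranking. Only the count does.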

You don't read a flamegraph left to right. You read it widest-first.

Why it's a tree, and why that's the right shape

A flamegraph is exactly a tree of frames, each carrying its own sample count, where a parent's total is its own samples plus all its children's. That structure is not incidental — it's why the layout works. To size every box you compute, for each node, the sum of itself and everything beneath it, then divide the parent's width among its children in proportion. Computing "the total weight of every subtree" in one pass over a tree is a textbook fold (a catamorphism), which is exactly how the renderer on this page builds it.

The flamegraph above is drawn by the site's own charting toolkit: a samhita Tree of frames, totaled with a single tree.fold, packed into rectangles, and made zoomable by a small state machine. The same fold that sizes a flamegraph also sizes an icicle or a profiling timeline.

The same tree, the same fold, gives you the close cousins for free: flip it and the root sits at the bottom (the classic flame shape); leave the root on top and it's an icicle; lay the children along a time axis instead of packing them and it's a profiling timeline. One data structure, several views, because once you have a weighted tree, where the weight sits is the only question any of these pictures answers.
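For readers who want to see the fold, here is a minimal Python sketch of the idea. It is not the site's renderer: `Node`, `fold`, `size`, and `layout` are names made up for this example, and the tree is a toy. The fold visits each node once and returns a sized tree; the layout then splits each parent's width among its children in proportion to their totals.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    self_samples: int = 0  # samples where this frame itself was running
    children: list["Node"] = field(default_factory=list)


def fold(node, combine):
    """Catamorphism: fold the children first, then combine them with the node."""
    return combine(node, [fold(child, combine) for child in node.children])


def size(node, sized_children):
    """Combine step: a subtree's total is its own samples plus its children's totals."""
    total = node.self_samples + sum(child[1] for child in sized_children)
    return (node.name, total, sized_children)


def layout(sized, x=0.0, width=1.0, depth=0):
    """Yield (name, depth, x, width) for every box, children sorted by name."""
    name, total, children = sized
    yield name, depth, x, width
    for child in sorted(children, key=lambda c: c[0]):
        child_width = width * child[1] / total if total else 0.0
        yield from layout(child, x, child_width, depth + 1)
        x += child_width


# Hypothetical tree: main ran for 1 sample itself, the rest is in its callees.
tree = Node("main", 1, [
    Node("compile", 1, [Node("codegen", 5), Node("parse", 1)]),
    Node("link", 2),
])
sized = fold(tree, size)
for name, depth, x, width in layout(sized):
    print(f"{'  ' * depth}{name}: x={x:.2f} width={width:.2f}")
```

With these numbers compile gets about 0.7 of the full width and codegen about 0.5. The children of main add up to 0.9 rather than 1.0 because main's own sample takes no child box. Swapping `layout` for a different placement rule is all it would take to draw an icicle from the same sized tree.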

The honest limits

A flamegraph is a lens, not an oracle. Four things to keep in mind.

Lockstep / skew: sample on the same beat as a repeating task and you photograph it at the same point in its cycle every time. A job that runs every 10 ms, sampled every 10 ms, is caught in every frame or in none, so its width reads 100% or 0%, and both are wrong. Sampling at 99 Hz drifts the camera off the beat, so the samples sweep the whole cycle fairly.
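The lockstep effect is easy to reproduce in a toy model. The Python sketch below assumes an idealized job that is busy for the first 2 ms of every 10 ms (a true share of 20%) and a sampler with a perfectly regular clock and no jitter. The `observed_share` function and its parameters are invented for this illustration; real timers are noisier than this.

```python
from fractions import Fraction


def observed_share(sample_hz, seconds=10, offset_ms=Fraction(1)):
    """Fraction of samples that catch a job busy for the first 2 ms of every 10 ms."""
    period_ms, busy_ms = Fraction(10), Fraction(2)
    interval_ms = Fraction(1000, sample_hz)  # exact, so no float drift
    n = sample_hz * seconds
    hits = sum(
        1
        for i in range(n)
        if (offset_ms + i * interval_ms) % period_ms < busy_ms
    )
    return hits / n


print(observed_share(100))                         # sampler lands inside the busy window
print(observed_share(100, offset_ms=Fraction(5)))  # sampler lands outside it
print(round(observed_share(99), 3))                # sampler drifts across the cycle
# prints: 1.0
# prints: 0.0
# prints: 0.192
```

At 100 Hz the answer depends entirely on where the first sample happened to fall. At 99 Hz the estimate lands close to the true 20% whatever the starting offset.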

It's statistical. Sampling is fair in expectation — a function collects samples in proportion to the CPU time it uses, however briefly each visit runs — but the estimates have variance (a rare-but-real cost needs many samples to show its true width), and work synchronized with the sample clock skews the count. That lockstep hazard is why the convention is 99 Hz rather than 100 — in Gregg's words, "to avoid accidentally sampling in lockstep with some periodic activity, which would produce skewed results." Widths are good estimates, not audited totals.
A hot function split across many callers hides. Frames merge only when their entire ancestry matches, so a function called from forty sites (memcpy, an allocator, a serializer) appears as forty narrow boxes that never look wide, and the widest-box scan misses the biggest total in the profile. The remedy is the inverted (bottom-up) flamegraph, which merges leaves first, or a per-function aggregate table. When the widest box looks innocent, invert the graph before concluding the profile is flat.

The hidden hot function. Top-down, the widest box is layout and memcpy is six narrow slivers scattered across the graph. Press inverted to merge frames leaf-first — the slivers fuse into one box over half the total width, and each inverted box's children are its callers. Same samples, different merge.
Off-CPU time is invisible by default. A standard CPU flamegraph only shows time spent running. A program slow because it's waiting — on a lock, a disk, a network call — shows up as nothing at all, because no CPU samples land there. That needs a separate "off-CPU" flamegraph.
It shows where, never why. The widest box tells you which function to look at; it can't tell you that the function is slow because of a bad algorithm, a cache miss, or a pathological input. The flamegraph ends the search and hands you to the debugger.
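The inverted merge from the second limit above can be sketched in a few lines of Python. The profile is hypothetical, shaped like the memcpy example: six narrow call sites that only add up once the paths are reversed. `invert` and `root_widths` are names invented here, not part of any profiler's API.

```python
from collections import Counter

# Hypothetical folded profile: root-first path -> samples with exactly that stack.
profile = {
    "main;layout;measure": 30,
    "main;layout;memcpy": 10,
    "main;render;memcpy": 12,
    "main;parse;memcpy": 11,
    "main;serialize;memcpy": 9,
    "main;parse;tokenize": 8,
    "main;io;read;memcpy": 10,
    "main;style;memcpy": 10,
}


def invert(profile):
    """Reverse every path so the leaf comes first, then merge identical paths."""
    inverted = Counter()
    for path, count in profile.items():
        inverted[";".join(reversed(path.split(";")))] += count
    return inverted


def root_widths(folded):
    """Width of each first-level box: total samples of the paths that start with it."""
    widths = Counter()
    for path, count in folded.items():
        widths[path.split(";")[0]] += count
    return widths


print(max(profile.items(), key=lambda item: item[1]))
# prints: ('main;layout;measure', 30)
print(root_widths(invert(profile)).most_common(2))
# prints: [('memcpy', 62), ('measure', 30)]
```

Top-down, the widest top-edge box is measure at 30 of 100 samples and no memcpy box exceeds 12. Leaf-first, memcpy is one box holding 62. The inverted paths also keep the callers: `memcpy;layout;main` still carries its 10 samples, which is how an inverted box's children come to be its callers.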

The whole idea, in one box

Strip everything away and a flamegraph is a single trick: take an unreadable pile of stack samples, fold it into a weighted tree, and draw the tree so that weight becomes width. After that, optimization starts where your eye already went — the biggest rectangle on the screen.

That's why it's become the default picture of performance work. Not because it's clever to draw, but because it moves the answer from "buried in tens of thousands of numbers" to "the widest thing you're looking at." The slow code stops hiding.

Lessons

A flamegraph aggregates sampled stacks into a tree of frames: height is call depth, width is share of time.
The x-axis is not time — it's the sample population sorted alphabetically so identical paths merge. Don't read it left to right. (A flame chart — DevTools' timeline view — is the sibling whose x-axis is time; the tell is whether identical paths merge.)
Read widest-first: the widest box on the top edge is the hot spot. Depth is cheap; breadth is the bottleneck. When a hot function is split across many call sites, no box looks wide — invert the graph before trusting a flat-looking profile.
It's a weighted tree sized by a fold — which is why icicles and profiling timelines are the same structure under different layouts.
Limits: sampling is statistical; off-CPU (waiting) time is invisible without a separate graph; it shows where, never why.

References

Brendan Gregg. “Flame Graphs.” — the original, with the canonical reading rules
“The FlameGraph toolkit.” — and Linux perf — generating them from real profiles
“Catamorphisms.” — the fold that totals every subtree in one pass
“The Build Is Proportional to the Change.” — the system the example profiles
