
Getting Real About Modern LLMs, GPUs, and Agents

How to actually understand today’s AI models well enough to build serious stuff with them.


1. Why bother understanding this at all?

If you’re a developer or founder, you don’t need to reinvent the math of deep learning. But you do need a solid mental model of:

  • what modern LLMs really are,

  • why they’re trained on GPUs/TPUs,

  • how context windows, tokenization, and KV cache shape what’s possible,

  • what “SFT”, “RLHF”, and “agents” actually mean in practice.

With that, you can:

  • choose the right models and infrastructure,

  • design prompts and tools that cooperate with the model instead of fighting it,

  • avoid the common traps that waste tokens, time, and effort.

This post is that kind of primer: opinionated, practical, and aimed at people who want to build.


2. What a modern LLM really is

At its core, a large language model is:

A giant function that takes a sequence of tokens and predicts the probability distribution of the next token.

Three key pieces:

  1. Tokenization

    • Text is split into tokens (subwords, word pieces, sometimes whole words or symbols).

    • Each token is mapped to an integer ID.

    • Different tokenizers → different token counts → different cost and context usage.

  2. Embeddings and Transformer blocks

    • Each token ID → high-dimensional vector (embedding).

    • These vectors pass through a stack of Transformer blocks.

    • Each block has:

      • Self-attention: each token looks at other tokens and decides who matters.

      • Feed-forward network: a small neural net applied to each token’s representation.

      • Residual connections + normalization: keep things stable and trainable.

  3. Output layer

    • Final hidden state at each position → logits over the vocabulary.

    • After softmax, you get a probability distribution over the next token.

    • During generation, you sample repeatedly from that distribution.

The model doesn’t “think in words”. It operates entirely on vectors, but is trained in such a way that those vectors encode useful structure about language, code, and the world.
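The output layer above can be sketched in a few lines. This is a toy illustration with a made-up four-token vocabulary and invented logits, not any real model's code; it only shows how logits become a probability distribution via softmax.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and made-up logits for one next-token position.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 0.1, -1.0]

probs = softmax(logits)
# The highest-logit token gets the highest probability.
best = vocab[probs.index(max(probs))]
```

During generation, the model either picks `best` (greedy decoding) or samples from `probs`, appends the chosen token, and repeats.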


3. How LLMs are trained (end to end)

Training has three main phases: pretraining, post-training, and inference-time steering.

3.1 Pretraining: turning text into a base model

Pretraining is where the model learns most of its knowledge and raw capabilities.

Step 1: Data

  • Collect large corpora: web pages, books, documentation, code repositories, etc.

  • Clean and filter:

    • remove obvious junk and duplicates,

    • filter by language,

    • control for some types of toxic content.

  • Tokenize into integer sequences.

Step 2: Objective

The core pretraining task is next-token prediction:

  • Given: a sequence of tokens [x1, x2, ..., xT].

  • Task: predict token x2 from [x1], x3 from [x1, x2], and so on.

  • The loss function (cross-entropy) punishes the model when it assigns low probability to the actual next token.

Intuitively:

The model gets better by being less surprised by real text.

Over trillions of tokens, the easiest way to reduce surprise is to internalize grammar, common patterns, factual structure, and even rough reasoning shortcuts.
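The "less surprised" intuition is exactly what cross-entropy measures. Here is a minimal sketch with two hypothetical models: one confident in the true next token, one not. The probability values are made up for illustration.

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy at one position: -log p(actual next token)."""
    return -math.log(probs[target_id])

# Two hypothetical models predicting over a 3-token vocabulary,
# where the true next token has id 0.
confident = [0.9, 0.05, 0.05]
unsure    = [0.2, 0.4, 0.4]

loss_confident = next_token_loss(confident, target_id=0)
loss_unsure = next_token_loss(unsure, target_id=0)
# Higher probability on the true token → lower loss (less surprise).
```

Training pushes the weights so that, averaged over the whole corpus, this loss goes down.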

Step 3: Optimization

  • Run forward pass → compute predictions and loss.

  • Use backpropagation to compute gradients.

  • Use an optimizer (e.g. AdamW) to tweak weights slightly.

  • Repeat this at insane scale across GPU/TPU clusters.

At the end, you have a base model: great at modeling text, but not yet a polite assistant.
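The forward → gradient → update loop can be shown on a deliberately tiny toy: a one-parameter model predicting a binary "next token", trained with plain SGD instead of AdamW and an analytic gradient instead of backprop. Everything here is invented for illustration; real training differs in every dimension of scale.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy "corpus": a binary token stream where token 1 appears 70% of the time.
targets = [1] * 7 + [0] * 3

# One-parameter model: p(next token = 1) = sigmoid(w).
w = 0.0
lr = 0.5
for step in range(200):
    p = sigmoid(w)
    # Gradient of the average cross-entropy loss w.r.t. w is (p - y), averaged.
    grad = sum(p - y for y in targets) / len(targets)
    w -= lr * grad  # plain SGD step; real training uses AdamW over billions of weights

p = sigmoid(w)  # converges toward the empirical frequency, 0.7
```

The model ends up "less surprised" by the data: its predicted probability matches the observed token frequency, which is the same mechanism at LLM scale, just with trillions of tokens and billions of parameters.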

3.2 Why GPUs/TPUs instead of CPUs

Training is dominated by a small set of operations:

  • large matrix multiplications (matmul) for almost every layer,

  • simple elementwise operations (nonlinearities, normalization),

  • repeated over and over on huge tensors.

CPUs are designed for:

  • a small number of powerful, flexible cores,

  • low-latency, branching-heavy workloads (OS, databases, web servers).

GPUs/TPUs are designed for:

  • thousands of smaller, simpler cores,

  • massive parallelism,

  • very high memory bandwidth,

  • specialized matrix-multiply units (tensor cores).

If your job is essentially “multiply big matrices and do the same arithmetic on millions of numbers”, GPUs/TPUs will outperform CPUs by orders of magnitude in speed and energy efficiency.

Short version:

Pretraining LLMs is a giant dense linear algebra problem. GPUs/TPUs are built for that; CPUs are not.

3.3 Post-training: from raw model to assistant

Pretraining gives you a model that behaves like smart autocomplete. To turn it into a useful assistant, there are two major post-training steps.

Supervised fine-tuning (SFT)

  • Collect (instruction → good response) pairs.

  • Train the model again on this dataset.

  • Same next-token loss, but now:

    • inputs look like user prompts/questions,

    • outputs are curated, high-quality answers.

This teaches the model to:

  • treat the latest user message as a task to respond to,

  • answer clearly instead of just “continuing the text”,

  • follow patterns like “answer step-by-step” or “output valid JSON”.

This produces the familiar instruct/chat variants.

Preference optimization (RLHF, DPO, etc.)

Now you optimize for style and safety.

  • Collect data where humans compare two model responses (A vs B) and pick the better one.

  • Train a preference or reward model that predicts which response humans prefer.

  • Adjust the base model to produce responses that score better under this preference model.

Effects:

  • Reduces toxic and clearly harmful behavior.

  • Encourages helpful, polite, cautious answers.

  • Sometimes adds more hedging and verbosity than you personally might like.

3.4 Inference: how generation actually works

At inference time:

  1. Prompt is tokenized.

  2. Tokens are embedded and run through the Transformer stack.

  3. The model outputs a probability distribution over next tokens.

  4. A decoding strategy chooses the next token:

    • Greedy: always pick the most probable.

    • Top-k: sample from the k most likely.

    • Top-p (nucleus): sample from the smallest set whose total probability ≥ p.

    • Temperature: make the distribution sharper or flatter.

  5. Append that token and repeat.

Two key knobs for you as a user:

  • Sampling strategy → controls creativity vs determinism.

  • Stopping criteria → when to end (max length, special tokens, etc.).
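The temperature and top-p knobs above compose naturally in one sampling function. This is a self-contained sketch over toy logits, not any particular API's implementation.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature + nucleus (top-p) sampling over raw logits."""
    # Temperature: divide logits before softmax (low T → sharper, high T → flatter).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of tokens whose cumulative probability ≥ top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the nucleus and sample.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

logits = [3.0, 1.0, 0.5, -2.0]
# Very low temperature + tiny top_p behaves like greedy decoding.
greedy_pick = sample_next(logits, temperature=0.01, top_p=0.01)
```

With `temperature=1.0` and `top_p=0.9` you get varied but mostly-sensible picks; pushing temperature down collapses toward the single most probable token.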


4. Context windows, tokenization, and KV cache

You never interact with “the model in the abstract”. You interact with a specific model that has:

  • a tokenizer,

  • a maximum context window, and

  • performance tricks like KV cache.

All three matter a lot in practice.

4.1 Tokenization

Tokenization decides:

  • how many tokens your text turns into,

  • which sequences are “cheap” or “expensive”,

  • how efficiently code, emojis, non-English languages, etc. are represented.

Why you should care:

  • Token count drives latency and cost (on cloud APIs).

  • Different models can be dramatically more efficient for certain kinds of text (e.g. code-oriented tokenizers make code cheaper to process).
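The "different tokenizers → different token counts" point can be made with two deliberately crude stand-ins: a character-level splitter and a whitespace splitter. Real tokenizers use subword schemes like BPE, so these are not realistic, but the cost difference they illustrate is.

```python
def char_tokens(text):
    """Crude character-level 'tokenizer': one token per character."""
    return list(text)

def word_tokens(text):
    """Crude word-level 'tokenizer': one token per whitespace-separated chunk."""
    return text.split()

text = "def add(a, b): return a + b"
n_char = len(char_tokens(text))  # many tokens
n_word = len(word_tokens(text))  # far fewer for the same text
```

Same string, very different token counts, hence very different cost and context usage. A code-oriented subword tokenizer sits somewhere in between, but skews efficient on source code.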

4.2 Context window

The context window is the maximum number of tokens the model can see in one go. It has to fit:

  • system prompt,

  • user messages,

  • chat history,

  • tools’ schemas and logs (if any),

  • your actual task content (code, docs, etc.),

  • and sometimes the output itself.

Implications:

  • Long conversations and large docs don’t fit in raw form.

  • You need strategies: summarization, retrieval, trimming.

  • For RAG, you’re always balancing “instructions + query + retrieved chunks” against a hard limit.

4.3 KV cache: making long generations feasible

Without caching, generating each new token would mean re-running attention over the entire prefix, so total generation cost grows quadratically with sequence length. The KV cache is the standard trick to avoid that.

During generation:

  • For each token and each layer, the model computes keys (K) and values (V).

  • These are stored in a KV cache.

  • For each new token, the model:

    • reuses Ks and Vs for all previous tokens from the cache,

    • computes new Q/K/V for the current token.

This keeps the cost per new token reasonable and enables efficient streaming.

4.4 Long context is not perfect memory

Even with huge context windows (e.g. 128k tokens):

  • Attention is soft: tokens distribute their focus over many others.

  • Models tend to focus more on recent or salient tokens.

  • Training data and objectives often emphasize shorter ranges, so quality can degrade toward the far end of the window.

Practical strategies:

  • Don’t just paste entire books or giant logs.

  • Summarize the core: goals, key facts, definitions.

  • Repeat crucial early information near where it’s actually needed.

  • For retrieval-based systems, feed the model small, relevant chunks instead of huge blobs.


5. How alignment and prompts interact

You can think of a modern assistant model as the sum of four layers:

  1. Pretraining → knowledge and general patterns.

  2. SFT → instruction-following behavior.

  3. Preference optimization → style and safety.

  4. System prompt → per-session steering.

When the model’s behavior feels wrong:

  • If it doesn’t understand the task at all → capability issue (model too small or wrong domain).

  • If it understands but ignores structure (e.g. you asked for JSON, it gives prose) → prompt design or SFT limitations.

  • If it refuses too much or over-hedges → RLHF/alignment layer.

  • If it usually behaves but slips on details → fix with better system prompts, clearer instructions, or examples.

System prompts are powerful but bounded:

  • You can set role, style, formatting rules, tool policies.

  • You can’t give it knowledge it doesn’t have.

  • You can’t fully override strong safety constraints.


6. Agents: LLMs that can act in a loop

“Agentic AI” sounds mysterious, but in practice an agent is just:

A loop where the model can plan → act (use tools) → observe → update its plan, instead of answering once and stopping.

6.1 Basic agent loop

A minimal agent loop looks like:

  1. Maintain a state:

    • user goal,

    • relevant context (files, docs, notes),

    • history of actions and observations.

  2. Call the model with the current state.

  3. The model returns either:

    • a tool call (e.g. read a file, run a command, fetch a URL), or

    • a final answer.

  4. If it’s a tool call:

    • the environment runs the tool,

    • the result is appended to the state,

    • loop continues.

  5. If it’s done or stuck:

    • the loop exits with a final answer or a failure report.

All popular “agent frameworks” are just structured versions of this idea.
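The five steps above fit in a short function. Everything here is a stand-in: `fake_model` is a scripted replacement for a real LLM call, and `read_file` is a hypothetical tool; the point is the loop structure, including the `max_steps` guardrail discussed below.

```python
def run_agent(model, tools, goal, max_steps=10):
    """Minimal agent loop: call model, dispatch tool calls, stop on a final answer."""
    state = [{"role": "user", "content": goal}]
    for _ in range(max_steps):  # guardrail: never loop forever
        action = model(state)
        if action["type"] == "final":
            return action["content"]
        # Tool call: run the tool and append the observation to the state.
        result = tools[action["tool"]](**action["args"])
        state.append({"role": "tool", "name": action["tool"], "content": result})
    return "Gave up: step budget exhausted."

# Scripted stand-in for the model, just to exercise the control flow:
# first call asks for a tool, second call returns a final answer.
def fake_model(state):
    if len(state) == 1:
        return {"type": "tool", "tool": "read_file", "args": {"path": "README.md"}}
    return {"type": "final", "content": "Done: read 1 file."}

tools = {"read_file": lambda path: f"<contents of {path}>"}
answer = run_agent(fake_model, tools, "Summarize the repo")
```

Swap `fake_model` for a real LLM call that emits structured tool requests, and `tools` for real functions, and you have the skeleton that every agent framework elaborates on.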

6.2 Tools define what an agent can do

Examples of tools:

  • Code agent:

    • read_file(path)

    • search_files(pattern)

    • apply_patch(path, diff)

    • run_tests()

  • Research agent:

    • web_search(query)

    • load_pdf(path/url)

    • search_notes(query)

    • summarize_chunk(text)

  • Ops agent:

    • run_cmd(command)

    • deploy_service(config)

    • etc.

More powerful tools bring more capability, and with it more risk.

6.3 Guardrails for agents

Because agents can operate in loops and take actions, you need guardrails:

  • max_steps per task (prevent infinite loops).

  • Clear termination conditions.

  • A way to give up gracefully:

    • “Here’s what I tried, what failed, and what’s left for a human.”

  • Tool policies:

    • Prefer patches over full rewrites.

    • Don’t run destructive commands.

    • Don’t fake tool outputs.

Agent behavior benefits from a good system prompt:

  • Tell it to check tool results before proceeding.

  • Tell it to ask for confirmation before irreversible actions.

  • Tell it how to report failures.

6.4 Coding agents vs research agents

Coding agent (e.g. in an IDE):

  • Goal: operate over a repo to implement features, fix bugs, refactor.

  • Typical loop:

    • inspect files → propose change → apply patch → run tests → inspect failures → iterate.

  • Needs to be conservative:

    • limit scope,

    • avoid unwanted refactors,

    • keep changes reviewable.

Research/analysis agent (e.g. in a web UI):

  • Goal: answer a complex question using multiple sources.

  • Typical loop:

    • clarify question → plan steps → fetch sources → summarize chunks → build hierarchical notes → synthesize answer with citations.

  • Works well with long context and many iterations, especially on local setups where you’re not billed per token.


7. How to get more out of LLMs in practice

With these concepts in mind, here are some practical principles.

7.1 Work with the model’s structure, not against it

  • Be aware of the token budget.

  • Put the most important requirements and constraints near the end of the prompt where they’re fresh.

  • Use summaries to keep long-term goals and context alive.

  • For large tasks, split into sub-tasks and use an agent-style loop instead of one giant prompt.

7.2 Use the right model for the job

  • Bigger, more capable models for:

    • difficult reasoning,

    • complex refactors,

    • high-stakes answers.

  • Smaller/local models for:

    • boilerplate code,

    • simple Q&A,

    • summarization,

    • experimentation where cost would otherwise explode.

7.3 When you move to agents, start conservative

  • Start with tools that can’t destroy much:

    • read-only operations,

    • patch-based writes,

    • whitelisted commands.

  • Add more powerful tools only when:

    • you have tests,

    • you trust the agent’s behavior,

    • and you understand the failure modes.

7.4 Iterate on system prompts

  • Treat your system prompts as code:

    • version them,

    • refine over time,

    • test them with real tasks.

  • Make them explicit about:

    • reasoning style,

    • output format,

    • safety constraints,

    • interaction with tools.


8. Closing thoughts

You don’t need to derive every formula behind Transformers to use LLMs effectively. But you do need a mental model like this:

  • LLMs are next-token predictors trained with huge matrix multiplications on GPUs.

  • Context windows, tokenization, and KV cache define the practical envelope of what you can do.

  • Post-training (SFT + alignment) and system prompts decide how the model behaves for you.

  • Agents are just loops around the model with tools, not magic.

Once you see things this way, you can make better decisions about:

  • which models to use,

  • how to structure your prompts and contexts,

  • when to introduce agents and tools,

  • and how to balance cloud vs local inference.

That’s where the real leverage is: understanding enough of the internals to design workflows that cooperate with the model, then letting it do what it’s good at while you focus on building things that matter.
