LLMs: It’s Just Math, Not Magic
Introduction
You’ve shipped a chat feature. Demos are great… until traffic spikes, latency doubles, and your GPU bill makes finance knock on your Slack. Product asks for “the smarter model,” while you’re staring at a Grafana dashboard wondering if “smarter” means “slower,” “more expensive,” or “a new class of production incidents.” Meanwhile, the team is debating whether to fine-tune or switch providers altogether.
This is the point where the sci‑fi framing of Large Language Models (LLMs) stops helping. Here’s the reality that keeps systems predictable: an LLM is a big stack of matrix multiplications trained to guess the next token. That’s it. There’s no soul, no secret reasoning organ—just an enormous collection of adjustable numeric knobs (parameters) wired together as transformer blocks that learned patterns from text. If you understand that, you’ll make better calls on model size, latency, memory, and cost.
Let’s whiteboard this like seasoned engineers and strip it down to data flows, knobs, and trade‑offs you can ship.
Breaking Down the Core Concept
At its core, an LLM is a deep neural network that estimates: “Given the tokens I’ve seen so far, what token most likely comes next?” It doesn’t retrieve facts; it doesn’t think. It predicts. This next‑token prediction is repeated many times to form a full response.
Under the hood:
- Parameters as adjustable knobs: Millions to hundreds of billions of numbers (weights) tuned during training. Each weight is a tiny dial, and together the dials shape how the model responds to patterns in text.
- Transformer blocks: Stacked layers that repeatedly apply two main operations—self‑attention (to figure out which parts of the context matter for the next prediction) and feed‑forward networks (to transform information nonlinearly).
- Attention stacks: Self‑attention lets the model compute how much each token should “pay attention” to every earlier token in the context, so it can reuse long‑range context without forgetting earlier content.
- The training goal is narrow and simple: optimize those weights so the model gets better at next‑token prediction over massive text datasets.
This simplicity is empowering. Once you anchor on “predict next token,” you can reason clearly about memory, throughput, latency, parameter counts, base vs. aligned models, and why 8B vs. 70B has very different ops profiles.
How It Works: Step-by-Step
Let’s walk one full inference loop. We’ll avoid math symbols and talk data movement.
- Tokenize the text
  - Input text is split into subword tokens (e.g., “transfor”, “mer”, “s”).
  - Each token maps to an integer ID, and the mapping is reversible.
- Embed tokens
  - Each token ID becomes a vector (embedding). Think of it as a position in a numeric meaning space.
- Run through stacked transformer blocks
  - Self‑attention: “Given all previous tokens, which ones are relevant to predict the next one?” It computes attention scores, then blends token information accordingly. A key optimization here is the KV cache, which stores the Keys and Values of previous tokens so each step only computes them for the newest token.
  - Feed‑forward networks: An MLP applied to each token position transforms the attended representation, letting the model express richer patterns.
- Project to logits
  - A final linear layer projects the hidden state to a vector the size of the vocabulary. Each entry is a logit, the unnormalized score for a particular next token.
- Turn logits into a distribution and sample
  - Apply softmax (and possibly temperature/top‑p/top‑k) to get probabilities.
  - Select a token (greedy or sampled), append it to the context, and loop.
- Stream tokens to the client
  - The app can stream partial outputs as they are generated for snappier UX.
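The "logits to next token" step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular inference engine implements it: the function names, the 5-token vocabulary, and the logit values are all made up for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, k=None, p=None):
    """Keep the k most likely tokens, then the smallest prefix whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    if k is not None:
        ranked = ranked[:k]
    if p is not None:
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}  # renormalize the survivors

def sample_next(logits, temperature=1.0, k=None, p=None, rng=random):
    """Pick one token ID: greedy when temperature is 0, otherwise sampled."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = top_k_top_p(softmax(logits, temperature), k, p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids])[0]

# Hypothetical 5-token vocabulary with made-up logits
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_next(toy_logits, temperature=0))  # greedy: always token 0
print(sample_next(toy_logits, temperature=0.8, k=3, p=0.9))
```

Greedy decoding is the `temperature == 0` branch; everything else reshapes the distribution before a random draw, which is why the same prompt can produce different outputs.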
Here’s a visual map of those steps.
sequenceDiagram
participant App as Your App
participant Tok as Tokenizer
participant Emb as Embedding Layer
participant KV as KV Cache
participant Att as Attention Stack (Transformer Blocks)
participant FF as Feed-Forward (MLP)
participant Proj as Logit Projector
participant Samp as Sampler (Softmax/Top-p)
participant UX as Stream to UI
App->>Tok: Input text
Tok-->>App: Token IDs
loop For each generation step
App->>Emb: Token IDs
Emb-->>Att: Token embeddings
Att->>KV: Read+Write K/V (reuse past context)
Att-->>FF: Context-weighted representations
FF-->>Proj: Transformed hidden states
Proj-->>Samp: Logits (scores for all vocab tokens)
Samp-->>App: Next token ID
App-->>UX: Stream token to user
App->>Tok: Append token to context
end
The heavy lifting is the attention and MLP compute across all layers at each generation step. Everything else (tokenizing, sampling, streaming) is relatively cheap.
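To make the KV cache concrete, here is a toy single-head attention step in plain Python. It is a sketch under simplifying assumptions: one head, no learned projections (the query, key, and value are just the token vector itself), and a hypothetical `KVCache` class that is not taken from any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head attention for one query over all available keys/values."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Stores one key and one value vector per token already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Append only the new token's K/V, then attend over everything so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy vectors; a real layer would apply learned Q/K/V projections first.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(t, t, t) for t in tokens]

# Without a cache, step i would rebuild keys/values for tokens 0..i every time.
recomputed = [attend(t, tokens[: i + 1], tokens[: i + 1]) for i, t in enumerate(tokens)]
print(cached_outputs == recomputed)
```

Both paths give the same outputs; the cache just trades memory (one K and V per token, per layer) for not redoing that work on every step.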
The Whiteboard Analogy
Picture a hyper‑read library assistant. This assistant has scanned an unimaginable number of books and articles. They don’t “understand” in a human sense. Instead, they’re uncanny at telling you which word likely comes next based on patterns they’ve seen.
- If you say, “In a REST API, a 404 means…”, the assistant has seen that sequence a million times and knows “Not Found” is a strong completion.
- If you add more context (“…in the context of CDN edge caching”), they pivot and bias toward answers frequent in that niche.
The assistant’s “skill” is learned statistics. Inside the model, the adjustable knobs are tuned so that when it sees a particular pattern of tokens, certain follow‑ups are more probable. No magic. Just an industrial‑scale autocomplete with guardrails (if aligned) and impressive generalization.
Architectural Trade-offs & Comparisons
Choosing between model sizes is an engineering trade‑off: memory, latency, throughput, and quality. Here’s how to think about it.
| Choice | Complexity | Latency Profile | Memory Footprint | Throughput Considerations | Best-Fit Use Cases |
|---|---|---|---|---|---|
| Small (7–8B) | Lower operational complexity; can run on a single high‑end GPU or even CPU for dev | Fast first token; typically tens of tokens/sec on modest GPUs; good tail latency | ~16 GB in FP16; ~8 GB in 8‑bit; ~5–6 GB in 4‑bit (estimates vary) | Higher concurrency on a single node; cheaper autoscale | Chat UX, classification, structured extraction, short reasoning with strong prompts and RAG |
| Medium (13–34B) | Moderate complexity; often single GPU with enough VRAM or multi‑GPU for headroom | Slower than 8B but usually acceptable for enterprise chat | ~26–70+ GB FP16; quantization helps significantly | Lower concurrency per node; careful batching needed | Higher‑quality summarization, coding hints, more robust multilingual |
| Large (65–70B) | Higher complexity; likely needs multi‑GPU, model sharding, and tuned schedulers | Noticeably higher latency without aggressive KV caching/batching | ~140 GB FP16; aggressive quantization and tensor/PP sharding required | Throughput can be good with batching but higher per‑request cost | Hard problems: nuanced reasoning, complex code assist, long‑context synthesis |
Parameter count isn’t the whole story. Data quality, training recipe, context length, and alignment all matter. But parameter count correlates with capacity and inference cost.
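The weight-memory figures in the table follow from simple arithmetic: parameter count times bytes per parameter. A rough sketch, where the function name and the precision table are illustrative; real 4‑bit formats also store scales and other metadata, which is why quantized footprints land somewhat above this floor.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions, precision="fp16"):
    """Weights only: parameter count times bytes per parameter, in decimal GB."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (8, 13, 34, 70):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

This counts weights only. Activations, the KV cache, and framework overhead come on top.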
Why This Matters to Your Stack
Operationally, next‑token prediction drives everything you care about:
- Memory: Larger models store more weights and produce larger KV caches per request. That impacts how many concurrent chats you can serve on a GPU.
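- Latency: Every token passes through every transformer block. More parameters means more compute and more weight data to read per token, so each token takes longer. Sharding across GPUs can offset that, while batching mostly improves throughput rather than per‑request latency.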
- Cost: Bigger models burn more GPU time per token. Multiplying by user traffic turns “neat” into “ouch” quickly.
- Alignment vs. Base: Base models are raw next‑token predictors. Aligned models (instruction‑tuned, RLHF) are trained to follow instructions safely with more predictable formats. That affects prompt length, guardrails, and post‑processing you must build.
Understanding these basics lets you pick the right model class and size, set realistic SLAs, and allocate infrastructure (autoscaling policies, caching, prompt budgets) with eyes open.
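The memory point can be turned into a back-of-the-envelope concurrency estimate. The sketch below assumes an 8B-class model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache), a 24 GB GPU, and a 2 GB reserve; all of these are illustrative inputs, so substitute the values from your model's config and your hardware.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """One key and one value vector per layer per KV head for each token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_concurrent_requests(gpu_gb, weights_gb, tokens_per_request, per_token_bytes, reserve_gb=2.0):
    """How many full-length requests fit after weights and a safety reserve."""
    free_bytes = (gpu_gb - weights_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (tokens_per_request * per_token_bytes))

# Assumed shape for an 8B-class model with grouped-query attention.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # bytes of KV cache per token
print(max_concurrent_requests(gpu_gb=24, weights_gb=16, tokens_per_request=4096,
                              per_token_bytes=per_token))
```

The estimate is pessimistic because it reserves the full context for every request; paged or shared-prefix caches in real serving stacks usually do better.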
Production Implementation & Use Cases
Where this shows up in real systems:
- RAG copilots: Use a fast 8–13B model with retrieval for most tasks; failover or “escalate” to 70B for ambiguous or critical questions.
- Code assistants: Medium/large aligned models for quality; cache frequent completions; run small models for classification/extraction around the edges.
- Contact center summarization: 8B aligned model with strict system prompts; low temperature; streaming to keep agents unblocked.
- Data cleaning and extraction: Base model fine‑tuned on your schema, or aligned model with ultra‑tight prompts and JSON mode.
- Safety & compliance: Aligned models reduce risk of unsafe outputs; still enforce content filters and policy checks server‑side.
Tactics that matter in production:
- KV cache reuse and prompt truncation to control tail latency.
- Token streaming for perceived responsiveness.
- Quantization and tensor parallelism to make big models financially viable.
- Request routing: small‑model default, big‑model on demand.
- Observability: tokens in/out, time‑to‑first‑token (TTFT), tokens/sec, and saturation of KV memory per GPU.
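The two latency metrics in that list, TTFT and tokens/sec, can be measured by wrapping whatever token iterator your client exposes. A minimal sketch: `measure_stream` and `fake_stream` are hypothetical helpers, and the injectable `clock` parameter exists only to make the function easy to test.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report TTFT, decode rate, and token count."""
    start = clock()
    first_at = last_at = None
    count = 0
    for _ in token_stream:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return {"ttft_s": None, "tokens_per_s": 0.0, "tokens": 0}
    decode_time = last_at - first_at
    # Rate over the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"ttft_s": first_at - start, "tokens_per_s": rate, "tokens": count}

def fake_stream(n, delay_s=0.01):
    """Stand-in for a streaming model response."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"

print(measure_stream(fake_stream(5)))
```

Keeping TTFT separate from the decode rate matters because they respond to different fixes: prompt length and queueing drive the first, model size and batching drive the second.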
Hands-On: Minimal Python Inference Simulation
We’ll simulate a toy next‑token predictor with a token‑to‑weights dictionary. This is not a neural net—it’s just a frequency lookup to mirror the “it’s just prediction” concept. The spot where we do a simple lookup is where a real LLM would run massive attention+MLP math.
# Minimal next-token simulation using frequency-based weights
# Purpose: show "next-token prediction" without any neural net.
# Heavy lifting in real LLMs (attention+MLP) is replaced here by a simple lookup.
import math
import random

# Start-of-sequence marker: seeds generation and is stripped from the output
BOS = "<s>"

# A tiny "vocabulary"
V = [BOS, "the", "a", "system", "cat", "runs", "fast", "reliable", "model", "is", "good", "bad", ".", ","]

# Raw frequency-like weights conditioned on the previous token
WEIGHTS = {
    BOS: {"the": 60, "a": 30, "system": 20},
    "the": {"system": 25, "cat": 15, "model": 20, "is": 10},
    "a": {"model": 20, "system": 10, "cat": 10},
    "system": {"is": 40, "runs": 15, "reliable": 12},
    "model": {"is": 50, "runs": 5},
    "is": {"good": 30, "bad": 10},
    "runs": {"fast": 25, ".": 10},
    "fast": {".": 15, ",": 5},
    "reliable": {".": 12},
    "good": {".": 20, ",": 3},
    "bad": {".": 12},
    "cat": {"runs": 12, "is": 5},
}

def softmax_from_counts(counts, temperature=1.0):
    # Convert counts to logits; use log(count) to compress extreme differences
    t = max(temperature, 1e-6)
    logits = {tok: (math.log(c) if c > 0 else -1e9) / t for tok, c in counts.items()}
    maxlog = max(logits.values())
    exp = {tok: math.exp(v - maxlog) for tok, v in logits.items()}
    Z = sum(exp.values()) or 1.0
    return {tok: v / Z for tok, v in exp.items()}

def next_distribution(prev_tok, temperature=1.0):
    counts = WEIGHTS.get(prev_tok, {})
    if not counts:
        # Unknown context (e.g. after ","): back off to a flat distribution
        counts = {tok: 1 for tok in V if tok != BOS}
    return softmax_from_counts(counts, temperature)

def sample(probs):
    # Inverse-CDF sampling: walk the cumulative mass until it passes r
    r = random.random()
    cum = 0.0
    for tok, p in sorted(probs.items(), key=lambda x: -x[1]):
        cum += p
        if r <= cum:
            return tok
    return list(probs.keys())[-1]  # guard against floating-point rounding

def generate(max_tokens=15, temperature=0.8):
    seq = [BOS]
    for _ in range(max_tokens):
        prev = seq[-1]
        probs = next_distribution(prev, temperature)
        # In a real LLM, this is where the transformer stack computes logits.
        # We replace that heavy neural compute with a simple lookup/sampling.
        nxt = sample(probs)
        seq.append(nxt)
        if nxt == ".":
            break
    return " ".join(t for t in seq if t != BOS)

if __name__ == "__main__":
    random.seed(3)
    print(generate(max_tokens=20, temperature=0.7))
What to notice:
- The “model” only predicts the next token using a conditional distribution.
- Temperature changes how peaky vs. diverse the output is.
- Our toy uses dictionary lookups; a real LLM calculates those distributions via attention stacks and large matrix multiplies on GPUs.
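To see the temperature effect in numbers, the sketch below recomputes the toy distribution for one context at three temperatures. It is self-contained and uses the counts that follow “is” in the toy table; the helper name is made up for this example.

```python
import math

def probs_at_temperature(counts, temperature):
    """Softmax over log(count) / temperature, i.e. proportional to count ** (1 / temperature)."""
    logits = {tok: math.log(c) / temperature for tok, c in counts.items()}
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

counts = {"good": 30, "bad": 10}  # the toy model's counts after "is"
for t in (0.5, 1.0, 2.0):
    p = probs_at_temperature(counts, t)
    print(t, {tok: round(v, 3) for tok, v in p.items()})
```

At temperature 1.0 the split is the raw 3:1 ratio; lowering the temperature pushes mass toward “good”, and raising it flattens the distribution toward 50/50.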
Choosing Between 8B and 70B: A Practical Sizing Playbook
Here’s the exact calculus I use when picking a model for an API.
- Start with latency and concurrency constraints
  - If you need TTFT < 300 ms and p95 latency < 1.5 s under concurrency, an 8B (quantized) on a single GPU can be viable with streaming.
  - A 70B will demand heavier infra (multi‑GPU, sharding) and professional‑grade batching to hit similar SLOs.
- Measure content difficulty before jumping sizes
  - Many enterprise tasks (classification, extraction, short Q&A) are solved by a tuned 8–13B + RAG.
  - Escalate to 70B for multi‑step reasoning, nuanced instructions, or long‑context synthesis when quality deltas are obvious in evals.
- Count memory twice: weights and KV cache
  - Weights: ~16 GB for 8B in FP16; ~140 GB for 70B in FP16. Quantization can cut this 2–4x.
  - KV cache: grows with sequence length and layer count. This silently caps how many concurrent requests a GPU can handle.
- Price the steady state
  - Tokens/sec per dollar matters more than one‑off benchmarks. Measure on your prompts with streaming enabled and typical output lengths.
  - Mix a fast default (8B) with “boost to 70B” on hard prompts. Route based on heuristics or eval confidence.
- Prefer aligned models for user‑facing flows
  - They follow instructions better and require less prompt scaffolding, reducing total tokens and retry loops, which is often faster and cheaper in aggregate.
A common win: 8B aligned model for 80–90% of traffic, automatic “escalate to 70B” path for the rest. Your p95 stays sane, and you only pay big‑model costs when needed.
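A quick way to sanity-check that pattern is to price the blend. The sketch below uses hypothetical relative costs (small model = 1 unit per request, large = 10 units) and a 15% escalation rate; these are placeholders to show the arithmetic, not measured prices.

```python
def blended_cost(requests, escalation_rate, small_cost, large_cost, escalated_pays_both=True):
    """Total cost when a fraction of requests escalates from the small to the large model."""
    escalated = requests * escalation_rate
    handled_small = requests - escalated
    cost = handled_small * small_cost + escalated * large_cost
    if escalated_pays_both:
        # An escalated request already spent a small-model call before rerouting.
        cost += escalated * small_cost
    return cost

# Hypothetical relative costs per request: small = 1 unit, large = 10 units.
mixed = blended_cost(1000, escalation_rate=0.15, small_cost=1.0, large_cost=10.0)
all_large = blended_cost(1000, escalation_rate=1.0, small_cost=1.0, large_cost=10.0,
                         escalated_pays_both=False)
print(mixed, all_large, round(1 - mixed / all_large, 3))
```

The `escalated_pays_both` flag models a try-small-then-retry router; if you classify difficulty up front and route once, set it to `False`.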
Base vs. Aligned Models: Operational Boundaries
Base = raw next‑token predictor. Aligned = trained to follow instructions safely (instruction‑tuned, RLHF, DPO, etc.). Know the boundaries to avoid surprises.
| Dimension | Base Model (e.g., “foundational”) | Aligned/Instruction-Tuned Model |
|---|---|---|
| Input Contract | Expects continuation of text; less sensitive to system prompts. | Expects instructions with roles; follows system/user formatting more reliably. |
| Output Style | May produce continuations, quotes, or off‑format text. | Tends to answer directly, often with safer, concise formatting. |
| Safety & Guardrails | Minimal by default; higher risk of unsafe continuations. | Better refusals, safer content defaults, but still needs external policy checks. |
| Determinism Under Constraints | More variance; prompt engineering required to constrain. | More predictable with low temperature and structured prompts. |
| Tool/Function Calling | Needs explicit fine‑tuning or exemplars to follow strict JSON. | Often supports structured output modes; easier function calling. |
| Best Fit | Pretraining for specialized fine‑tunes; open‑ended generation. | User‑facing chat, workflows needing instruction following and safe style. |
| Not Guaranteed | Polite formatting, safe phrasing, stable JSON. | Perfect adherence to strict schemas or domain facts without RAG. |
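Because neither class guarantees strict schema adherence, it is worth validating structured output server-side and retrying on failure. A minimal sketch using only the standard library; `call_with_validation` and the stubbed model replies are hypothetical, and a real system would likely validate against a full schema rather than a set of required keys.

```python
import json

def parse_structured(raw, required_keys):
    """Return the parsed object if it is a JSON object with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None
    return data

def call_with_validation(generate, prompt, required_keys, max_attempts=3):
    """Call a model function until its output validates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        parsed = parse_structured(generate(prompt), required_keys)
        if parsed is not None:
            return parsed, attempt
    raise ValueError(f"no valid JSON after {max_attempts} attempts")

# Hypothetical model stub: chatty on the first call, clean JSON on the second.
replies = iter(['Sure! Here is the JSON: {"name": "a"}', '{"name": "a", "priority": 2}'])
result, attempts = call_with_validation(lambda _prompt: next(replies), "extract",
                                        {"name", "priority"})
print(result, attempts)
```

Tracking the attempt count per model is a cheap signal for the retry-loop cost mentioned earlier.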
Observability, Caching, and Latency Control in Production
To keep throughput healthy and bills from spiking, pay attention to:
- Metrics: TTFT, tokens/sec generation, prompt+completion token counts, rejection/timeout rates, GPU memory (weights + KV cache) utilization.
- Caching: Prompt‑prefix caches and response caches for repeated system prompts and boilerplate instructions.
- Batching and scheduling: Group similar requests to improve GPU utilization; enforce max output tokens to cap tail latencies.
- Prompt budgets: Truncate context aggressively; use retrieval to stay under context windows.
- Backpressure: Shed load or degrade to smaller models when GPUs saturate.
- AB testing and evals: Use task‑specific evals to justify 70B escalations; track cost/quality trade‑offs per cohort.
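As one concrete instance of a prompt budget, the sketch below keeps the system prompt and then as many of the most recent turns as fit. The whitespace-based `count_tokens` is a deliberately crude stand-in for a real tokenizer, and the function names are made up for illustration.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())

def fit_to_budget(system_prompt, turns, budget):
    """Keep the system prompt plus the most recent turns that fit in the budget."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the budget")
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order

history = ["first question about billing", "long answer " * 20,
           "follow-up question", "short answer"]
print(fit_to_budget("You are concise.", history, budget=10))
```

Stopping at the first turn that does not fit keeps the retained history contiguous, which tends to be safer than skipping a long turn and keeping older ones around it.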
C# Production Snippet: Routing to the Right Model Class
We’ll use an OpenAI‑compatible ChatClient to route between an 8B aligned model for speed and a 70B aligned model for hard prompts. Many self‑hosted stacks (vLLM, TGI, Ollama) expose an OpenAI‑compatible API; this snippet assumes such a gateway. Focus is on model selection, streaming, and measuring latency.
// NuGet: dotnet add package OpenAI
// This example follows the official OpenAI .NET SDK (2.x) pattern with a ChatClient.
// Member names differ between SDK versions, so check them against the version you install.
// It targets any OpenAI-compatible endpoint (self-hosted or provider) exposing /v1/chat/completions.
using System;
using System.ClientModel;
using System.Diagnostics;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenAI;
using OpenAI.Chat;

public static class LlmRouter
{
    // These model IDs are examples frequently used with OpenAI-compatible gateways.
    private const string SmallModel = "meta-llama-3.1-8b-instruct";
    private const string LargeModel = "meta-llama-3.1-70b-instruct";

    // Point to your gateway: could be local vLLM/TGI or a cloud proxy.
    private static OpenAIClient CreateClient(string apiKey, string baseUrl) =>
        new OpenAIClient(
            new ApiKeyCredential(apiKey),
            new OpenAIClientOptions
            {
                Endpoint = new Uri(baseUrl) // e.g., http://localhost:8000/v1 or your provider
            });

    public static async Task<string> AskAsync(string userPrompt, int estimatedDifficulty, CancellationToken ct = default)
    {
        // Heuristic: route easy prompts to 8B, hard ones to 70B.
        // You can replace with eval scores, uncertainty signals, or business tier.
        var model = estimatedDifficulty <= 3 ? SmallModel : LargeModel;

        var client = CreateClient(
            apiKey: Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? "sk-local",
            baseUrl: Environment.GetEnvironmentVariable("OPENAI_BASE_URL") ?? "http://localhost:8000/v1"
        );

        // Create a chat client for the chosen model.
        var chat = client.GetChatClient(model);

        // Typed as ChatMessage[] so the system and user messages share a common element type.
        var messages = new ChatMessage[]
        {
            new SystemChatMessage("You are a concise, safe enterprise assistant. Answer with JSON when asked for structured data."),
            new UserChatMessage(userPrompt)
        };

        var options = new ChatCompletionOptions
        {
            Temperature = 0.2f,        // Favor deterministic, instruction-following outputs
            MaxOutputTokenCount = 300, // Cap tail latency
            TopP = 0.9f,
            // You can also set response_format / tools if your gateway supports it
        };

        var sw = Stopwatch.StartNew();
        long? ttftMs = null;
        var content = new StringBuilder();

        // Stream tokens for responsive UX.
        await foreach (StreamingChatCompletionUpdate update in chat.CompleteChatStreamingAsync(messages, options, ct))
        {
            foreach (ChatMessageContentPart part in update.ContentUpdate)
            {
                if (string.IsNullOrEmpty(part.Text)) continue;
                ttftMs ??= sw.ElapsedMilliseconds; // time to first token
                content.Append(part.Text);
                // In a real API: flush partial tokens to the client via SSE/WebSocket.
            }
        }

        sw.Stop();
        Console.WriteLine($"Model: {model} | TTFT: {ttftMs} ms | total: {sw.ElapsedMilliseconds} ms | chars: {content.Length}");
        return content.ToString();
    }
}
// Usage:
// var answer = await LlmRouter.AskAsync("Summarize our on-call policy in 5 bullets.", estimatedDifficulty: 2);
// var deep = await LlmRouter.AskAsync("Refactor this distributed lock design and justify trade-offs.", estimatedDifficulty: 7);
What this demonstrates:
- Model selection is just a string—make it a first‑class input to your router.
- Low temperature + token caps keep latency predictable.
- Streaming gives users responsiveness even on larger models.
- You can plug any OpenAI‑compatible backend here (self‑hosted or provider) to compare 8B vs. 70B in your environment.
Pitfalls, Edge Cases, and Final Takeaways
Common pitfalls:
- Over‑prompting: Giant system prompts devour context and slow every call. Cache and shrink them.
- Misreading parameter counts: 70B ≠ automatic truth. If your task is retrieval‑heavy, invest in RAG quality before upgrading model size.
- Ignoring KV cache memory: You’ll hit strange OOMs under concurrency if you don’t budget for it.
- Treating base models like aligned: You’ll get weird continuations and brittle formats; use the right class for the job.
Final takeaways:
- LLMs are next‑token predictors—industrial autocomplete with lots of adjustable knobs. No magic, just math.
- Model size is a systems decision: memory, latency, throughput, and cost versus quality uplift.
- Base vs. aligned dictates how much scaffolding your app needs.
- A practical, cost‑aware pattern: 8B for the bulk, escalate to 70B for the hard stuff, measure relentlessly.
If you remember only one thing: the endpoint isn’t thinking; it’s scoring the next token. Design your stack—with routing, caching, and alignment—in service of that simple, powerful fact.