Data classes for Structured Data
In C#, you reach for a record or class as soon as a dictionary becomes unwieldy. Python developers often default to passing dict objects everywhere, which works until you lose track of which keys exist, mistype a field name, or want IDE autocompletion. Python's @dataclass decorator gives you a typed class with far less boilerplate than writing one by hand.
This lesson compares dataclasses against plain dicts and full classes, and shows how to structure the data types you'll use constantly in AI applications: message objects, request/response models, and configuration containers.
The Problem with Plain Dicts
# Common anti-pattern: passing dicts everywhere
def process_response(response: dict) -> dict:
    # What keys does response have? Only the creator knows.
    # What type is response["tokens"]? No IDE can tell you.
    return {
        "text": response["content"][0]["text"],  # KeyError waiting to happen
        "tokens": response["usage"]["output_tokens"],
        "model": response.get("model", "unknown"),
    }


# Caller can't know what to pass without reading the function
result = process_response({
    "content": [{"text": "Hello"}],
    "usage": {"input_tokens": 10, "output_tokens": 20},
})
print(result["text"])  # works today, breaks when keys change
The problems compound in AI systems: responses get passed through multiple functions, their shapes evolve as you add fields, and typos in key names only fail at runtime.
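To make the typo failure mode concrete, here is a minimal sketch. The summarize helper and the response shape are hypothetical, made up only for illustration; they do not come from any real API.

```python
# Hypothetical response dict; the shape is invented for this example.
response = {"usage": {"input_tokens": 10, "output_tokens": 20}}


def summarize(resp: dict) -> dict:
    usage = resp["usage"]
    return {
        # Typo: "output_tokes" instead of "output_tokens".
        # .get() hides the mistake by silently returning the default.
        "output": usage.get("output_tokes", 0),
        "input": usage["input_tokens"],
    }


summary = summarize(response)
print(summary)  # {'output': 0, 'input': 10}

# Direct indexing at least fails loudly, but only when this line runs.
try:
    response["usage"]["output_tokes"]
except KeyError as exc:
    print(f"KeyError: {exc}")
```

Neither failure is visible to an editor or type checker, because a plain dict carries no information about which keys it should contain.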
Dataclasses: The Solution
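A dataclass automatically generates __init__, __repr__, and __eq__ methods from annotated field definitions. It is Python's closest equivalent of a C# record, with one difference: dataclass instances are mutable unless you pass frozen=True. The C# records below are followed by their Python counterparts.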
// C#
public record ChatMessage(
    string Role,
    string Content,
    DateTime Timestamp
);

public record LLMResponse(
    string Text,
    int InputTokens,
    int OutputTokens,
    string Model,
    string FinishReason
);

# Python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChatMessage:
    role: str
    content: str
    timestamp: datetime = field(
        default_factory=datetime.now
    )


@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    model: str
    finish_reason: str
What You Get for Free
from dataclasses import dataclass


@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    model: str
    finish_reason: str = "end_turn"  # default value


# __init__ is generated automatically
response = LLMResponse(
    text="Embeddings are dense vector representations...",
    input_tokens=25,
    output_tokens=42,
    model="claude-3-5-sonnet-20241022",
)

# __repr__ is generated: useful for logging and debugging
print(response)
# LLMResponse(text='Embeddings are dense...', input_tokens=25, ...)

# __eq__ is generated: structural equality
r1 = LLMResponse("hi", 5, 3, "claude-3-5-sonnet-20241022")
r2 = LLMResponse("hi", 5, 3, "claude-3-5-sonnet-20241022")
print(r1 == r2)  # True: compares field values, not identity

# Attribute access with IDE support and typo protection
print(response.input_tokens + response.output_tokens)  # 67

# A typo raises AttributeError at runtime, and the IDE flags it before that:
# print(response.input_tokes)
Field Defaults and field()
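For mutable defaults (lists, dicts), you must use field(default_factory=...). A bare default such as messages: list = [] would be shared by every instance, the same pitfall as mutable default arguments in Python functions, so @dataclass rejects it with a ValueError when the class is defined: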
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConversationState:
    session_id: str
    messages: list[dict] = field(default_factory=list)  # NOT messages: list = []
    metadata: dict = field(default_factory=dict)
    total_tokens: int = 0
    is_complete: bool = False
    last_model: Optional[str] = None  # Optional fields use Optional[T]

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    @property
    def message_count(self) -> int:
        return len(self.messages)


# Usage
state = ConversationState(session_id="sess_abc123")
state.add_message("user", "What is RAG?")
state.add_message("assistant", "RAG stands for Retrieval-Augmented Generation...")
state.total_tokens = 150
print(f"Session {state.session_id}: {state.message_count} messages, {state.total_tokens} tokens")
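A short sketch of why the default_factory rule exists. The names append_message, BadState, and GoodState are hypothetical and exist only for this demonstration.

```python
from dataclasses import dataclass, field


# The classic pitfall in a plain function: the default list is created once
# and shared by every call that omits the argument.
def append_message(message: str, history: list = []) -> list:
    history.append(message)
    return history


first = append_message("hello")
second = append_message("world")
print(first is second)  # True: both calls mutated the same list

# Dataclasses refuse the equivalent field definition outright.
try:
    @dataclass
    class BadState:
        messages: list = []
except ValueError as exc:
    error_message = str(exc)
    print(f"ValueError: {error_message}")


# default_factory calls list() once per instance, so nothing is shared.
@dataclass
class GoodState:
    messages: list = field(default_factory=list)


a = GoodState()
b = GoodState()
a.messages.append("only in a")
print(b.messages)  # []
```

The ValueError is raised while the class body is being processed, so the mistake surfaces at import time rather than in the middle of a conversation.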
Frozen Dataclasses: Immutable Configuration
@dataclass(frozen=True) makes instances immutable — equivalent to a C# record with init-only properties. This is ideal for configuration objects that should not change after startup:
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Immutable AI model configuration."""
    model_name: str
    max_tokens: int
    temperature: float
    system_prompt: str

    def __post_init__(self) -> None:
        """Validation runs after __init__; use it for invariant checking."""
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError(f"temperature must be 0.0–1.0, got {self.temperature}")
        if self.max_tokens < 1:
            raise ValueError(f"max_tokens must be positive, got {self.max_tokens}")


# Config is set once and never mutated
PRODUCTION_CONFIG = ModelConfig(
    model_name="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.3,
    system_prompt="You are a helpful AI assistant.",
)

# This raises FrozenInstanceError:
# PRODUCTION_CONFIG.temperature = 0.9
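The sketch below shows three consequences of frozen=True in one place: blocked assignment, deriving a modified copy with dataclasses.replace, and hashability. SamplingConfig and the model name "example-model" are hypothetical, a deliberately smaller stand-in for ModelConfig.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SamplingConfig:  # hypothetical, for illustration only
    model_name: str
    temperature: float


base = SamplingConfig(model_name="example-model", temperature=0.3)

# Assignment is blocked at runtime.
try:
    base.temperature = 0.9
except FrozenInstanceError:
    print("cannot assign to a frozen instance")

# replace() builds a new instance with selected fields changed;
# the original is untouched.
creative = replace(base, temperature=0.9)
print(base.temperature, creative.temperature)  # 0.3 0.9

# frozen=True together with the default eq=True also generates __hash__,
# so instances work as dict keys or set members.
cache = {base: "cached result"}
print(cache[SamplingConfig("example-model", 0.3)])  # cached result
```

Because replace() goes through __init__, any __post_init__ validation runs again on the copy, so an invalid variant cannot be created this way either.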
Dataclass vs Dict vs Class: When to Use Each
| Situation | Best Choice | Reason |
|---|---|---|
| Typed, named data container | @dataclass | IDE support, equality, repr, validation |
| Immutable config or value object | @dataclass(frozen=True) | Hashable, safe to share across threads |
| Schema validation + serialization | Pydantic model | Validators, JSON export, API contracts |
| Simple local data grouping | dict | When typing overhead isn't worth it |
| Behavior-heavy domain object | Regular class | When behavior dominates over data, or you need a hand-written __init__ |
Practical AI Dataclasses
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
import json


@dataclass
class RetrievedChunk:
    """A document chunk returned by a vector database query."""
    chunk_id: str
    document_id: str
    text: str
    score: float  # cosine similarity score
    metadata: dict = field(default_factory=dict)

    @property
    def is_relevant(self) -> bool:
        return self.score >= 0.75


@dataclass
class RAGRequest:
    """A request to the RAG pipeline."""
    query: str
    top_k: int = 5
    min_score: float = 0.7
    filters: dict = field(default_factory=dict)
    session_id: Optional[str] = None
    created_at: datetime = field(default_factory=datetime.now)


@dataclass
class RAGResponse:
    """A complete response from the RAG pipeline."""
    answer: str
    source_chunks: list[RetrievedChunk] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0

    def to_dict(self) -> dict:
        """Serialize to dict for JSON storage or API response."""
        return {
            "answer": self.answer,
            "sources": [
                {"id": c.chunk_id, "score": c.score, "text": c.text[:200]}
                for c in self.source_chunks
            ],
            "tokens": {"input": self.input_tokens, "output": self.output_tokens},
            "latency_ms": self.latency_ms,
        }


# Usage in a pipeline
chunks = [
    RetrievedChunk("c1", "doc_001", "RAG retrieves relevant context...", 0.92),
    RetrievedChunk("c2", "doc_003", "Vector databases store embeddings...", 0.81),
]
response = RAGResponse(
    answer="RAG combines retrieval with generation...",
    source_chunks=[c for c in chunks if c.is_relevant],
    input_tokens=230,
    output_tokens=150,
    latency_ms=1340.5,
)
print(json.dumps(response.to_dict(), indent=2))
Dataclasses are the right choice for internal data containers where you control both the producer and consumer. Once you cross an API boundary — receiving external JSON, validating user input, or defining an API contract — use Pydantic (covered in Lesson 1.2). Pydantic adds runtime validation, coercion, and JSON serialization that dataclasses lack.
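A small standard-library sketch of what "lack" means in practice. TokenUsage is a hypothetical container; the point is only that field annotations are not checked when an instance is created.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TokenUsage:  # hypothetical container for illustration
    input_tokens: int
    output_tokens: int


# Annotations are not enforced at runtime: a string slips through where an
# int was declared, which is exactly what external JSON might deliver.
usage = TokenUsage(input_tokens="25", output_tokens=42)
print(type(usage.input_tokens).__name__)  # str

# asdict() converts the instance (recursively) into plain dicts and lists,
# which json.dumps accepts as long as every value is JSON-serializable.
payload = json.dumps(asdict(usage))
print(payload)  # {"input_tokens": "25", "output_tokens": 42}
```

asdict() covers simple cases, but fields such as datetime still need a custom default= handler passed to json.dumps, and nothing here coerces "25" to 25. Those gaps are the ones a validation library is meant to close.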
Key Takeaways
- @dataclass generates __init__, __repr__, and __eq__ from field annotations; it is Python's equivalent of a C# record
- Use field(default_factory=list) for mutable defaults; never use field: list = [] directly
- __post_init__ runs after __init__ and is the right place for validation and derived field computation
- @dataclass(frozen=True) creates immutable instances, ideal for configuration objects that should be shared safely
- Prefer dataclasses over dicts for any structured data passed between functions; IDE support and early error detection are worth the added lines
- When you need validation and JSON serialization, upgrade to Pydantic: dataclasses are for internal structures, Pydantic is for API contracts
