● Complete Learning Series Python & AI Engineering for .NET Developers

Data classes for Structured Data

Tech Buddy June 15, 2026 3 min read

In C#, you reach for a record or class as soon as a dictionary becomes unwieldy. Python developers often default to passing dict objects everywhere, which works until you lose track of which keys exist, mistype a field name, or want IDE autocompletion. Python's @dataclass decorator gives you a typed class while generating the constructor, repr, and equality boilerplate for you.

This lesson compares dataclasses against plain dicts and full classes, and shows how to structure the data types you'll use constantly in AI applications: message objects, request/response models, and configuration containers.

The Problem with Plain Dicts

# Common anti-pattern: passing dicts everywhere
def process_response(response: dict) -> dict:
    # What keys does response have? Only the creator knows.
    # What type is response["tokens"]? No IDE can tell you.
    return {
        "text": response["content"][0]["text"],   # KeyError waiting to happen
        "tokens": response["usage"]["output_tokens"],
        "model": response.get("model", "unknown"),
    }

# Caller can't know what to pass without reading the function
result = process_response({
    "content": [{"text": "Hello"}],
    "usage": {"input_tokens": 10, "output_tokens": 20},
})
print(result["text"])  # works today, breaks when keys change

The problems compound in AI systems: responses get passed through multiple functions, their shapes evolve as you add fields, and typos in key names only fail at runtime.
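
To make the typo failure mode concrete, here is a minimal sketch. The `Usage` class and the misspelled key are hypothetical, invented for illustration. With a dict, `.get()` swallows the typo and returns the default; with a dataclass, the same mistake raises at the point of access, and a type checker would flag it before the code runs.

```python
from dataclasses import dataclass

# Dict version: a misspelled key passed to .get() silently yields the default.
usage_dict = {"input_tokens": 10, "output_tokens": 20}
silent_result = usage_dict.get("output_tokes", 0)  # typo goes unnoticed
print(silent_result)  # 0


@dataclass
class Usage:  # hypothetical container, for illustration only
    input_tokens: int
    output_tokens: int


usage = Usage(input_tokens=10, output_tokens=20)

# Dataclass version: the same typo fails loudly where the mistake is made.
typo_error = None
try:
    usage.output_tokes  # type: ignore[attr-defined]
except AttributeError as exc:
    typo_error = str(exc)
    print(typo_error)
```

The dict version returns a plausible-looking `0`, which is arguably worse than a crash because the wrong number can flow into token accounting unnoticed.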

Dataclasses: The Solution

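A dataclass generates __init__, __repr__, and __eq__ from its field definitions. It is the closest Python counterpart to a C# record, with one difference: a plain @dataclass is mutable, whereas positional C# records are immutable by default. Pass frozen=True (covered below) to get record-like immutability.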

C# Record

public record ChatMessage(
    string Role,
    string Content,
    DateTime Timestamp
);

public record LLMResponse(
    string Text,
    int InputTokens,
    int OutputTokens,
    string Model,
    string FinishReason
);

Python Dataclass

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChatMessage:
    role: str
    content: str
    timestamp: datetime = field(
        default_factory=datetime.now
    )

@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    model: str
    finish_reason: str
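
For a sense of what the decorator saves, the sketch below writes the same three methods by hand next to the dataclass form. Both class names are hypothetical and the field list is trimmed to three for brevity; the hand-written `__eq__` is one reasonable implementation, not a copy of what `@dataclass` generates internally.

```python
from dataclasses import dataclass


class ManualLLMResponse:
    """Hand-written version: every field name is repeated several times."""

    def __init__(self, text: str, input_tokens: int, output_tokens: int) -> None:
        self.text = text
        self.input_tokens = input_tokens
        self.output_tokens = output_tokens

    def __repr__(self) -> str:
        return (
            f"ManualLLMResponse(text={self.text!r}, "
            f"input_tokens={self.input_tokens!r}, "
            f"output_tokens={self.output_tokens!r})"
        )

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ManualLLMResponse):
            return NotImplemented
        return (self.text, self.input_tokens, self.output_tokens) == (
            other.text, other.input_tokens, other.output_tokens
        )


@dataclass
class AutoLLMResponse:
    """Dataclass version: __init__, __repr__, and __eq__ are generated."""
    text: str
    input_tokens: int
    output_tokens: int


manual = ManualLLMResponse("hi", 5, 3)
auto = AutoLLMResponse("hi", 5, 3)
print(manual)  # ManualLLMResponse(text='hi', input_tokens=5, output_tokens=3)
print(auto)    # AutoLLMResponse(text='hi', input_tokens=5, output_tokens=3)
```

Adding a field to the manual class means touching three methods; in the dataclass it is one new line.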

What You Get for Free

from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    model: str
    finish_reason: str = "end_turn"   # default value

# __init__ is generated automatically
response = LLMResponse(
    text="Embeddings are dense vector representations...",
    input_tokens=25,
    output_tokens=42,
    model="claude-3-5-sonnet-20241022",
)

# __repr__ is generated: useful for logging and debugging
print(response)
# LLMResponse(text='Embeddings are dense...', input_tokens=25, ...)

# __eq__ is generated: structural equality
r1 = LLMResponse("hi", 5, 3, "claude-3-5-sonnet-20241022")
r2 = LLMResponse("hi", 5, 3, "claude-3-5-sonnet-20241022")
print(r1 == r2)  # True: compares field values, not identity

# Attribute access with IDE support and typo protection
print(response.input_tokens + response.output_tokens)  # 67

# A misspelled attribute raises AttributeError at runtime, and an IDE or
# type checker flags it before the code runs. Left commented out so the
# script runs to completion:
# print(response.input_tokes)

Field Defaults and field()

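For mutable defaults (lists, dicts, sets), you must use field(default_factory=...). A bare default such as messages: list = [] would be a single object shared by every instance, the same pitfall as Python's mutable default arguments, so @dataclass rejects it with a ValueError when the class is defined. C# sidesteps the problem because default parameter values must be compile-time constants: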

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConversationState:
    session_id: str
    messages: list[dict] = field(default_factory=list)   # NOT messages: list = []
    metadata: dict = field(default_factory=dict)
    total_tokens: int = 0
    is_complete: bool = False
    last_model: Optional[str] = None  # Nullable fields use Optional[T]

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    @property
    def message_count(self) -> int:
        return len(self.messages)


# Usage
state = ConversationState(session_id="sess_abc123")
state.add_message("user", "What is RAG?")
state.add_message("assistant", "RAG stands for Retrieval-Augmented Generation...")
state.total_tokens = 150

print(f"Session {state.session_id}: {state.message_count} messages, {state.total_tokens} tokens")
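
The mutable-default rule is easy to verify directly. In the sketch below (both class names are hypothetical), the first definition is rejected when the class is created, and the second shows that `default_factory` gives each instance its own list.

```python
from dataclasses import dataclass, field

# 1. A bare mutable default is rejected when the class is defined.
definition_error = None
try:
    @dataclass
    class BrokenState:  # hypothetical, for illustration only
        messages: list = []
except ValueError as exc:
    definition_error = str(exc)
    print(definition_error)


# 2. default_factory calls list() once per instance.
@dataclass
class SafeState:  # hypothetical, for illustration only
    messages: list = field(default_factory=list)


first = SafeState()
second = SafeState()
first.messages.append("hello")

print(first.messages)   # ['hello']
print(second.messages)  # []
```

Because the error surfaces at class definition rather than on first use, this mistake usually shows up the moment the module is imported.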

Frozen Dataclasses: Immutable Configuration

@dataclass(frozen=True) makes instances immutable — equivalent to a C# record with init-only properties. This is ideal for configuration objects that should not change after startup:

from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Immutable AI model configuration."""
    model_name: str
    max_tokens: int
    temperature: float
    system_prompt: str

    def __post_init__(self) -> None:
        """Validation runs after __init__; use it for invariant checking."""
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError(f"temperature must be 0.0–1.0, got {self.temperature}")
        if self.max_tokens < 1:
            raise ValueError(f"max_tokens must be positive, got {self.max_tokens}")


# Config is set once and never mutated
PRODUCTION_CONFIG = ModelConfig(
    model_name="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.3,
    system_prompt="You are a helpful AI assistant.",
)

# This raises dataclasses.FrozenInstanceError:
# PRODUCTION_CONFIG.temperature = 0.9
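
When a frozen config does need a variant, the usual route is to build a new instance rather than mutate the old one. The sketch below uses a hypothetical `SamplingConfig` as a smaller stand-in for `ModelConfig` and shows three behaviours: blocked assignment, copying with `dataclasses.replace`, and use as a dict key.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SamplingConfig:  # hypothetical stand-in for ModelConfig
    model_name: str
    temperature: float


base = SamplingConfig(model_name="example-model", temperature=0.3)

# Assignment is blocked on a frozen instance.
mutation_blocked = False
try:
    base.temperature = 0.9  # type: ignore[misc]
except FrozenInstanceError:
    mutation_blocked = True

# replace() builds a new instance; __post_init__ (if defined) runs again,
# so validation also applies to the copy.
creative = replace(base, temperature=0.9)
print(base.temperature, creative.temperature)  # 0.3 0.9

# Frozen dataclasses with the default eq=True are hashable,
# provided every field value is itself hashable.
cache = {base: "cached response"}
print(cache[SamplingConfig("example-model", 0.3)])  # cached response
```

Note that freezing is shallow: a list or dict stored in a frozen field can still be mutated in place, so prefer tuples and other immutable values inside frozen configs.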

Dataclass vs Dict vs Class: When to Use Each

Situation	Best Choice	Reason
Typed, named data container	`@dataclass`	IDE support, equality, repr, validation via `__post_init__`
Immutable config or value object	`@dataclass(frozen=True)`	Hashable, safe to share across threads
Schema validation + serialization	Pydantic model	Validators, JSON export, API contracts
Simple local data grouping	`dict`	When typing overhead isn't worth it
Behavior-heavy domain object	Regular class	When behavior dominates the data and a generated `__init__`/`__eq__` would get in the way

Practical AI Dataclasses

from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
import json

@dataclass
class RetrievedChunk:
    """A document chunk returned by a vector database query."""
    chunk_id: str
    document_id: str
    text: str
    score: float              # cosine similarity score
    metadata: dict = field(default_factory=dict)

    @property
    def is_relevant(self) -> bool:
        return self.score >= 0.75


@dataclass
class RAGRequest:
    """A request to the RAG pipeline."""
    query: str
    top_k: int = 5
    min_score: float = 0.7
    filters: dict = field(default_factory=dict)
    session_id: Optional[str] = None
    created_at: datetime = field(default_factory=datetime.now)


@dataclass
class RAGResponse:
    """A complete response from the RAG pipeline."""
    answer: str
    source_chunks: list[RetrievedChunk] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0

    def to_dict(self) -> dict:
        """Serialize to dict for JSON storage or API response."""
        return {
            "answer": self.answer,
            "sources": [
                {"id": c.chunk_id, "score": c.score, "text": c.text[:200]}
                for c in self.source_chunks
            ],
            "tokens": {"input": self.input_tokens, "output": self.output_tokens},
            "latency_ms": self.latency_ms,
        }


# Usage in a pipeline
chunks = [
    RetrievedChunk("c1", "doc_001", "RAG retrieves relevant context...", 0.92),
    RetrievedChunk("c2", "doc_003", "Vector databases store embeddings...", 0.81),
]

response = RAGResponse(
    answer="RAG combines retrieval with generation...",
    source_chunks=[c for c in chunks if c.is_relevant],
    input_tokens=230,
    output_tokens=150,
    latency_ms=1340.5,
)

print(json.dumps(response.to_dict(), indent=2))
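
A hand-written `to_dict` like the one above gives full control over the output shape. When a one-to-one dump is enough, the standard library's `dataclasses.asdict` is an alternative worth knowing. The sketch below uses hypothetical, trimmed-down classes (`Chunk`, `PipelineResult`) and a fixed timestamp so the output is predictable; `default=str` is just one simple way to handle `datetime`, not the only one.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class Chunk:  # hypothetical, trimmed-down RetrievedChunk
    chunk_id: str
    score: float


@dataclass
class PipelineResult:  # hypothetical, trimmed-down RAGResponse
    answer: str
    source_chunks: list[Chunk] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)


result = PipelineResult(
    answer="RAG combines retrieval with generation...",
    source_chunks=[Chunk("c1", 0.92)],
    created_at=datetime(2026, 6, 15, 12, 0, 0),
)

# asdict() recurses into nested dataclasses, lists, and dicts.
as_plain = asdict(result)
print(as_plain["source_chunks"])  # [{'chunk_id': 'c1', 'score': 0.92}]

# datetime is not JSON-serializable out of the box; default=str converts
# any unsupported value with str() as a simple workaround.
payload = json.dumps(as_plain, default=str)
print(payload)
```

`asdict` copies every field verbatim, so it cannot rename keys or truncate text the way `to_dict` does; pick whichever matches the contract you need.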

📚 dataclass vs Pydantic

Dataclasses are the right choice for internal data containers where you control both the producer and consumer. Once you cross an API boundary — receiving external JSON, validating user input, or defining an API contract — use Pydantic (covered in Lesson 1.2). Pydantic adds runtime validation, coercion, and JSON serialization that dataclasses lack.
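
The gap is easy to see with only the standard library. In this sketch, `TokenUsage` and its `from_api` helper are hypothetical: the first call shows that annotations are not enforced at runtime, and the helper shows the kind of manual coercion and validation that a library such as Pydantic would otherwise automate.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:  # hypothetical, for illustration only
    input_tokens: int
    output_tokens: int

    @classmethod
    def from_api(cls, raw: dict) -> "TokenUsage":
        """Manual coercion and validation for untrusted input."""
        try:
            return cls(
                input_tokens=int(raw["input_tokens"]),
                output_tokens=int(raw["output_tokens"]),
            )
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"invalid usage payload: {raw!r}") from exc


# Annotations are not enforced: the string is stored as-is.
unchecked = TokenUsage(input_tokens="25", output_tokens=42)
print(type(unchecked.input_tokens).__name__)  # str

# The hand-written constructor coerces and validates.
checked = TokenUsage.from_api({"input_tokens": "25", "output_tokens": 42})
print(checked.input_tokens + checked.output_tokens)  # 67

rejection = None
try:
    TokenUsage.from_api({"input_tokens": "many"})
except ValueError as exc:
    rejection = str(exc)
    print(rejection)
```

Writing one such helper is fine; writing one per model for every external payload is the point at which a validation library starts to pay for itself.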

Key Takeaways

@dataclass generates __init__, __repr__, and __eq__ from field annotations; it is the closest Python counterpart to a C# record (add frozen=True for record-style immutability)
Use field(default_factory=list) for mutable defaults; a bare messages: list = [] raises ValueError at class definition
__post_init__ runs after __init__ and is the right place for validation and derived field computation
@dataclass(frozen=True) creates immutable instances — ideal for configuration objects that should be shared safely
Prefer dataclasses over dicts for any structured data passed between functions — IDE support and early error detection are worth the added lines
When you need validation and JSON serialization, upgrade to Pydantic — dataclasses are for internal structures, Pydantic is for API contracts
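
The derived-field use of `__post_init__` mentioned in the takeaways can be sketched as follows. `UsageSummary` is a hypothetical class; `field(init=False)` keeps the computed attribute out of the generated constructor so callers cannot pass a conflicting value.

```python
from dataclasses import dataclass, field


@dataclass
class UsageSummary:  # hypothetical, for illustration only
    input_tokens: int
    output_tokens: int
    # init=False keeps total_tokens out of the generated __init__ signature.
    total_tokens: int = field(init=False)

    def __post_init__(self) -> None:
        # Validate first, then compute the derived value.
        if self.input_tokens < 0 or self.output_tokens < 0:
            raise ValueError("token counts must be non-negative")
        self.total_tokens = self.input_tokens + self.output_tokens


summary = UsageSummary(input_tokens=25, output_tokens=42)
print(summary.total_tokens)  # 67
```

On a frozen dataclass the plain assignment in `__post_init__` would raise, so the derived value has to be set with `object.__setattr__(self, "total_tokens", ...)` instead.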
