Intro and Takeaways

I recently started investigating performance differences between the different data class libraries in Python: dataclass, attrs, and pydantic.This simple investigation quickly spiralled into many different threads. I wrote this post partly to rein in the chaos, and partly to better understand the data class landscape. This post is not new ground, and is heavily influenced by two great posts investigating a similar question:

My big takeaway is performance differences are very small, i.e. not large enough that one should choose one package over another based on performance. If you really want the fastest performance, use either attrs or dataclass without validation using only positional argument calls. The only significant slow down is using datetime.datetime or datetime.date objects in pydantic. I don’t want to wholly recommend against it, but if performance is an utmost concern then one should investigate further when using datetime objects.

Experimental Setup and Measurement

To measure performance, I used the timeit module. As per timeit’s documents, I measured the performance using timeit.repeat(number=1_000_000, repeat=5), taking the minimum value of the 5 runs.

I started by defining a basic class Inventory which has three fields: item, price, quantity. I keep the type annotation blank, as defining more generic types (think Any) are a potential performance improvement. Note, this performance investigation is focused on this simple non-nested class. There is a comment in the pydantic documents that init is faster than model_construct for simpler versus more complex classes. Thus, I don’t suggest these results are generically true.

class InventoryAttrs:
    item
    price 
    quantity

We’ll walk through a series of experiments, discoveries I made along the way. I use partial to define a simple time_experiment function I use throughout:

time_experiment = partial(
    timeit.repeat,
    setup='pass',
    number=LOOPS,
    globals=globals()
)

Comparing “Basic” Classes

First, we will compare the following:

from pydantic import BaseModel 
from attrs import define 

# Define classes for pydantic, dataclass, and attrs
class InventoryPydantic(BaseModel):
    item: str 
    price: float
    quantity: int = 0 

@dataclass
class InventoryDataclass:
    item: str 
    price: float 
    quantity: int = 0

@define 
class InventoryAttrs:
    item: str 
    price: float 
    quantity: int = 0

Our experiment instantiates each of the classes using Keyword arguments, i.e:

pydantic_base = time_experiment(stmt="InventoryPydantic(item='banana', price=1.99, quantity=10)")
dataclass_base = time_experiment(stmt="InventoryDataclass(item='banana', price=1.99, quantity=10)")
attrs_base = time_experiment(stmt="InventoryAttrs(item='banana', price=1.99, quantity=10)")

We can easily see that dataclass and attrs is much much faster. This is a well known outcome. The reason is that pydantic not only initializes an object, it also runs validation on the attributes of the object, i.e. checking the item is actually a str etc.

Let’s try to do an apples to apples comparison by adding validation requirements to attrs. Note, there are no methods to validate dataclass within the standard library, so we ignore it.

Comparing Classes using validations

We will keep our InventoryPydantic, but define two new classes with validations:

from attrs import validators, field, define 

@define 
class InventoryAttrsValidated:
    item: str = field(validator=validators.instance_of(str))
    price: float = field(validator=validators.instance_of(float))
    quantity: int = field(default=0, validator=validators.instance_of(int))

I ran the following experiment:

pydantic_base = time_experiment(stmt="InventoryPydantic(item='banana', price=1.99, quantity=10)")
attrs_validated = time_experiment(stmt="InventoryAttrsValidated(item='banana', price=1.99, quantity=10)")

We can see that even with validation, attrs performs better than pydantic:

Trying to speed up Pydantic…unsuccessfully

There were a few methods I attempted to speed up Pydantic. Note, I’m on pydantic==2.5.2, so some of the tips and tricks may not work on this. Here’s the few I tried:

  • Use SkipValidation type, referenced in the Validator section of the pydantic docs
  • Use Any type
  • Instantiate class with model_constructs. Note, the docs explicitly call out of that this may be slower for simpler classes.
# Import additional modules
from pydantic import SkipValidation, BaseModel 
from typing import Any 

# pydantic with any types
class InventoryPydanticAny(BaseModel):
    # https://docs.pydantic.dev/latest/concepts/performance/
    item: Any 
    price: Any
    quantity: Any = 0 

# pydantic with skip validation types
class InventoryPydanticSkipValidation(BaseModel):
    item: SkipValidation[str]
    price: SkipValidation[float]
    quantity: SkipValidation[int] = 0 

I ran the following experiment:

pydantic_base = time_experiment(stmt="InventoryPydantic(item='banana', price=1.99, quantity=10)")
pydantic_any = time_experiment(stmt="InventoryPydanticAny(item='banana', price=1.99, quantity=10)")
pydantic_skip_validation = time_experiment(stmt="InventoryPydanticSkipValidation(item='banana', price=1.99, quantity=10)")
pydantic_model_constructs = time_experiment(stmt="InventoryPydantic.model_constructs(item='banana', price=1.99, quantity=10)")

Surprisingly, none of these methods sped up Pydantic. Note, this may simply be due to the simplicity of the class.

Trying to speed up Dataclass and Attrs…somewhat successfully

As I was playing around with some different permutations, I stumbled on one sure fire way to improve performance, pass in only positional arguments when not running validation.

pydantic_base = time_experiment(stmt="InventoryPydantic(item='banana', price=1.99, quantity=10)")
attrs_base = time_experiment(stmt="InventoryAttrsValidated(item='banana', price=1.99, quantity=10)")
attrs_positional_only = time_experiment(stmt="InventoryAttrsValidated('banana', 1.99, 10)")
attrs_positional_only_with_validation = time_experiment(stmt="InventoryAttrsValidated('banana', 1.99, 10)")
attrs_positional_and_keyword_with_validation = time_experiment(stmt="InventoryAttrsValidated('banana', price=1.99, quantity=10)")

pydantic requires keyword arguments, so we can only test this for attrs. What we see is quite a performance improvement. Interestingly, when you run validation for attrs this goes away.

The impact of positional versus keyword arguments was fairly surprising to me, though there’s some talk of this when I do some searches. I had assumed that when you pass in an argument, it gets bound to an attribute in the same manner, regardless of if you passed it in as a positional or a keyword argument. Turns out this is wrong, and there’s a bit of a rabbit hole you can go down, see this (potentially outdated) blog post. TLDR; there are optimizations within CPython for when a function is called, if the function only provides positional arguments there’s a faster path that’s taken.

datetime worsens Pydantic performance

Another issue I stumbled upon was how datetime was slow to validate for pydantic. I was able to validate this, trying both datetime.datetime and datetime.date. I had hypothesized since date doesn’t have time attributes, maybe it would be significantly faster.

pydantic_base = time_experiment(stmt="InventoryPydantic(item='banana', price=1.99, quantity=10)")
attrs_base = time_experiment(stmt="InventoryAttrsValidated(item='banana', price=1.99, quantity=10)")
attrs_datetime = time_experiment(stmt="InventoryAttrsValidatedDatetime(item='banana', price=1.99, quantity=10, datetime=datetime.datetime(year=2023, month=10, day=1, hour=6, minute=30, second=10))")
attrs_date = time_experiment(stmt="InventoryAttrsValidatedDatetime(item='banana', price=1.99, quantity=10, date=datetime.date(year=2023, month=10, day=1))")
pydantic_datetime = time_experiment(stmt="InventoryPydanticDatetime(item='banana', price=1.99, quantity=10, datetime=datetime.datetime(year=2023, month=10, day=1, hour=6, minute=30, second=10))")
pydantic_date = time_experiment(stmt="InventoryPydanticDate(item='banana', price=1.99, quantity=10, date=datetime.date(year=2023, month=10, day=1))")

We see that pydantic is significantly slower with datetime. This was actually discussed in one of the blog posts mentioned up top. pydantic uses it’s own home rolled date time validation.

Conclusion

Overall, we see some interesting overall performance differences.

The most surprising was definitely around positional and keyword arguments, as well as, how datetime can slow down pydantic performance. Other areas of investigation here are, measuring performance for more complex classes (both nested and enum), measuring memory usage differences, as well as, further investigating how to speed up pydantic performance.