Learn to Use @dataclass in Python

MartinLwx included in category Python

2024-08-17

Intro

I like Python’s tuple, which allows us to quickly bundle together values of different types as a single entity and manage them in an intuitive and easy-to-use way. However, I find that once the tuple has many fields, I’m forced to add a comment to indicate the meaning of each field. For example,

python

t = (3, 4, 3.5)  # (x, y, value)

I can then jump to this definition and read the comment to recall what each field means. This is inconvenient, though: as the code grows, it gets harder to locate that specific line.

So I would write a class instead, which lets me reference a field by name rather than by position. The naive way involves boilerplate such as __init__, __repr__, etc. Today's topic is dataclass, which saves us from writing that boilerplate.
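To make the boilerplate concrete, here is a minimal sketch of what the hand-written class for the tuple above might look like. The exact method bodies are my assumption of a reasonable hand-written version, not code from any library.

```python
class Point:
    """Hand-written class for (x, y, value); an illustrative sketch."""

    def __init__(self, x: int, y: int, value: float = 0.0):
        self.x = x
        self.y = y
        self.value = value

    def __repr__(self) -> str:
        return f"Point(x={self.x!r}, y={self.y!r}, value={self.value!r})"

    def __eq__(self, other: object) -> bool:
        # Only compare against other Point instances
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y, self.value) == (other.x, other.y, other.value)


p = Point(3, 4, 3.5)
print(p)        # Point(x=3, y=4, value=3.5)
print(p.value)  # 3.5
```

Every new field has to be repeated in all three methods, which is exactly the repetition we want to avoid.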

Tip

To learn a programming language feature, you just need to answer three questions:

What's the syntax?
What are the semantics?
What's the usage?

Syntax

python

from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0

The example above defines a Point class, which contains 3 fields: coordinates (x, y) and the corresponding value (the default value is 0). We can draw some conclusions here:

dataclass is a Python decorator.
Instance variables of the class must have type hints. A default value is optional.
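The type-hint requirement can be checked with a small sketch. It uses dataclasses.fields to list what the decorator picked up; the unit attribute is a hypothetical addition for illustration only.

```python
from dataclasses import dataclass, fields


@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0
    unit = "cm"  # hypothetical: no type hint, so a plain class attribute, not a field


field_names = [f.name for f in fields(Point)]
print(field_names)  # ['x', 'y', 'value']
print(Point.unit)   # cm
```

Since unit is not a field, it does not show up as a parameter of the generated __init__ either.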

Semantics

So how is this helpful?

python

... # omitted
foo = Point(3, 4, 3.5)  # __init__
bar = Point(3, 4, 3.5)  # __init__
print(foo)  # __repr__
print(foo.x)  # named reference
print(foo == bar)  # __eq__

As we can see, the @dataclass decorator implements __init__, __repr__, and __eq__ for us, and we can reference a specific field by name.

The @dataclass decorator finds the fields by reading the class's __annotations__, which Python fills in with the type-hinted names in declaration order.

python

print(Point.__annotations__)
# {'x': <class 'int'>, 'y': <class 'int'>, 'value': <class 'float'>}

To summarize, the semantics are:

The @dataclass decorator generates some dunder methods for us, including __init__, __repr__, and __eq__. If you implement one of these methods yourself, your implementation takes precedence.
The fields with type hints become the parameters of the generated methods, in declaration order.

Take __init__ as an example. The generated method is equivalent to:

python

class Point:
    ...
    def __init__(self, x: int, y: int, value: float = 0.0):
        self.x = x
        self.y = y
        self.value = value
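The precedence rule from the summary above can be seen in a short sketch: the class defines its own __repr__, and the decorator leaves it alone while still generating the other methods. The repr format here is just an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0

    # Defined explicitly, so @dataclass does not replace it
    def __repr__(self) -> str:
        return f"<Point at ({self.x}, {self.y})>"


p = Point(3, 4)
print(p)                 # <Point at (3, 4)>
print(p == Point(3, 4))  # True, __eq__ is still generated
```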

Advanced Usage

The syntax and semantics above are enough for basic usage. However, dataclass is much more powerful than you might think. It lets us control how the methods are generated in general, as well as the behavior of each field.

@dataclass is a decorator, that is, a special kind of function, so we can control what it generates by passing arguments. In my opinion, the most important arguments are the following (shown with their default values):

python

from dataclasses import dataclass


@dataclass(
    init=True,      # generate __init__ method
    repr=True,      # generate __repr__ method
                    # default format: <classname>(field1=..., field2=..., ...)
    eq=True,        # generate __eq__, comparing instances like tuples of fields
    order=False,    # if True, generate __lt__, __le__, __gt__, __ge__ methods
    frozen=False,   # if True, assigning to fields raises an exception
)
class Point:
    x: int
    y: int
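As a quick illustration of the frozen argument, the sketch below tries to assign to a field of a frozen data class. FrozenPoint is a hypothetical name; FrozenInstanceError is the exception class exported by the dataclasses module.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenPoint:  # hypothetical name for illustration
    x: int
    y: int


p = FrozenPoint(3, 4)
error = None
try:
    p.x = 10  # assignment to a field of a frozen instance
except FrozenInstanceError as exc:
    error = exc
    print(f"assignment rejected: {exc}")

# With eq=True (the default) and frozen=True, instances are also hashable
lookup = {p: "some label"}
print(lookup[FrozenPoint(3, 4)])  # some label
```

Being hashable means frozen instances can serve as dictionary keys or set members, much like tuples.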

Take order as an example: we want Point to be comparable, first by the coordinates (x, y) and then by the value.

python

from dataclasses import dataclass


@dataclass(order=True)
class Point:
    x: int
    y: int
    value: float


foo = Point(3, 4, 3.5)  # __init__
bar = Point(3, 4, 4.5)  # __init__
print(foo < bar)  # __lt__
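Because the generated comparison methods compare instances as tuples of their fields, sorting works out of the box. The sketch below, using made-up sample points, checks that the order matches sorting the equivalent tuples.

```python
from dataclasses import astuple, dataclass


@dataclass(order=True)
class Point:
    x: int
    y: int
    value: float


points = [Point(3, 4, 4.5), Point(1, 9, 0.0), Point(3, 4, 3.5)]
ordered = sorted(points)
print(ordered)
# [Point(x=1, y=9, value=0.0), Point(x=3, y=4, value=3.5), Point(x=3, y=4, value=4.5)]

# Same ordering as sorting the equivalent plain tuples
same_as_tuples = [astuple(p) for p in ordered] == sorted(astuple(p) for p in points)
print(same_as_tuples)  # True
```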

To control each field's behavior, we use the field function from the dataclasses module. It also takes many arguments, which are covered in the official documentation. I will only talk about the ones I consider important:

default, default_factory: use one of these to set the default value. The former sets a default value directly, while the latter specifies a callable that takes no arguments (for example, list, set, etc.).
repr: should this field be included in the generated string representation?
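Here is a small sketch of the repr argument. The debug_note field is a hypothetical example of something you might want to keep on the instance but leave out of the printed form.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    x: int
    y: int
    # hypothetical field: stored on the instance but hidden from __repr__
    debug_note: str = field(default="", repr=False)


p = Point(3, 4, debug_note="loaded from cache")
print(p)             # Point(x=3, y=4)
print(p.debug_note)  # loaded from cache
```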

Say we now change the value field to values, so that each coordinate holds a list of values:

python

from dataclasses import dataclass, field


@dataclass
class Point:
    x: int
    y: int
    values: list[float] = field(default_factory=list)


foo = Point(3, 4, [3.5, 4.5, 5.5])  # __init__
print(foo.__annotations__)
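Why default_factory rather than a plain [] default? A rough sketch: the factory is called once per instance, so instances do not share a list, and a bare mutable default is rejected when the class is created. BadPoint is a hypothetical name used only to trigger that error.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    x: int
    y: int
    values: list[float] = field(default_factory=list)


a = Point(0, 0)
b = Point(1, 1)
a.values.append(3.5)
print(a.values)  # [3.5]
print(b.values)  # []

# A bare mutable default raises ValueError at class creation time
rejected = False
try:
    @dataclass
    class BadPoint:  # hypothetical name for illustration
        values: list[float] = []
except ValueError:
    rejected = True
print(rejected)  # True
```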

Finally, let's talk about inheritance. A class decorated with @dataclass is still an ordinary class, so a data class can inherit from another data class. What if both of them contain fields with the same name? For example:

python

from dataclasses import dataclass


@dataclass
class A:
    x: int = 1
    y: int = 2
    z: int = 5


@dataclass
class B(A):
    x: int = 3
    y: int = 4


foo = B()
print(foo)
# B(x=3, y=4, z=5)

It works like this¹:

The decorator walks the base classes in reverse MRO, that is, starting from object, and collects the fields of every data class it finds.
Finally, it adds the class's own fields and merges the result. If several fields share a name, the later definition overrides the earlier one while keeping the field's original position.
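The merge order described above can be inspected with dataclasses.fields. In this sketch, B overrides only y and adds a hypothetical new field w, to show that overridden fields keep their original position and new fields are appended at the end.

```python
from dataclasses import dataclass, fields


@dataclass
class A:
    x: int = 1
    y: int = 2
    z: int = 5


@dataclass
class B(A):
    y: int = 4  # overrides A.y but keeps its position
    w: int = 0  # hypothetical new field, appended after the inherited ones


merged = [(f.name, f.default) for f in fields(B)]
print(merged)  # [('x', 1), ('y', 4), ('z', 5), ('w', 0)]
print(B(9))    # B(x=9, y=4, z=5, w=0)
```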

FAQ

How to differentiate instance variables and class variables in the data class?

Use type hints to distinguish between them: annotate a class variable with typing.ClassVar.

python

from typing import ClassVar
from dataclasses import dataclass, field


@dataclass
class Point:
    x: int
    y: int
    values: list[float] = field(default_factory=list)
    a_class_variable: ClassVar[int] = 3


a_point = Point(3, 4)
print(Point.a_class_variable)
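To confirm that a ClassVar annotation really is excluded from the fields, a minimal sketch can list the fields and check that the value is shared through the class:

```python
from dataclasses import dataclass, fields
from typing import ClassVar


@dataclass
class Point:
    x: int
    y: int
    a_class_variable: ClassVar[int] = 3


p = Point(3, 4)
field_names = [f.name for f in fields(Point)]
print(field_names)         # ['x', 'y']
print(p)                   # Point(x=3, y=4)
print(p.a_class_variable)  # 3

# Changing it on the class is visible through every instance
Point.a_class_variable = 7
print(p.a_class_variable)  # 7
```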

vs namedtuple

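| Feature | `@dataclass` | `collections.namedtuple` |
| --- | --- | --- |
| Instances of different types compare equal when their field values match (e.g. `Point3D(2017, 6, 2) == Date(2017, 6, 2)`) | ❌ | ✅ |
| Set a default value for a field | ✅ | ✅ (only via `defaults=`, Python 3.7+) |
| Control each field's behavior (`__init__`, `__repr__`, etc.) | ✅ | ❌ |
| Merge fields by inheritance | ✅ | ❌ |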
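The first row of the comparison can be reproduced with a short sketch. The namedtuple type names Point3DTuple and DateTuple are hypothetical, chosen to avoid clashing with the data classes.

```python
from collections import namedtuple
from dataclasses import dataclass

Point3DTuple = namedtuple("Point3DTuple", ["x", "y", "z"])
DateTuple = namedtuple("DateTuple", ["year", "month", "day"])


@dataclass
class Point3D:
    x: int
    y: int
    z: int


@dataclass
class Date:
    year: int
    month: int
    day: int


# namedtuples are tuples, so only the values matter
tuples_equal = Point3DTuple(2017, 6, 2) == DateTuple(2017, 6, 2)
# data classes of different types never compare equal
dataclasses_equal = Point3D(2017, 6, 2) == Date(2017, 6, 2)
print(tuples_equal)       # True
print(dataclasses_equal)  # False
```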

Wrap-up

Python's @dataclass lets us describe a data class in a declarative way. We only need to describe each field's type, default value, etc., and @dataclass generates some useful methods for us automatically. The mental model might be: data class = mutable namedtuple with default values 👍

Refs

PEP 557 – Data Classes