Learn to Use @dataclass in Python
Intro
I like Python’s tuple, which allows us to quickly bundle together values of different types as a single entity and manage them in an intuitive and easy-to-use way. However, I find that once the tuple has many fields, I’m forced to add a comment to indicate the meaning of each field. For example,
t = (3, 4, 3.5) # (x, y, value)
Then I can jump to this definition and check the comment to see what each field means. But this is inconvenient: once the code grows longer, it is not easy to locate that specific line.
So I would write a class, which allows me to reference a field by name instead of by position. The naive way involves boilerplate, including __init__, __repr__, etc. The topic today is dataclass, which saves us from writing such boilerplate.
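To make the boilerplate concrete, here is a rough sketch of what the hand-written class might look like. The method bodies are my own illustration of the naive approach, not code from a library:

```python
class Point:
    """Hand-written data holder, shown only to illustrate the boilerplate."""

    def __init__(self, x, y, value=0.0):
        self.x = x
        self.y = y
        self.value = value

    def __repr__(self):
        return f"Point(x={self.x!r}, y={self.y!r}, value={self.value!r})"

    def __eq__(self, other):
        # Only compare with objects of exactly the same class.
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.x, self.y, self.value) == (other.x, other.y, other.value)


p = Point(3, 4, 3.5)
print(p)                      # Point(x=3, y=4, value=3.5)
print(p.x)                    # 3
print(p == Point(3, 4, 3.5))  # True
```

Every new field has to be added in three places here, which is exactly the repetition we would like to avoid.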
To learn a programming language feature, you just need to answer three questions:
- What’s the syntax?
- What are the semantics?
- What’s the usage?
Syntax
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0
The example above defines a Point class, which contains 3 fields: the coordinates (x, y) and the corresponding value (the default value is 0.0). We can draw some conclusions here:
- dataclass is a Python decorator.
- Instance variables of the class must have type hints. The default value is optional.
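A small sketch of why the type hint matters: an attribute without one is not picked up as a field. The unit attribute below is a hypothetical addition for illustration, and fields is the helper from the dataclasses module that lists the fields of a data class.

```python
from dataclasses import dataclass, fields

@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0
    unit = "cm"  # hypothetical attribute without a type hint: not a field

print([f.name for f in fields(Point)])  # ['x', 'y', 'value']
print(Point(3, 4))                      # Point(x=3, y=4, value=0.0)
print(Point.unit)                       # cm
```

Since unit has no annotation, it stays a plain class attribute and does not show up in the generated __init__ or __repr__.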
Semantics
So how is this helpful?
... # omitted
foo = Point(3, 4, 3.5) # __init__
bar = Point(3, 4, 3.5) # __init__
print(foo) # __repr__
print(foo.x) # named reference
print(foo == bar) # __eq__
So, we can clearly see that the @dataclass decorator implements __init__, __repr__, and __eq__ for us, and we can reference a specific field by name.
The @dataclass decorator discovers the fields by reading the class’s __annotations__, which Python fills in from the type hints (in declaration order):
print(Point.__annotations__)
# {'x': <class 'int'>, 'y': <class 'int'>, 'value': <class 'float'>}
To summarize, the semantics are:
- The @dataclass decorator generates some dunder methods for us, including __init__, __repr__, and __eq__. If you implement one of these methods yourself, your implementation takes precedence.
- The fields with type hints become the arguments of the generated functions, in declaration order.
Take __init__ as an example; the generated method is roughly as follows:
class Point:
    ...
    def __init__(self, x: int, y: int, value: float = 0.0):
        self.x = x
        self.y = y
        self.value = value
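Both points of the summary can be checked with a short sketch. Here I use inspect.signature from the standard library to look at the generated __init__ parameters, and I add a hand-written __repr__ (my own illustrative format) to show that it is kept:

```python
import inspect
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int
    value: float = 0.0

    # A hand-written __repr__ is kept; @dataclass does not overwrite it.
    def __repr__(self):
        return f"<Point at ({self.x}, {self.y})>"

print(list(inspect.signature(Point).parameters))  # ['x', 'y', 'value']
print(Point(3, 4))                                # <Point at (3, 4)>
print(Point(3, 4) == Point(3, 4, 0.0))            # True, __eq__ is still generated
```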
Advanced Usage
The syntax and semantics above are enough for basic usage. However, dataclass is much more powerful than you might think. It lets us control the generation behavior as a whole, as well as the behavior of each field.
@dataclass is a decorator, that is, a special kind of function. We can control the generation behavior by passing arguments to it. The most important arguments, in my opinion, are as follows (shown with their default settings):
@dataclass(
    init=True,     # generate __init__ method
    repr=True,     # generate __repr__ method
                   # default format: <classname>(field1=..., field2=..., ...)
    eq=True,       # generate __eq__; compares data classes like tuples of their fields
    order=False,   # generate __lt__, __le__, __gt__, __ge__ methods
    frozen=False,  # if True, assigning to fields raises an exception
)
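As a quick illustration of one of these flags, here is a minimal sketch of frozen=True. FrozenInstanceError is the exception class exported by the dataclasses module. The set at the end relies on the fact that a data class with both eq and frozen enabled also gets a generated __hash__.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Point:
    x: int
    y: int
    value: float = 0.0

p = Point(3, 4)
try:
    p.x = 10  # assignment to a field of a frozen instance
except FrozenInstanceError as exc:
    print("cannot assign:", type(exc).__name__)  # cannot assign: FrozenInstanceError

# With eq=True (the default) and frozen=True, instances are also hashable.
visited = {Point(3, 4), Point(3, 4), Point(0, 0)}
print(len(visited))  # 2
```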
Take order as an example: we want Point to be comparable, first by the coordinates (x, y) and then by the value.
from dataclasses import dataclass

@dataclass(order=True)
class Point:
    x: int
    y: int
    value: float

foo = Point(3, 4, 3.5) # __init__
bar = Point(3, 4, 4.5) # __init__
print(foo < bar) # __lt__
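Because the generated ordering methods compare instances like tuples of their fields, a list of points can be sorted directly. A minimal sketch, using astuple from the dataclasses module only to double-check the tuple-like ordering (the sample points are made up):

```python
from dataclasses import dataclass, astuple

@dataclass(order=True)
class Point:
    x: int
    y: int
    value: float

points = [Point(3, 4, 4.5), Point(1, 9, 0.0), Point(3, 4, 3.5)]
ordered = sorted(points)  # uses the generated __lt__

print(ordered[0])                              # Point(x=1, y=9, value=0.0)
print(ordered[-1])                             # Point(x=3, y=4, value=4.5)
print(ordered == sorted(points, key=astuple))  # True
```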
To control each field’s behavior, we use the field function from the dataclasses module. It also takes many arguments; you may refer to the official documentation for the full list. I will only cover the ones I consider important:
- default, default_factory: we use one of these arguments to set the default value. The former directly sets a default value, while the latter specifies a constructor that takes no arguments (for example, list, set, etc.).
- repr: should this field be included in the generated string representation?
Let’s say we now want the value field to hold a list of values for each coordinate:
from dataclasses import dataclass, field

@dataclass
class Point:
    x: int
    y: int
    value: list[float] = field(default_factory=list)

foo = Point(3, 4, [3.5, 4.5, 5.5]) # __init__
print(foo.__annotations__)
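The following sketch shows what default_factory and repr buy us in practice. The note field and the Bad class are hypothetical additions for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Point:
    x: int
    y: int
    value: list[float] = field(default_factory=list)
    note: str = field(default="", repr=False)  # hypothetical field hidden from __repr__

a = Point(1, 2)
b = Point(1, 2)
a.value.append(3.5)   # each instance gets its own list from the factory
print(a.value)        # [3.5]
print(b.value)        # []
print(a)              # Point(x=1, y=2, value=[3.5])

# A plain mutable default such as [] is rejected when the class is created.
rejected = False
try:
    @dataclass
    class Bad:
        value: list = []
except ValueError:
    rejected = True
print(rejected)       # True
```

The factory is called once per instance, so two points never share the same list by accident.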
Finally, let’s talk about inheritance. A class decorated with @dataclass is still an ordinary class, so one data class can inherit from another as we wish. What if both of them contain fields with the same name? For example:
from dataclasses import dataclass

@dataclass
class A:
    x: int = 1
    y: int = 2
    z: int = 5

@dataclass
class B(A):
    x: int = 3
    y: int = 4

foo = B()
print(foo)
# B(x=3, y=4, z=5)
It works like this [1]:
- The base classes are visited in reverse MRO, that is, starting from the object class, and the fields of every data class found along the way are collected.
- Finally, the data class’s own fields are added and merged into the result. If multiple fields have the same name, the later one overrides the earlier one.
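The merge can be observed with a small sketch that reuses the A and B classes from the example above, together with the fields helper from the dataclasses module:

```python
from dataclasses import dataclass, fields

@dataclass
class A:
    x: int = 1
    y: int = 2
    z: int = 5

@dataclass
class B(A):
    x: int = 3
    y: int = 4

# The MRO lists B first; the fields are collected in the reverse direction.
print([c.__name__ for c in B.__mro__])           # ['B', 'A', 'object']
print([(f.name, f.default) for f in fields(B)])  # [('x', 3), ('y', 4), ('z', 5)]
```

Note that x and y keep the positions they had in A; only their defaults are replaced by the ones from B.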
FAQ
How to differentiate instance variables and class variables in the data class?
Use type hints to distinguish between them: annotate a class variable with typing.ClassVar, and @dataclass will not treat it as a field.
from typing import ClassVar
from dataclasses import dataclass, field

@dataclass
class Point:
    x: int
    y: int
    value: list[float] = field(default_factory=list)
    a_class_variable: ClassVar[int] = 3

a_point = Point(3, 4)
print(Point.a_class_variable)
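To double-check that the ClassVar attribute is really left out, we can inspect the fields and the instance. This is a sketch reusing the class above, with the fields helper from the dataclasses module:

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar

@dataclass
class Point:
    x: int
    y: int
    value: list[float] = field(default_factory=list)
    a_class_variable: ClassVar[int] = 3

a_point = Point(3, 4)
print([f.name for f in fields(Point)])      # ['x', 'y', 'value']
print(a_point)                              # Point(x=3, y=4, value=[])
print("a_class_variable" in vars(a_point))  # False, it lives on the class
```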
vs namedtuple
Feature | @dataclass | collections.namedtuple |
---|---|---|
Instances of different types with the same field values compare equal (e.g. Point3D(2017, 6, 2) == Date(2017, 6, 2)) | ❌ | ✅ |
Set a default value for a field | ✅ | ✅ (through the defaults argument, Python 3.7+) |
Control each field’s behavior (__init__, __repr__, etc.) | ✅ | ❌ |
Merge fields by inheritance | ✅ | ❌ |
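The equality row can be verified with a short sketch. The namedtuple classes are named Point3DTuple and DateTuple here only to keep them apart from the data classes; the names are my own choice. The last part also shows the mutability difference.

```python
from collections import namedtuple
from dataclasses import dataclass

Point3DTuple = namedtuple("Point3DTuple", ["x", "y", "z"])
DateTuple = namedtuple("DateTuple", ["year", "month", "day"])

@dataclass
class Point3D:
    x: int
    y: int
    z: int

@dataclass
class Date:
    year: int
    month: int
    day: int

# namedtuples are tuples, so only the values matter for equality.
print(Point3DTuple(2017, 6, 2) == DateTuple(2017, 6, 2))  # True
# The generated __eq__ of a data class also requires the same class.
print(Point3D(2017, 6, 2) == Date(2017, 6, 2))            # False

p = Point3D(2017, 6, 2)
p.x = 0  # data class fields are mutable by default
t = Point3DTuple(2017, 6, 2)
try:
    t.x = 0
except AttributeError:
    print("namedtuple fields are read-only")
```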
Wrap-up
Python’s @dataclass lets us describe a data class in a declarative way. We only need to describe each field’s type, default value, etc., and @dataclass generates some useful methods for us automatically. The mental model might be: data class = mutable namedtuple with default values 👍