Nothing to declare - from NaN to None via null


Recently, I encountered an interesting bug in one of the projects I am working on. It is a distributed algorithm execution system, where we execute the algorithms in separate processes, and the intermediate results are passed between processes and stored as JSON objects.

One algorithm calculates numerical thresholds for another to use when processing data. We represent those thresholds as floating-point numbers.

We do the data processing in Python and use Pydantic for data validation and serialization. We have defined all algorithms in a single codebase, enabling us to share data models across processes.

The simplified version of the code looks like this:

from pydantic import BaseModel


class Thresholds(BaseModel):
    high: float
    low: float

# Algorithm 1
def calculate_thresholds(data) -> Thresholds:
    pass  # create Thresholds object from data input


# Algorithm 2
def process_data(data, thresholds: Thresholds):
    pass  # process data using thresholds
"""Process 1"""
calculated_thresholds = calculate_thresholds(data)
# Send calculated_thresholds to another process as JSON string
results_as_json_string = calculated_thresholds.model_dump_json()
send_results(
    results=results_as_json_string,
    source="algorithm_1"
)
"""Process 2"""
received_thresholds_as_json_string = receive_results(
    source="algorithm_1"
)
received_thresholds = Thresholds.model_validate_json(
    received_thresholds_as_json_string
)
process_data(data2, received_thresholds)

For most inputs, everything works fine. Thresholds are calculated correctly and passed between processes without issues. However, there was a particular input data set for which one of the thresholds had a “not a number” (NaN) value. The serialization went fine, but when the second process tried to deserialize the JSON string back into the Thresholds object, it failed with the following error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Thresholds
high
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]

How come the input value is None, not NaN? Why can’t Pydantic process a serialized value that it generated itself? We can easily reproduce the issue:

>>> thr = Thresholds(high=float('nan'), low=0.0)
>>> json_str = thr.model_dump_json()
>>> json_str
'{"high": null, "low": 0.0}'

>>> Thresholds.model_validate_json(json_str)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Thresholds
high
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]

Where does the null come from?

And most importantly, how to fix this issue?

We will answer these questions in this post. But first, let’s take a closer look at the involved concepts.

The IEEE 754 standard defines floating-point numbers. There are many great resources explaining the details of this standard; I highly recommend floats.exposed and How Integers and Floats Work.

Briefly, a floating-point number is represented by three components: sign, exponent, and mantissa. This representation allows for the expression of a wide range of values, including special values such as positive and negative infinity, and “not a number” (NaN). NaN is used to represent undefined or unrepresentable values, such as the product of zero and infinity.

>>> import math
>>> math.inf * 0
nan

Internally, NaN is represented by a specific bit pattern in the floating-point representation, which has all exponent bits set to 1 and a non-zero mantissa.
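We can inspect this bit pattern with the struct module. On typical CPython builds, float('nan') is the quiet NaN 0x7ff8000000000000, although the exact mantissa bits are platform-dependent:

>>> import struct
>>> bits, = struct.unpack('<Q', struct.pack('<d', float('nan')))
>>> f'{bits:064b}'
'0111111111111000000000000000000000000000000000000000000000000000'

Here the sign bit is 0, the eleven exponent bits are all 1, and the 52-bit mantissa is non-zero (only its leading bit is set).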

An interesting property of NaN is that it is not equal to any value, including itself. Moreover, any ordered comparison with NaN returns False. In the context of the Thresholds model, NaN is a valid float value that we use to indicate that a threshold is undefined or not applicable, precisely because every comparison with it returns False.

>>> thr = Thresholds(high=float('nan'), low=0.0)
>>> thr.high > 10.0
False
>>> thr.high == 10.0
False
>>> thr.high < 10.0
False

In Python, we create NaN using float('nan') or math.nan.
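And since NaN is not equal even to itself, we cannot detect it with an equality check; math.isnan is the reliable test:

>>> float('nan') == float('nan')
False
>>> import math
>>> math.isnan(float('nan'))
True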

JSON (JavaScript Object Notation) is a lightweight data interchange format. It is widely used for data exchange between systems, especially in web applications.

JSON supports four data types as values: string, number, object, and array. Additionally, three literal values are supported: true, false, and null.

RFC 8259 defines them as follows:

3.  Values
   A JSON value MUST be an object, array, number, or string, or one of
   the following three literal names:
      false
      null
      true
   The literal names MUST be lowercase.  No other literal names are
   allowed.
      value = false / null / true / object / array / number / string
      false = %x66.61.6c.73.65   ; false
      null  = %x6e.75.6c.6c      ; null
      true  = %x74.72.75.65      ; true

From the above section, we can draw two important conclusions:

  • In JSON, there is no explicit representation for a floating-point number, but only a generic “number” type (the same as in JavaScript).
  • There is no representation for the NaN value in the JSON standard.

Since there is no way to represent a floating-point NaN value in JSON, serializers must choose an alternative representation. A common choice is to serialize NaN as null; others, including Python’s built-in json module, emit a non-standard NaN literal instead (more on that below).

In Python, None is a special object that represents the absence of a value, i.e., a null value. It is used in various places in Python, such as the default return value of functions that don’t explicitly return anything, or as a placeholder for optional function arguments. You can find more details about None in Python in the Real Python article.

Most importantly in our context, None is used when deserializing JSON null values into Python objects.

>>> import json
>>> json.loads('{"value": null}')
{'value': None}

Now that we have a basic understanding of the concepts involved, let’s analyze the issue we encountered.

When Pydantic serializes the Thresholds object with a NaN value to a JSON string, it must choose a representation for NaN, since the JSON standard does not support it.

Pydantic v2 uses its own serialization engine called pydantic-core. It’s written in Rust and provides high-performance serialization and deserialization. You can find the relevant source code here.

By default, pydantic-core serializes NaN float values as JSON null, the same choice that JavaScript’s JSON.stringify makes.

When deserializing the JSON string back into the Thresholds object, Pydantic encounters the null value for the high field. According to JSON deserialization rules, null is mapped to Python None. However, since we have defined the high field as a float, and None is not a valid float value, the validation fails with the error message: "Input should be a valid number."

Knowing what we already know about NaN, null, and None, the error message makes sense now.

In my opinion, viewed from a formal perspective, serialization and deserialization should be inverse functions of each other:

if f(x) = y, then f_inverse(y) = x

But in our case, this does not hold. The following code throws a validation error instead of returning the original Thresholds object.

>>> Thresholds.model_validate_json(
...     Thresholds(high=float("nan"), low=0.0).model_dump_json()
... )
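For ordinary float values, the property does hold; it is specifically NaN that breaks it:

>>> thr = Thresholds(high=1.0, low=0.0)
>>> Thresholds.model_validate_json(thr.model_dump_json()) == thr
True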

Luckily, there are several ways to solve this issue.

None of them is perfect, so we have to choose the one that best fits our use case.

RFC 8259 explicitly states that the JSON standard supports only true, false, and null as literal values. In practice, however, many JSON libraries and parsers support additional literals such as NaN, Infinity, and -Infinity.

In fact, this is the case with Python’s built-in json module, which can serialize and deserialize NaN values thanks to the allow_nan parameter (set to True by default).

>>> import json
>>> json_str = json.dumps(
...     {'value': float('nan')},
...     allow_nan=True
... )
>>> json_str
'{"value": NaN}'

Pydantic-core also supports this behavior, but it is not the default. We can enable it in the model configuration by setting the ser_json_inf_nan parameter to 'constants':

from pydantic import ConfigDict

class ThresholdsConst(Thresholds):
    model_config = ConfigDict(ser_json_inf_nan='constants')

thr = ThresholdsConst(high=float('nan'), low=float('inf'))
json_str = thr.model_dump_json()
print(json_str)  # '{"high": NaN, "low": Infinity}'

When deserializing, both Pydantic and the built-in json module will correctly interpret the NaN and Infinity values.

>>> json_str = '{"high": NaN, "low": 0.1}'
>>> thr = Thresholds.model_validate_json(json_str)
>>> print(thr)
high=nan low=inf

>>> json.loads(json_str)
{'high': nan, 'low': inf}

Seems like a perfect solution. And it might be, as long as we stay within the Python ecosystem.

However, if we exchange the JSON data with other systems or languages, such as JavaScript for visualizing the results or a document database for storage, we might run into issues. Many JSON parsers in other languages do not support NaN and Infinity values and will fail to parse such JSON strings.

>>> JSON.parse('{"value": NaN}') 
Uncaught SyntaxError: JSON.parse: unexpected character at line 1 column 11 of the JSON data

The second approach is to define the model fields as optional, allowing them to be None.


class ThresholdsOpt(Thresholds):
    high: float | None
    low: float | None

thr = ThresholdsOpt(high=float('nan'), low=0.0)
json_str = thr.model_dump_json()
print(json_str)  # '{"high": null, "low": 0.0}

thr2 = ThresholdsOpt.model_validate_json(json_str)
print(thr2)  # high=None low=0.0

With this solution, the serialization and deserialization flow also works correctly. Additionally, we are fully compliant with the JSON standard, since we only use null to represent missing values.

However, this solution also has some drawbacks.

The model fields are no longer strictly floats, so we lose some type safety.

For example, we cannot guarantee that the comparison operations will behave as expected, since we can’t compare None with float values.

>>> thr = ThresholdsOpt(high=None, low=0.0)
>>> thr.high > 10.0
TypeError: '>' not supported between instances of 'NoneType' and 'float'

Moreover, the inversion property of the serialization and deserialization functions is still not satisfied, since NaN is serialized as null but deserialized back to None.

To preserve the NaN semantics, we would have to handle the conversion between None and NaN manually in our code. This makes our life more difficult, and the code more error-prone.
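As an illustration, here is a minimal sketch of such manual handling (none_to_nan is a hypothetical helper name, not part of Pydantic):

def none_to_nan(value: float | None) -> float:
    # Restore the NaN semantics that were lost in the null/None round trip
    return float('nan') if value is None else value

thr = ThresholdsOpt.model_validate_json('{"high": null, "low": 0.0}')
high = none_to_nan(thr.high)
print(high > 10.0)  # False, as with a genuine NaN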

The third approach is to define a custom field type that can handle NaN values explicitly.

Pydantic provides a way to define custom types using Annotated type hints.

import math
from typing import Annotated

from pydantic import BaseModel, BeforeValidator, PlainSerializer

FloatNaN = Annotated[
    float,
    PlainSerializer(lambda v: None if math.isnan(v) else v, when_used="json"),
    BeforeValidator(lambda v: float("nan") if v is None else v),
]

class ThresholdsCustom(BaseModel):
    high: FloatNaN
    low: FloatNaN

thr = ThresholdsCustom(high=float('nan'), low=0.0)
json_str = thr.model_dump_json()
print(json_str)  # '{"high": null, "low": 0.0}'
thr2 = ThresholdsCustom.model_validate_json(json_str)
print(thr2)  # high=nan low=0.0

This way, we have complete control over the serialization and deserialization process. We can explicitly define how we handle the NaN values and ensure that we satisfy the inversion property.

In the example above, we define a custom type, FloatNaN, that uses a PlainSerializer to convert NaN to None, which it later serializes as null in JSON. During deserialization, we use a BeforeValidator to convert None back to NaN.

We maintain the type safety of the float fields, remain compliant with the JSON standard, and preserve the semantics of floating-point NaN.
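We can quickly confirm that the deserialized value behaves like a genuine NaN again:

print(math.isnan(thr2.high))  # True
print(thr2.high > 10.0)       # False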

What is more, we can easily reuse the FloatNaN type in other models or as a part of more complex data structures.
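For example, a hypothetical SensorReport model (the name and fields are invented for illustration) can nest ThresholdsCustom and use FloatNaN inside a list:

class SensorReport(BaseModel):
    thresholds: ThresholdsCustom
    readings: list[FloatNaN]

report = SensorReport(
    thresholds=ThresholdsCustom(high=float('nan'), low=0.0),
    readings=[1.5, float('nan'), 2.5],
)
print(report.model_dump_json())
# '{"thresholds": {"high": null, "low": 0.0}, "readings": [1.5, null, 2.5]}'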

The approach even allows non-Python systems to parse the JSON strings correctly, although they will have to handle the conversion between null and NaN themselves.

Finally, if we also want to support optional FloatNaN fields, where None and NaN must remain distinguishable, we can achieve that by defining different serialization and deserialization logic in the custom type.

For example, we can use special string values to represent NaN values in JSON.

FloatNaNStr = Annotated[
    float,
    PlainSerializer(lambda v: "NaN" if math.isnan(v) else v, when_used="json"),
    BeforeValidator(lambda v: float("nan") if v == "NaN" else v),
]
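For example, combining FloatNaNStr with an optional field keeps null and NaN distinguishable. A sketch, with ThresholdsMixed as a hypothetical model name:

class ThresholdsMixed(BaseModel):
    high: FloatNaNStr | None
    low: FloatNaNStr | None

thr = ThresholdsMixed(high=float('nan'), low=None)
print(thr.model_dump_json())  # '{"high": "NaN", "low": null}'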

By the way, Pydantic also supports this approach out of the box: the ser_json_inf_nan parameter can be set to 'strings' to serialize NaN and Infinity as strings.
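A minimal sketch, mirroring the 'constants' example above (ThresholdsStr is a hypothetical subclass for illustration):

class ThresholdsStr(Thresholds):
    model_config = ConfigDict(ser_json_inf_nan='strings')

thr = ThresholdsStr(high=float('nan'), low=float('inf'))
print(thr.model_dump_json())  # '{"high": "NaN", "low": "Infinity"}'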

We have explored serializing and deserializing NaN float values in Python using Pydantic and JSON. We can identify the root cause of the issue as the lack of NaN representation in the JSON standard, which explains why NaN is serialized as null but deserialized back to None.

We have also discussed three possible solutions to this issue:

  1. Serializing NaN and Infinity as non-standard JSON literal values (ser_json_inf_nan='constants').
  2. Defining the model fields as optional, allowing them to be None.
  3. Creating a custom field type that handles NaN values explicitly.

Would I always choose the custom field approach? Not necessarily. Each solution has its own pros and cons, and as The Zen of Python says: “Simple is better than complex”.

When I am sure that I will only use the JSON data within the Python ecosystem, and the intermediate brokers or storage systems only treat the data as opaque strings, I would go with the first solution, as it is the simplest one.

Otherwise, I would choose the custom field approach, as it provides the most flexibility and control over the serialization and deserialization process.