Common Issues

pandas type system

When using pandas as the DataFrame library, the history of its type system can be a common source of confusion. Since pantab 4.0, pantab has used the Arrow C Stream interface rather than the pandas type system directly.

Technically, the way in which pandas types are mapped to Arrow types to fit the Arrow C Stream interface is an implementation detail of pandas and cannot be documented here. However, general guidance for different data types is provided in each section below. When using pandas I/O methods, you may want to use dtype_backend="pyarrow" to get the best types by default (you can see that keyword argument documented in the read_csv documentation).

Object Types

Generally you should avoid having columns with dtype=object in your pandas DataFrame. For more information see the Strings section below.

Integral Types

Signed integer types map relatively well to Hyper types regardless of which pandas “backend” you use. Note that the int8 -> SMALLINT mapping does not round-trip: when you read a SMALLINT back from Hyper you will always get a 16-bit integer type.

numpy    pandas   pyarrow          Hyper
int8     Int8     int8[pyarrow]    SMALLINT
int16    Int16    int16[pyarrow]   SMALLINT
int32    Int32    int32[pyarrow]   INTEGER
int64    Int64    int64[pyarrow]   BIGINT

Unsigned integer types are generally not supported; users are expected to bounds-check their data and cast to an appropriately sized signed type. The only exception to this rule is the 32-bit unsigned integer, which is written to Hyper as an OID type.
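
A minimal sketch of the bounds-check-and-cast approach for an unsigned column (the column name here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": np.array([0, 255, 42], dtype="uint8")})

# Verify the values fit in the signed target type before casting
assert df["val"].max() <= np.iinfo("int16").max

# uint8 values fit safely in int16, which maps to Hyper's SMALLINT
df["val"] = df["val"].astype("int16")
print(df["val"].dtype)  # int16
```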

Floating-point Types

Hyper only supports double-precision floating-point types. Single-precision floats (often called float32 in the NumPy / pandas world) are upcast to double.

numpy    pandas   pyarrow           Hyper
float32  Float32  float[pyarrow]    DOUBLE PRECISION
float64  Float64  double[pyarrow]   DOUBLE PRECISION

Strings

Much can be written about strings and their history in pandas. In practice, users work with “strings” that are actually any of the following dtypes:

  • object

  • “string” (starting in 1.0)

  • “string[pyarrow]” (starting in 1.3)

  • “string[pyarrow_numpy]” (starting in 2.1)

object is a historical relic and should be avoided where possible. As far as pantab is concerned, string[pyarrow] is the best string dtype, as it always maps seamlessly to the Arrow C Stream interface. How the other types manage this is an implementation detail of pandas.

Temporal Types

pandas has historically offered only a Timestamp data type. This is problematic when writing to databases, where DATE and TIMESTAMP are commonly distinct types.

If you would like to write DATE types to Hyper, your best bet is to .astype("date32[pyarrow]") those columns to convert them into true date types. Do not use the pandas .dt.date accessor, as this returns an object dtype.

Please also note that the default unit of precision for the pd.Timestamp type is nanoseconds since the Unix epoch. Since pandas 2.0 the as_unit method can be used to convert to a different unit. pyarrow supports different units as well, i.e. you can do .astype("timestamp[us][pyarrow]") to convert to a microsecond-precision timestamp based on the Unix epoch. Hyper stores timestamps using microsecond precision going back to the first midnight after the Julian epoch.

Binary

NumPy and pandas do not natively support a binary array type. As such, if you are trying to write binary data, your only option is the binary[pyarrow] data type.