Common Issues

pandas type system

When using pandas as the DataFrame library, the history of its type system is a common source of confusion. Since pantab 4.0, pantab no longer uses the pandas type system directly; it exchanges data through the Arrow C Stream interface instead.

Strictly speaking, how pandas maps its types to Arrow types for the Arrow C Stream interface is an implementation detail of pandas and cannot be documented here. However, general guidance for each family of data types is provided in the sections below. When using pandas I/O methods, you may want to pass dtype_backend="pyarrow" to get the best types by default (that keyword argument is described in the read_csv documentation).

Object Types

Generally you should avoid having columns with dtype=object in your pandas DataFrame. For more information, see the Strings section.
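One way to spot such columns before writing is to compare each column's dtype against object. This is only a sketch; the DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one explicitly object-typed column.
df = pd.DataFrame(
    {
        "amount": pd.Series([1, 2], dtype="int64"),
        "label": pd.Series(["x", "y"], dtype=object),
    }
)

# Collect the names of all columns whose dtype is object.
object_columns = [name for name in df.columns if df[name].dtype == object]
print(object_columns)
```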

Integral Types

Signed integer types map relatively well to Hyper types regardless of which pandas “backend” you use. Note that the int8 -> SMALLINT mapping does not round-trip: when you read a SMALLINT back from Hyper, you will always get a 16-bit integer type.

numpy	pandas	pyarrow	Hyper
int8	Int8	int8[pyarrow]	SMALLINT
int16	Int16	int16[pyarrow]	SMALLINT
int32	Int32	int32[pyarrow]	INTEGER
int64	Int64	int64[pyarrow]	BIGINT

Generally unsigned types are not supported; users are expected to bounds check their data and choose an appropriately sized signed type. The only exception to this rule is the 32-bit unsigned integer, which is written to Hyper as an OID type.
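The bounds check can be sketched as below. The helper name to_signed is hypothetical and not part of pantab or pandas; it simply refuses to cast when the largest value would not fit in the requested signed dtype.

```python
import numpy as np
import pandas as pd


def to_signed(series: pd.Series, target: str = "int64") -> pd.Series:
    """Hypothetical helper: cast an unsigned column to a signed dtype after a bounds check."""
    limit = int(np.iinfo(np.dtype(target)).max)
    # Compare as Python ints to avoid any overflow in the comparison itself.
    if len(series) and int(series.max()) > limit:
        raise OverflowError(f"values exceed the range of {target}")
    return series.astype(target)


counts = pd.Series([0, 255, 65535], dtype="uint16")
signed = to_signed(counts, "int32")
print(signed.dtype)
```

Choosing the next wider signed type (for example uint16 to int32) always leaves enough headroom; the check matters mostly for uint64 data, which has no wider signed counterpart.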

Floating-point Types

Hyper only supports double precision floating point types. float types (often called float32 in the NumPy / pandas world) are upcast to double.

numpy	pandas	pyarrow	Hyper
float32	Float32	float[pyarrow]	DOUBLE PRECISION
float64	Float64	double[pyarrow]	DOUBLE PRECISION
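The upcast itself is exact, but a widened float32 value generally does not equal the double closest to the original decimal literal. The sketch below illustrates this with plain pandas and NumPy, without involving Hyper; the sample value is arbitrary.

```python
import numpy as np
import pandas as pd

single = pd.Series([0.1], dtype="float32")
double = single.astype("float64")  # the same widening Hyper storage implies

# The widened value differs from the float64 literal 0.1 ...
print(double.iloc[0] == 0.1)

# ... but narrowing it again recovers the original float32 value exactly.
print(np.float32(double.iloc[0]) == single.iloc[0])
```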

Strings

Much can be written about strings and the history of them in pandas. Generally users use “strings” that are actually any of the following dtypes:

object
“string” (starting in 1.0)
“string[pyarrow]” (starting in 1.3)
“string[pyarrow_numpy]” (starting in 3.0)

object is a historic relic and should be avoided where possible. As far as pantab is concerned, using string[pyarrow] is the best string dtype as that will always map seamlessly to the Arrow C Stream interface. How the other types manage this is an implementation detail of pandas.

Temporal Types

pandas historically has only ever had a Timestamp data type. This has been problematic when writing to databases where DATE and TIMESTAMP are commonly different types.

If you would like to write DATE types to Hyper, your best bet is to call .astype("date32[pyarrow]") on those columns to convert them into true date types. Do not use the pandas .dt.date accessor, as this returns an object dtype.

Please also note that the default unit of precision for the pd.Timestamp type is nanoseconds since the Unix epoch. Since pandas 2.0, the as_unit method can be used to convert to a different unit. pyarrow supports different units as well; for example, .astype("timestamp[us][pyarrow]") converts to a microsecond-precision timestamp based on the Unix epoch. Hyper stores timestamps using microsecond precision going back to the first midnight after the Julian epoch.

Binary

NumPy / pandas do not coordinate to support a binary array type. As such, if you are trying to write binary data, your only option is to use the binary[pyarrow] data type.