AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema

amazon-athena, hive, parquet, pyarrow

I use AWS Athena to query some data stored in S3, namely partitioned Parquet files written with pyarrow.

I have three columns with string values, one column called "key" with int values, and one column called "result" that holds both double and int values.

With those columns, I created a schema like:

create external table (
    key int,
    result double,
    location string,
    vehicle_name string,
    filename string
)

When I queried the table, I would get

HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema

So I modified the schema, changing the result column's data type to INT.

Then I queried the table and got:

HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema

I've looked around to try to understand why this might happen but found no solution.

Any suggestion is much appreciated.

Best Answer

It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the table column as DOUBLE, Athena will eventually read a file where the corresponding column is int and throw this error, and vice versa when you type the table column as INT.
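One quick way to confirm this is to read just the footer of each Parquet file and compare the types. A minimal sketch with pyarrow (the file paths below are placeholders; in practice you'd list the objects under your S3 prefix, e.g. with boto3 or s3fs, and check each one):

import pyarrow.parquet as pq

# Placeholder paths; substitute the actual files from your dataset.
paths = ["part-0000.parquet", "part-0001.parquet"]

for path in paths:
    schema = pq.read_schema(path)              # reads only the file footer, not the data
    result_type = schema.field("result").type  # e.g. "double" in some files, "int64" in others
    print(f"{path}: result is {result_type}")

Any file whose "result" type disagrees with the table schema is one that will trigger HIVE_BAD_DATA.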

Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).
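For example, a 64-bit double has only 53 bits of significand, so sufficiently large integers collapse to the same value:

print(float(2**53) == float(2**53 + 1))  # True: 2**53 + 1 has no exact double representation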

The solution is to make sure your files all have the same schema. You probably need to be explicit, in the code that produces the files, about which schema to write (e.g. make the result column always DOUBLE).
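If the files are produced with pyarrow, one way to do that is to declare the schema once and cast every table to it before writing. This is only a sketch under assumptions about your pipeline: the column names mirror the DDL above, and the output path and partition column ("location") are placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

# One fixed schema for every file, regardless of what types a given batch happens to carry.
schema = pa.schema([
    ("key", pa.int32()),
    ("result", pa.float64()),   # always DOUBLE, even when a batch only contains ints
    ("location", pa.string()),
    ("vehicle_name", pa.string()),
    ("filename", pa.string()),
])

def write_batch(table: pa.Table, root_path: str) -> None:
    # cast() coerces e.g. an int64 "result" column to float64 before it hits disk;
    # it expects the table's columns to match the schema by name and order.
    table = table.cast(schema)
    pq.write_to_dataset(table, root_path=root_path, partition_cols=["location"])

Note that any existing files already written with the int schema still need to be rewritten; Athena will keep failing on them otherwise.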
