I was running an AWS Glue job where I was reading Parquet files from an S3 bucket. When I loaded the files individually there were no problems, but when I loaded the entire directory and tried to do any sort of transformation on the data I got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 23, ip-172-31-4-9.eu-west-1.compute.internal, executor 1): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://...
I couldn't find any useful information about this online, so thought I'd post the solution here in case anyone else has the same problem.
The problem was that some of the files had columns which were entirely Null and apparently Spark doesn't like that. Reading the files individually I guess it probably read the schema from the file, but reading them as a whole apparently caused errors. I solved the problem by dropping any Null columns before writing the Parquet files.