For schema evolution, mergeSchema can be used in Spark with the Parquet file format. I have the following questions about it:
Does this support only the Parquet file format, or also other file formats such as CSV or TXT files?
If new columns are added in between existing ones, I understand mergeSchema will move the new columns to the end.
And if the column order is disturbed, will mergeSchema align the columns to the order used when the table was created, or do we need to do this manually by selecting all the columns?
Update from comment:
For example, suppose I have the schema below and create the table with: spark.sql("CREATE TABLE emp USING DELTA LOCATION '****'")
empid,empname,salary ====> 001,ABC,10000
and the next day I get the format below:
empid,empage,empdept,empname,salary ====> 001,30,XYZ,ABC,10000
Will the new columns empage and empdept be added after the empid, empname, salary columns?
Best Answer
Q: 1. Does this support only the Parquet file format, or other file formats like CSV and TXT files? 2. If the column order is disturbed, will mergeSchema align the columns to the order used when the table was created, or do we need to do this manually by selecting all the columns?
AFAIK, schema merging is supported for Parquet but not for text formats like CSV or TXT. (Spark 3.0 and later also offer a mergeSchema option for ORC.)
mergeSchema (spark.sql.parquet.mergeSchema) will align the columns in the correct order even if they are disturbed, because Parquet columns are matched by name rather than by position. Example from the Spark documentation on Parquet schema merging:
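The schema-merging example in the Spark documentation is written in Scala. The following is a PySpark sketch along the same lines, not a verbatim copy. The function name `merge_schema_demo` and the `base_path` argument are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def merge_schema_demo(spark, base_path):
    """Write two Parquet partitions with different columns, then read them merged.

    `spark` is an existing pyspark.sql.SparkSession; `base_path` is any
    writable scratch directory (hypothetical, chosen by the caller).
    """
    # Partition key=1 holds the columns (value, square).
    squares = spark.range(1, 6).selectExpr("id AS value", "id * id AS square")
    squares.write.mode("overwrite").parquet(f"{base_path}/key=1")

    # Partition key=2 adds a new column (cube) and has no square column.
    cubes = spark.range(6, 11).selectExpr("id AS value", "id * id * id AS cube")
    cubes.write.mode("overwrite").parquet(f"{base_path}/key=2")

    # mergeSchema=true asks Spark to combine the per-file schemas by column name.
    merged = spark.read.option("mergeSchema", "true").parquet(base_path)
    merged.printSchema()
    return merged


# Usage (requires pyspark and a Java runtime):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[*]").getOrCreate()
#   merged = merge_schema_demo(spark, "/tmp/merge_schema_demo")
```

The merged schema should contain value, square, cube, and the partition column key. Rows from a partition that lacks a column show null for it.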
UPDATE: Real example given by you in the comment box...
Answer: Yes, empage and empdept were added after empid, empname, salary, followed by your day column.
See the full example.
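A minimal PySpark sketch of this scenario (not the original example from this answer), assuming each day's data is written to its own day=N partition directory under a hypothetical `base_path`. The helper `ordered_columns` and the function `emp_merge_demo` are illustrative names. Because the merged column order can depend on how Spark reads the files, the sketch pins the order with an explicit select rather than relying on it.

```python
def ordered_columns(preferred, actual):
    """Return the `preferred` names found in `actual`, then any remaining columns."""
    present = [c for c in preferred if c in actual]
    extras = [c for c in actual if c not in preferred]
    return present + extras


def emp_merge_demo(spark, base_path):
    """Write two days of emp data with different layouts and read them merged.

    `spark` is an existing pyspark.sql.SparkSession; `base_path` is a
    hypothetical scratch directory.
    """
    # Day 1: the original three columns.
    day1 = spark.createDataFrame(
        [("001", "ABC", 10000)], ["empid", "empname", "salary"]
    )
    day1.write.mode("overwrite").parquet(f"{base_path}/day=1")

    # Day 2: two new columns arrive in the middle of the layout.
    day2 = spark.createDataFrame(
        [("001", 30, "XYZ", "ABC", 10000)],
        ["empid", "empage", "empdept", "empname", "salary"],
    )
    day2.write.mode("overwrite").parquet(f"{base_path}/day=2")

    # Columns are matched by name; day=1 rows get null for empage and empdept.
    merged = spark.read.option("mergeSchema", "true").parquet(base_path)

    # Select explicitly so the column order never depends on file read order.
    wanted = ["empid", "empname", "salary", "empage", "empdept", "day"]
    return merged.select(*ordered_columns(wanted, merged.columns))
```

Calling `emp_merge_demo(spark, "/tmp/emp_demo").show()` with a local SparkSession prints both days in the pinned column order.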
Result :
Directory tree :