Bash – Merge CSV files using join/awk/sed

awkbashcsvjoin;sed

Could you please help me to find THE bash command which will join/merge those following cvs files "template.csv + file1.csv + file2.csv + file3.csv + … + fileX.csv" into "ouput.csv".

For each line in template.csv, concatenate associated values (if exist) listed in the fileX.csv as below:

template.csv:

header
1
2
3
4
5
6
7
8
9

file1.csv:

header,value1
2,value12
3,value13
7,value17
8,value18
9,value19

file2.csv:

header,value2
1,value21
2,value22
3,value23
4,value24

file3.csv:

header,value3
2,value32
4,value34
6,value36
7,value37
8,value38

output.csv:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

My template file is containing 35137 lines.
I already developed a bash script doing this merge (based on "do while", etc…) but the performance is not good at all. Too long to make the output.csv. I'm sure that it is possible to do the same using join, awk, … but I don't see how …

IMPORTANT UPDATE

The first column of my real files are containing a datetime and not a simple number … so the script must take into account the space between the date and the time … sorry for the update !

Script should be now designed with the below csv files as example:

template.csv:

header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00

file1.csv:

header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19

file2.csv:

header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24

file3.csv:

header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38

output.csv:

header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Best Answer

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR { key[++numRows] = $1 }
{ fld[$1,ARGIND] = $NF }
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", fld[key[rowNr],colNr], (colNr<ARGIND ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk template.csv file1.csv file2.csv file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

The above uses GNU awk for ARGIND, with other awks just add a line that says FNR==1 { ++ARGIND }.

Related Solutions

How to replace a newline (\n) using sed

sed is intended to be used on line-based input. Although it can do what you need.

A better option here is to use the tr command as follows:

tr '\n' ' ' < input_filename

or remove the newline characters entirely:

tr -d '\n' < input.txt > output.txt

or if you have the GNU version (with its long options)

tr --delete '\n' < input.txt > output.txt

R – How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

~~Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.~~ I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

Best Answer

Related Solutions

How to replace a newline (\n) using sed

R – How to join (merge) data frames (inner, outer, left, right)

Related Topic