Bash: loop over 20000 files slow – why

bash · ext3 · reiserfs

A simple loop over a lot of files runs half as fast on one system as on the other.

Using bash, I did something like:

for f in ./*
do
   something here
done

Using "time" I was able to confirm, that on system2 the "something here"-part runs faster than on system1. Nevertheless, the whole loop on system 2 takes double as long as on system1. Why? …and how can I speed this up?

There are about 20000 (text) files in the directory. Reducing the number of files to about 6000 speeds things up significantly. These findings stay the same regardless of the looping method (replacing the for loop with a find command, or even putting the filenames in an array first).
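For reference, those alternative looping methods look roughly like this (again with wc -l as a placeholder for the per-file work):

 # find-based variant: let find produce the file list
 find . -maxdepth 1 -name '*html' -print0 |
 while IFS= read -r -d '' f
 do
     wc -l "$f" > /dev/null
 done

 # array variant: collect the filenames first, then loop over the array
 files=( *html )
 for f in "${files[@]}"
 do
     wc -l "$f" > /dev/null
 done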

System1: Debian (in an OpenVZ VM, using reiserfs)
System2: Ubuntu (native, faster processor than system1, faster RAID 5 too, using ext3 and ext4 – results stay the same)

So far I think I have ruled out: hardware (system2 should be way faster), userland software (bash, grep, awk, and find are the same versions), and .bashrc (no spiffy config there).

So is it the filesystem? Can I tweak ext3/4 so that it gets as fast as reiserfs?

Thanks for your recommendations!

Edit:
Ok, you're right, I should have provided more info. Now I have to reveal my beginner-level bash, but here we go:

 # total number of files, used in the progress message below
 ELEMS=$(ls *html | wc -l)

 declare -a UIDS NAMES TEMPS ANGLEAS ANGLEBS
 ELEM=0
 for i in *html
 do
         # get UID from the filename (bash's own UID variable is read-only, hence a different name)
         fuid=${i%-*html}
         UIDS[$ELEM]=$fuid

         # get Name: last line containing "name":" , second field
         NAME=$(awk -F, '/"name":"/ { lines[last] = $0 } END { print lines[last] }' "$i" | awk '{ print $2 }')
         NAME=${NAME##\[*\"}
         NAMES[$ELEM]=$NAME

         echo "getting values for [$fuid] ($ELEM of $ELEMS)"

         # each of these reads the same file again, strips HTML tags and punctuation, then picks a field
         TEMPS[$ELEM]=$(awk -F, '/Temperature/ { lines[last] = $0 } END { print lines[last] }' "$i" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }')
         ANGLEAS[$ELEM]=$(awk -F, '/Angle A/ { lines[last] = $0 } END { print lines[last] }' "$i" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }')
         ANGLEBS[$ELEM]=$(awk -F, '/Angle B/ { lines[last] = $0 } END { print lines[last] }' "$i" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }')
         ### about 20 more lines like these ^^^
         ((ELEM++))
 done

Yes, the problem is that I have to read each file about 20 times, but putting the content of the file in a variable (FILE=$(cat $i)) removes the line breaks and then I can't use awk anymore…? Maybe I tried that wrong, so if you have a suggestion for me, I'd be grateful.
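For what it's worth, a minimal sketch of the read-once idea: command substitution does keep the line breaks, they only disappear when the variable is later expanded without quotes, so each file only has to be read from disk once (the field names just mirror the script above):

 for i in *html
 do
     content=$(cat "$i")   # read the file from disk once

     # quoting "$content" preserves the line breaks for awk
     NAME=$(printf '%s\n' "$content" | awk -F, '/"name":"/ { lines[last] = $0 } END { print lines[last] }' | awk '{ print $2 }')
     TEMP=$(printf '%s\n' "$content" | awk -F, '/Temperature/ { lines[last] = $0 } END { print lines[last] }' | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }')
     # ...remaining fields extracted from "$content" the same way...
 done

This still launches the same number of awk/sed/tr processes per file, so folding all the extractions into a single awk pass per file would cut the overhead further.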

Still, the problem remains that reading a file in that directory just takes too long…

On the hardware question: system1 runs on hardware that is over five years old, while system2 is two months old. Yes, the specs are quite different (different mainboards, processors, etc.), but system2 is way faster in every other respect, and raw read/write rates to the filesystem are faster too.

Best Answer

Depends on what you're doing exactly, but yes, ext filesystems get slow when you've got a lot of files in one directory. Splitting the files into e.g. numbered subdirectories is one common way around this.
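A rough sketch of that splitting approach, bucketing by the first two characters of the filename (the bucket scheme is an arbitrary choice):

 # move the files into subdirectories so that no single directory gets huge
 for f in *html
 do
     bucket=${f:0:2}
     mkdir -p "$bucket"
     mv -- "$f" "$bucket/"
 done

On ext3 specifically, checking that the dir_index feature is enabled (tune2fs -O dir_index on the device, followed by e2fsck -fD on the unmounted filesystem) is another commonly mentioned tweak for large directories.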
