Using XARGS to speed up batch processing


[Edit 24/7/2013] Be careful when using xargs to spawn multiple processes that write to the same file. I've been using it with md5sum and piping the output to a file. In data sets with a large number of files, several hundred hashes would be missing in the final file. A more reliable way is to use md5deep.
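
A quick sanity check for this problem is to compare the number of files found with the number of hash lines written. The following is only a sketch: it assumes GNU find, xargs and md5sum, file names without newlines, and a hypothetical scratch directory and output file created with mktemp.

```shell
# Scratch data: five small files in a temporary directory (hypothetical names).
workdir=$(mktemp -d)
hashfile=$(mktemp)          # kept outside workdir so find does not pick it up
for i in 1 2 3 4 5; do
  printf 'sample %s\n' "$i" > "$workdir/file$i.txt"
done

# Count the files we expect to hash (assumes no newlines in file names).
expected=$(find "$workdir" -type f | wc -l)

# Hash in parallel, with all processes writing to the same output file.
find "$workdir" -type f -print0 | xargs -0 -P4 -n1 md5sum > "$hashfile"

# Count the hash lines that actually arrived in the output file.
actual=$(wc -l < "$hashfile")

if [ "$actual" -ne "$expected" ]; then
  echo "warning: expected $expected hashes, got $actual" >&2
else
  echo "all $expected files hashed"
fi
```

If the two counts differ, some hashes were lost and the run should not be trusted.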

In our last post we compared md5sum, md5deep and openssl md5 to determine which is fastest at calculating hashes. For that experiment we were using 1 thread. Essentially, a file comes in, it gets hashed, then the next file comes in. With 1 thread, we saw only 1 of the 4 processors being used.

However, if we are processing a list of files, why not split the work into multiple jobs and use all of the processors? md5deep does this by default (4 threads), but splitting the jobs for any task is easily done with xargs.

xargs is a program that lets you control how input arriving from a pipe is turned into command arguments. For example, if we type 'find -type f | md5sum', md5sum hashes the list of file names itself as one blob of data, not the files. To give each file name to md5sum as an argument, we can use xargs to control how the output of find is passed to md5sum. For example, 'find -type f | xargs -n1 md5sum' runs md5sum with one file name at a time, allowing md5sum to open that file and hash it. Note that xargs splits its input on whitespace by default, so file names containing spaces need extra care.
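
To see the difference for yourself, here is a small sketch that counts how many hash lines each form produces. It assumes GNU find, xargs and md5sum; the scratch directory and file names are made up for illustration. The -print0 and -0 options are not part of the commands above. They are added here so that file names containing spaces are passed through intact.

```shell
# Create a scratch directory with three small files (hypothetical names).
workdir=$(mktemp -d)
printf 'alpha\n' > "$workdir/a.txt"
printf 'beta\n'  > "$workdir/b.txt"
printf 'gamma\n' > "$workdir/c.txt"

# Without xargs: md5sum hashes the text of the file list itself (one hash line).
list_hash_lines=$(find "$workdir" -type f | md5sum | wc -l)

# With xargs -n1: md5sum is run once per file name (one hash line per file).
per_file_lines=$(find "$workdir" -type f -print0 | xargs -0 -n1 md5sum | wc -l)

echo "hash lines without xargs: $list_hash_lines"
echo "hash lines with xargs:    $per_file_lines"
```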

You can also tell xargs how many processes to run at the same time using the -P switch. Since we have 4 processors, we will run 4 processes in this case.
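
A minimal sketch of the -P switch in action is below. It assumes GNU xargs and md5sum, and uses a hypothetical scratch directory filled with random data. The results go to a pipe here, not directly to a shared file, and the hash lines are simply counted.

```shell
# Scratch data: eight 1 MiB files of random bytes (hypothetical names).
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  head -c 1048576 /dev/urandom > "$workdir/file$i.bin"
done

# Run up to 4 md5sum processes at once, one file name per process.
hashed=$(find "$workdir" -type f -print0 | xargs -0 -P4 -n1 md5sum | wc -l)

echo "files hashed: $hashed"
```

In bash, putting 'time' in front of the find pipeline reports the real, user and system times used for comparisons like the ones in this post.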

Recall from the last post that the times for hashing all files in a directory were as follows:
md5sum: real 3m46.117s
md5deep: real 3m57.595s
openssl: real 3m43.142s

Using xargs and all of the processors (example: 'find -type f | xargs -P4 -n1 md5sum'), the real times can be reduced to:
md5sum: real 1m57.196s
md5deep: real 2m5.408s
openssl: real 1m56.084s

md5deep recursive, default threads (4), no xargs:
real 2m0.521s

Note: the user and system times are the same as before because the same amount of processor time is being devoted to the tasks. However, the work is spread across more processors, so the real time is much less.

Also, consider that all the files in the directory are different sizes, with the largest being 10GB. The majority of the time for each program was spent with 1 thread hashing the largest file, while the other processors had already finished hashing all the other files.
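
One possible way to soften this effect, which was not tested in this post, is to hand xargs the largest files first so the longest job starts right away and does not end up last in the queue. This does not make hashing the largest file any faster; it only changes the order in which jobs start. The sketch assumes GNU find (-printf) and GNU xargs (-d), file names without newlines or tabs, and hypothetical scratch files.

```shell
# Scratch data: three files of clearly different sizes (hypothetical names).
workdir=$(mktemp -d)
head -c 1024    /dev/zero > "$workdir/small.bin"
head -c 1048576 /dev/zero > "$workdir/large.bin"
head -c 65536   /dev/zero > "$workdir/medium.bin"

# List "size<TAB>path", sort by size descending, then drop the size column.
ordered=$(find "$workdir" -type f -printf '%s\t%p\n' | sort -rn | cut -f2-)

# Feed the names to xargs in that order; jobs are started in input order.
printf '%s\n' "$ordered" | xargs -d '\n' -P4 -n1 md5sum

# The first name handed to xargs is the largest file.
first=$(printf '%s\n' "$ordered" | head -n 1)
echo "first job: $first"
```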
