It’s common in cluster environments like HPC for processes to produce Zillions of Tiny (ZOT) files, often when running job arrays via the scheduler. Users frequently use the filename itself as a crude checkpointing or workflow-control mechanism. These files are often only a few bytes long, and tens of thousands of them can be created during a single run. The problem is that the overhead of performing disk operations on these files (reads, writes, directory listings, directory transfers) becomes overwhelming to the system, leading to rare cases where the OS or applications time out waiting for these file operations to complete.
This document addresses the problem of ZOT files and provides solutions that use file locking to allow many jobs to write to the same file without loss of data.
In a cluster analysis, it’s not uncommon for users to produce multiple directories of 20,000+ files where each file is only a few bytes. When performing disk operations on these directories (reads, writes, listings, copies, and moves), the overhead is extremely costly when multiplied across every file they contain. In extreme cases, the file system may run out of spare inodes even though the disk is not full.
It is common with clusters that use distributed file systems like GlusterFS or Lustre for multiple compute nodes to read from and write to a single directory, or even a single file, simultaneously. When the file system receives tens of thousands of requests from multiple nodes, delays propagate during synchronization and may cause some jobs to fail to read or write correctly due to default timeout periods. Individually these failures are rare, but they become frequent in aggregate because of the sheer number of jobs being run.
It is strongly suggested that, instead of using ZOT files as a synchronization or logging technique, programs be written to use a single output file and to control access to it with file locking.
File locking is a file-system-level flag that allows only a single process to read or write a file at one time. For example, you can set this flag so a process cannot write to a file that is already being written to, ensuring that one process has finished writing before another starts. By using file locking, we can guarantee mutual write exclusion, or ‘mutex’, so that no two processes write to the same file at the same time. When multiple processes on the same (or even different) compute node(s) try to write to a file that is locked, they all wait their turn until the access flag is unlocked by the process that locked it.
File locking guarantees that when running a large number of array jobs, each write completes in full and no two jobs write to the file simultaneously.
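That waiting behavior is easy to see in a small stand-alone test. The sketch below is only an illustration (the /tmp/flock-demo.lock path is a scratch placeholder, not part of the workflow that follows): it forks a child whose flock call does not return until the parent has released its exclusive lock.
#!/usr/bin/env perl
# Demonstration: a second process blocks on flock until the first releases the lock.
use strict;
use warnings;
use Fcntl qw(:flock);

$| = 1;                              # unbuffer STDOUT so the messages appear in order
my $file = "/tmp/flock-demo.lock";   # placeholder scratch file

open(my $fh, ">>", $file) or die "Could not open $file: $!";
flock($fh, LOCK_EX) or die "Could not lock $file: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    close $fh;                       # drop the handle inherited from the parent
    open(my $cfh, ">>", $file) or die "Could not open $file: $!";
    print "child: waiting for the lock...\n";
    flock($cfh, LOCK_EX) or die "Could not lock $file: $!";   # blocks here
    print "child: got the lock\n";
    close $cfh;
    exit 0;
}

print "parent: holding the lock for 3 seconds\n";
sleep 3;
close $fh;                           # releases the lock; the child proceeds
waitpid($pid, 0);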
In Perl, file-locking ability is provided by the built-in flock function. By using flock, we can serialize access at the file-system level, eliminating the ZOT problem and guaranteeing that only one process writes to a file at a time. The available flock parameters are listed below.
Code | Meaning
---|---
1 | Shared Lock
2 | Exclusive Lock
4 | Non-blocking
8 | Unlock
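These numeric codes correspond to the symbolic constants exported by Perl's core Fcntl module (LOCK_SH, LOCK_EX, LOCK_NB, LOCK_UN), and using the names is usually clearer than the raw numbers. A minimal sketch of the basic pattern (the file name results.txt is just a placeholder):
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl qw(:flock);   # exports LOCK_SH (1), LOCK_EX (2), LOCK_NB (4), LOCK_UN (8)

open(my $fh, ">>", "results.txt") or die "Could not open results.txt: $!";
flock($fh, LOCK_EX) or die "Could not lock results.txt: $!";   # same as flock($fh, 2)
print $fh "one line of output\n";
close $fh;   # closing the handle releases the lock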
Below is a sample Perl script that can write data to a single file from all of our array jobs. The important parts are the flock(DATA, 2) call, which takes an exclusive lock, and the close DATA; statement once the data has been written, which releases it. The script attempts to write to a file; if the file is already locked, it waits until the lock becomes available. After acquiring the lock, it writes the data, closes the file, and thereby surrenders the lock to the next process.
#!/usr/bin/env perl
if (($#ARGV + 1) == 2) {
    $sharedFile = $ARGV[0];
    $data       = $ARGV[1];
} else {
    print "$0 requires two parameters, filename and data.\n";
    exit 1;
}
writeToFile();

sub writeToFile {
    open(DATA, ">>", "$sharedFile") or die("Could not open $sharedFile");
    flock(DATA, 2) or die("Could not lock $sharedFile");
    # Pb(1): take the exclusive lock - I am writing; blocks while another process holds it
    print DATA "$data\n" or die("Could not write to $sharedFile");
    close DATA; # Vb(1): flush, unlock the file, and release it to the next waiting process
}
exit 0;
The above script takes two arguments, the output file and the data string, then waits for the lock and writes once it has acquired it. Another example that may work better for piped output is sketched further below.
A use case for this kind of script would be to collect all necessary output from your workflow and hand it to the above program, passing it as a string in the second argument; the first argument is the path/name of the file to write to. So rather than writing 1000 files with names that encode parameters (all too common), such as 1_var1_var2_var3_var4.data
, each containing only a few hundred bytes of actual data, you could compose an output string like this:
date:time:var1:var2:var3:var4:data1:data2:data3:etc
and then have all 5000 of your array jobs write to ONE file like this:
/path/to/your/app | skel-flockwrite.pl /path/to/output/file/name
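Note that the pipeline form above implies a writer that reads its data from standard input rather than from its second argument. A minimal sketch of such a variant follows; the behavior of the real skel-flockwrite.pl may differ, so treat this as an illustration only.
#!/usr/bin/env perl
# Sketch of an STDIN-driven variant: append everything piped in to one shared
# file, holding an exclusive lock for the duration of the write.
use strict;
use warnings;
use Fcntl qw(:flock);

die "usage: $0 /path/to/output/file\n" unless @ARGV == 1;
my $sharedFile = $ARGV[0];

open(my $out, ">>", $sharedFile) or die "Could not open $sharedFile: $!";
flock($out, LOCK_EX) or die "Could not lock $sharedFile: $!";   # wait our turn
while (my $line = <STDIN>) {
    print $out $line or die "Could not write to $sharedFile: $!";
}
close $out;   # flush and release the lock for the next job
With a variant like this, the pipeline shown above works as written; with the two-argument script shown earlier, you would instead pass the program's output as a string, e.g. via shell command substitution in the second argument.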
When you write your qsub scripts, all jobs would use the same output file, taking turns to write to it.
While this sounds slightly insane (a traffic jam on steroids), it works surprisingly well on the HPC cluster at UCI, with the locks rarely causing refusals.
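If you would rather have a job retry and eventually give up than block indefinitely on a busy file, the Non-blocking code from the table above (4, LOCK_NB) can be combined with the exclusive lock. A minimal sketch, assuming a simple retry-with-sleep policy of our own choosing:
#!/usr/bin/env perl
# Sketch: take an exclusive, non-blocking lock, retrying a few times
# before giving up, instead of blocking until the lock is free.
use strict;
use warnings;
use Fcntl qw(:flock);

my ($sharedFile, $data) = @ARGV;
die "usage: $0 filename data\n" unless defined $data;

open(my $out, ">>", $sharedFile) or die "Could not open $sharedFile: $!";

my $locked = 0;
for my $attempt (1 .. 10) {                      # retry limit is arbitrary
    if (flock($out, LOCK_EX | LOCK_NB)) {        # numerically, 2 | 4 = 6
        $locked = 1;
        last;
    }
    sleep 1;                                     # back off briefly before retrying
}
die "Gave up waiting for the lock on $sharedFile\n" unless $locked;

print $out "$data\n" or die "Could not write to $sharedFile: $!";
close $out;                                      # releases the lock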