Our cluster’s main storage partition is an 18 TB RAID-6 array with 10 drives, attached to a “proper” NAS controller (with battery backup) and formatted with the XFS file system. You would certainly expect it to do better than a mere 46.6 MB/s on a sequential read.

Reading a 3.3 GB file and sending it straight to nowhere gives the following result on an otherwise idle system:

me@master-node:~$ srun -p storage dd if=/data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5  of=/dev/null bs=16M
193+1 records in
193+1 records out
3253807262 bytes (3.3 GB) copied, 69.7864 s, 46.6 MB/s

In previous tests I managed to get around 500 MB/s, which is one order of magnitude faster! So what is the reason for this discrepancy? Did I measure wrong? Nope. After searching for reasons why a file system can become slow, I found one likely culprit: fragmentation.
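By the way, to rule out the hardware, a baseline for the raw sequential throughput of the array can be measured by reading straight from the underlying block device (/dev/mapper/vg0-lv_data in our case) with the page cache bypassed. Something along these lines should do; the amount of data read is arbitrary, and the access is read-only:

nas2:~# dd if=/dev/mapper/vg0-lv_data of=/dev/null bs=16M count=256 iflag=direct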

Thing of the past?

Wait, isn’t that a thing from times long gone, like Windows 95? I never defragmented my drives on Linux!

Modern file systems like XFS use various techniques to prevent this kind of fragmentation. For more details I will refer the inclined reader to the nice Wikipedia article on file system fragmentation.
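One such technique that applications can also trigger explicitly is preallocation: reserving the full size of a file up front, so the allocator can look for one contiguous region instead of growing the file piecemeal. A minimal sketch on Linux, with a made-up file name, would be:

me@master-node:~$ fallocate -l 3G /data/rebecca/some_future_output.h5

Despite all this, there are circumstances in which these techniques fail.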

So: Nope, it isn’t a thing of the past.

More precisely, most of the time it is, but not always. It happens, for example, when the partition is very full and crowded: the file system can no longer find contiguous regions big enough to write a file in one single segment, so it has to fill up the holes left behind by delete and move operations. That is exactly what happened on our partition. Here is its current status:

nas2:~# df -h /srv/data
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv_data   16T   16T  920G  95% /srv/data

Unfortunately, whenever somebody frees up some space, somebody else uses it up again. That is why we have the cluster, after all: to create data. This state has been more or less permanent for about two years now.
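To get an impression of how chopped up the remaining free space is, the xfs_db tool (which we will meet again below) also has a freesp command that prints a histogram of free extent sizes. Something like this should work; running xfs_db read-only on a mounted file system does not give a perfectly consistent picture, but it is fine for a rough impression:

nas2:~# xfs_db -r -c freesp /dev/mapper/vg0-lv_data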

After some go(o)ggling around I indeed found a defragmentation program for XFS! It is called xfs_fsr, and on Debian it is found in the xfsdump package, so getting it should be no more than:
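nas2:~# apt-get install xfsdump

But do we really need it?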

How to check for fragmentation

I found two ways to check for fragmentation. One is to use the xfs_db program:

root@goofy:~# df -h /dev/sda3
Filesystem             Size  Used Avail Use% Mounted on
/dev/sda3             119G   45G   75G  38% /home
root@goofy:~# xfs_db -c frag -r /dev/sda3
actual 66018, ideal 63830, fragmentation factor 3.31%

This was run on a small disk from a desktop. You see that it is hardly fragmented.

Regrettably, it would take forever to check the 16 TB RAID array just to be told that there is a lot of fragmentation. So let’s try another tack: the xfs_bmap tool from the xfsprogs package [1]. Here is the description from the man page:

xfs_bmap prints the map of disk blocks used by files in an XFS file system. The map lists each extent used by the file, as well as regions in the file that do not have any corresponding blocks (holes). Each line of the listings takes the following form:

extent: [startoffset..endoffset]: startblock..endblock

Holes are marked by replacing the startblock..endblock with hole. All the file offsets and disk blocks are in units of 512-byte blocks, no matter what the file system’s block size is.
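To give a feel for the units: an extent listed as [0..2047] thus covers 2048 of these 512-byte blocks, i.e. 2048 × 512 B = 1 MiB of the file.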

Now let’s see what we get for our file:

nas2:~# xfs_bmap -v /data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5
/data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5:
 EXT: FILE-OFFSET         BLOCK-RANGE              AG AG-OFFSET                TOTAL FLAGS
   0: [0..4119]:          11693362688..11693366807  5 (955947008..955951127)    4120 00111
   1: [4120..8239]:       11696964096..11696968215  5 (959548416..959552535)    4120 00111
   2: [8240..12359]:      11896885760..11896889879  5 (1159470080..1159474199)  4120 00111
...
1230: [6300152..6327359]: 17358946304..17358973511  8 (179081216..179108423)   27208 00111
1231: [6327360..6354815]: 20636416000..20636443455  9 (1309067776..1309095231) 27456 00101
1232: [6354816..6355095]: 20636709312..20636709591  9 (1309361088..1309361367)   280 01111

This means the file is spread over 1233 extents! Well, that could certainly explain the abysmal performance of the sequential read, if you can still call it sequential. But does it?
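By the way, if you want to hunt for other badly fragmented files below /srv/data, a rough (and slow, since it walks every file) sketch along these lines should work; the *.h5 pattern is just an example, and holes are counted as lines too:

nas2:~# find /srv/data -type f -name '*.h5' \
          -exec sh -c 'echo "$(xfs_bmap "$1" | tail -n +2 | wc -l) $1"' _ {} \; \
          | sort -rn | head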

The nice thing about xfs_fsr is that you can defragment individual files with it! Just pass the file name as an argument:
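nas2:~# xfs_fsr -v /data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5

(The -v flag just makes it report what it is doing; take the exact invocation as a sketch.) After it is done we want to check whether it was really worth it. In order to measure the read performance we should first drop the cache. According to the Linux Kernel Documentation you should sync before dropping the caches: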

drop_caches

Writing to this will cause the kernel to drop clean caches, as well as reclaimable slab objects like dentries and inodes. Once dropped, their memory becomes free.

To free pagecache:

echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes):

echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache:

echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects. To increase the number of objects freed by this operation, the user may run `sync’ prior to writing to /proc/sys/vm/drop_caches. This will minimize the number of dirty objects on the system and create more candidates to be dropped.

So let’s just follow the manual and drop the caches:

root@nas2:~# sync ; echo 3 | tee /proc/sys/vm/drop_caches

Then we can start the read again:

markusro@master-node:~$ srun -p storage dd \
  if=/data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5 \
  of=/dev/null 
6355092+1 records in
6355092+1 records out
3253807262 bytes (3.3 GB) copied, 8.58702 s, 379 MB/s

That’s more in the expected ballpark. Now we will let xfs_fsr run on the whole partition for the next 10 hours. The default timeout is 2 hours, so we could also let it run regularly as a cron job (a sketch of that follows below); it even remembers where it left off the last time! But for now, just let it run once:

nas2:~# xfs_fsr -t 36000 /srv/data
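If we later decide to run it regularly, a crontab entry along these lines should do; the schedule is just an example, and the binary path is the usual Debian location:

# /etc/cron.d/xfs_fsr: defragment the data partition every night at 03:00,
# for at most the default two hours (7200 seconds)
0 3 * * *  root  /usr/sbin/xfs_fsr -t 7200 /srv/data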

Just as a curiosity: without dropping the cache we get a much higher rate, because the file is still in RAM:

markusro@master-node:~$ srun -p storage dd \
	if=/data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5  \
	of=/dev/null 
6355092+1 records in
6355092+1 records out
3253807262 bytes (3.3 GB) copied, 3.75013 s, 868 MB/s

Finally, let’s check the output of xfs_bmap:

nas2:~# xfs_bmap -v /data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5
/data/rebecca/EG_TIP4P_kristPore/T300/hydrophobeWand/H5Skript/hydIndex300_0_5.h5:
 EXT: FILE-OFFSET      BLOCK-RANGE              AG AG-OFFSET              TOTAL FLAGS
   0: [0..6355095]:    32249392672..32255747767 15 (37145632..43500727) 6355096 01111

That means the tool has indeed managed to put the file into a single extent. Let’s hope the improved performance will speed up our analyses…

  [1] I found this nice article about these tools: Use xfs_fsr to keep your XFS file system optimal