Friday 4 January 2013

Tweaking


After playing around with the individual drive and array defaults, I found that performance can be improved substantially with a few easy tweaks.

Useful resources

I have looked long and hard for formulas that can be used to obtain results that make sense, but ultimately it comes down to testing each value and running a benchmark to 'measure' the difference it makes in your environment.

A few settings that can be adjusted are listed below, in the order in which I would apply them during benchmark testing: starting with the settings that had the biggest impact and ending with those that had a smaller impact, according to my findings.

Settings applied to the mdadm RAID array with defaults on my system (Ubuntu 12.04):

Summary of mdadm RAID array settings

Command to apply setting                          | Default value | Tweaked value | Description
blockdev --setra 102400 /dev/md126                | 8192          | 102400        | Read-ahead
echo 5120 > /sys/block/md126/md/stripe_cache_size | 256           | 5120          | Stripe cache size
echo 100000 > /proc/sys/dev/raid/speed_limit_max  | ?             | 100000        | Max resync speed

It is important to keep in mind that the stripe_cache_size will use a portion of RAM. For example, at the largest value I tested (32768), an mdadm RAID array such as mine will use:

stripe_cache_size * block size * number of disks
= 32768 * 4 KB * 4 (active disks)
= 512 MB of RAM

In my case I have 4 GB of RAM and the machine only performs pretty basic functions, so it is of little concern.
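
Before changing anything it is worth recording the current values so they can be restored later. A quick check, assuming the array is /dev/md126 as on my system:

    # Read-ahead on the array, in 512-byte sectors
    blockdev --getra /dev/md126

    # Current stripe cache size, in pages per device
    cat /sys/block/md126/md/stripe_cache_size

    # Current resync speed limit
    cat /proc/sys/dev/raid/speed_limit_max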

Settings applied to each drive with defaults on my system (Ubuntu 12.04):

Summary of individual drive settings

Command to apply setting                       | Default value       | Tweaked value | Description
echo 1 > /sys/block/sdX/device/queue_depth     | 31                  | 1             | NCQ queue depth
echo 64 > /sys/block/sdX/queue/nr_requests     | 128                 | 64            | Number of requests
echo deadline > /sys/block/sdX/queue/scheduler | noop deadline [cfq] | deadline      | I/O scheduler
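
As an aside, the scheduler default in the table looks odd because the kernel lists every available scheduler when the file is read, with the active one in brackets:

    # cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]
    # echo deadline > /sys/block/sda/queue/scheduler
    # cat /sys/block/sda/queue/scheduler
    noop [deadline] cfq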

After hours of testing and a massive spreadsheet, I have values that provide substantial performance gains. Here are some benchmark tests.



Before I apply the values persistently, I will reset them by restarting the machine.




The key benchmark in the iozone test is stride read, so let's compare that: 2657627 before vs 2818608 after. The dd test went from 150 MB/s before to 236 MB/s after.
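
For anyone repeating the dd test, a run along these lines measures sequential write and read throughput (the mount point and file size are my assumptions here; pick a test size larger than RAM so caching does not skew the numbers):

    # Sequential write: fdatasync makes dd flush to disk before reporting the rate
    dd if=/dev/zero of=/mnt/raid/ddtest bs=1M count=8192 conv=fdatasync

    # Sequential read: drop the page cache first so the disks are actually hit
    sync
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/raid/ddtest of=/dev/null bs=1M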

Let's look at the bonnie output. I will spend the most energy on this, as I think it is the most informative benchmark. As expected, the sequential block output is similar to the dd output; the dd tests were actually redundant, since the same results are contained in the bonnie output, but for the sake of thoroughness I did both.
Sequential block input, or read, isn't much improved by the tweaking. This is not ideal, although it is the reality, unfortunately. Reads and writes are a balancing act: a read performance improvement will usually cause a reduction in write performance.
Sequential block rewrite reads data and then writes it back, so it is essentially read and write performance combined. In this case, 103900 with defaults and 136714 with the tweaks in place.
Random seeks measure how many random seeks per second bonnie can perform, in this case 519 before vs 404 after.
A result of +++++ means the test completed so quickly that the error margin is a sizeable percentage of the measurement, and the result is therefore inaccurate.
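
For reference, a bonnie run along these lines produces the output discussed above (the mount point and label are my assumptions; -s sets the test file size in MB and should be at least double RAM, so 8192 on my 4 GB machine):

    # -d test directory, -s file size in MB, -m label for the results row,
    # -u user to run as (bonnie++ refuses to run as root without it)
    bonnie++ -d /mnt/raid -s 8192 -m md126test -u someuser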



Here is a discussion of a script that tweaks an mdadm RAID array automatically. I used it as a reference, although I found that some of the settings mentioned there did not make a great difference:
http://ubuntuforums.org/showthread.php?t=1916607

I found that the best way to tweak was to choose a baseline based on manual tweaking and testing, then run through a number of values for one setting at a time and compare the results. I used a small script to save some time.
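
Something along these lines; a sketch of the idea, assuming stripe_cache_size is the setting being swept, bonnie++ as the benchmark, and the mount point and user as placeholders:

    #!/bin/bash
    # Sweep one setting through a range of values and benchmark each one,
    # leaving the other settings at the chosen baseline. Run as root.
    for size in 256 1024 2048 5120 8192 16384 32768; do
        echo $size > /sys/block/md126/md/stripe_cache_size
        echo "=== stripe_cache_size: $size ==="
        bonnie++ -d /mnt/raid -s 8192 -m "scs-$size" -u someuser
    done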



In order to analyse the output I used simple greps like the ones below. I ran the benchmark under screen and logged all the screen output with the -L option.
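
For example, assuming screen's default log name screenlog.0 and the scs- labels from the sweep sketch above:

    # Which value was being tested at each point
    grep "stripe_cache_size:" screenlog.0

    # bonnie++ ends each run with a comma-separated summary line tagged
    # with the -m label; grab those for pasting into a spreadsheet
    grep "^scs-" screenlog.0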



Importing this into Excel and using conditional formatting makes digging through the numbers easier. It is clear from the numbers below that there is no one-size-fits-all solution; sequential input and output are an example of results that play off against each other.

Another interesting observation is the large impact the /sys/block/sdX/queue/scheduler setting has on sequential block input and output.

Also, it is useful to note that more cache isn't always better.


My choice has been made, and I will use a script to configure the array after a reboot.
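
A sketch of that script, reconstructed from the tables above; the member drives sda through sdd are my assumption, so adjust them for your own system:

    #!/bin/bash
    # Re-apply the chosen mdadm RAID tweaks after a reboot. Run as root.

    # Array-level settings
    blockdev --setra 102400 /dev/md126                 # read-ahead
    echo 5120 > /sys/block/md126/md/stripe_cache_size  # stripe cache
    echo 100000 > /proc/sys/dev/raid/speed_limit_max   # max resync speed

    # Per-drive settings
    for drive in sda sdb sdc sdd; do
        echo 1 > /sys/block/$drive/device/queue_depth  # disable NCQ
        echo 64 > /sys/block/$drive/queue/nr_requests
        echo deadline > /sys/block/$drive/queue/scheduler
    done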



In the next post I plan to make these values persistent.


1 comment:

  1. Nice work. I'm thinking about the performance drop seen when 16384 and 32768 are being used as values for stripe_cache_size. Could it be your system started swapping at that point?
