Glen Pitt-Pladdy :: Blog

Filesystems & Schedulers with CompactFlash

I've been having disk performance problems with both my server and my workstation, so I decided it was time to sort it all out and get them tuned properly. Both of them get a real thrashing: the workstation has to handle me with ~20+ tabs open in a browser, loads of applications, playing and often editing media, running VMs, and much more, while the server handles video/TV, CCTV, some OpenVZ containers for serving sites and mail, plus a stack of monitoring.

It's very easy to say that I should just get faster machines, and that's often what people do. The trouble is that doesn't really solve the problem (efficiency). I've built extreme custom machines for specific applications before, but here I need general purpose machines which gracefully handle the huge diversity of tasks I throw at them and are versatile enough to do much more in the future. I also prefer to tune rather than brute-force problems - a situation that inherently doesn't scale can only be pushed so far with brute force.

If proof of this were needed, some tests here showed over a 60:1 ratio in some aspects of performance depending on the choice of filesystem and scheduler.

Strategy

Given the different types of problem, the first thing I have to accept is that there is no one-size-fits-all solution. Different things require different approaches.

It's always worth monitoring live performance with iostat:

# iostat -x sdb 10

That will monitor /dev/sdb at 10 second intervals. Note that the first set of data it outputs is the totals/averages since boot. Leave out the device spec and it will show all available devices; leave out the interval and it will only show the totals since boot. The -x option gives extended stats, which include the latency information that is vital for me.
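
A couple of variations on the same command cover the common cases (sdb here is just the example device again):

# iostat -x 10
# iostat -x sdb 10 6

The first gives extended stats for all devices at 10 second intervals; the second restricts output to /dev/sdb and stops after 6 reports.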

Cacti polling disk thrash

Every 5 minutes Cacti polls about 1000 (probably more) data sources and updates a bunch of .rrd files. The result is an extreme disk thrash that cripples the server momentarily. iostat shows the problem on the volume where the .rrd files are:

rrqm/s   wrqm/s  r/s    w/s    rsec/s  wsec/s   avgrq-sz  avgqu-sz  await   svctm  %util
0.00     0.00    0.00   20.40  0.00    220.80   10.82     8.92      664.04  4.59   9.36

Cacti polls all the data sources and then, when complete, updates all the .rrd files. That means we see one big disk thrash, and other disk users like databases suddenly have very high latency and log a stack of slow queries. In terms of actual IO bandwidth that's only 220 sectors/second, or about 110KBytes/second at 512 bytes per sector - really nothing at all! The problem is that it's a stack of seeks, which results in the huge average wait time.

The actual .rrd area used by Cacti is tiny: 118MB. The obvious solution here would be some sort of Flash device. I'm loath to fork out hundreds of £ on a relatively large and potentially unreliable SSD for this tiny area, but a cheap CompactFlash could do the trick. I've used CF for tiny platforms before and it stood up well - it looked like it should last >3 years of use on a machine that was logging intensively. I also designed and ran testing for fully custom Flash devices previously, which involved an accelerated ageing process and extreme numbers of write cycles - they went way beyond spec and none failed. Flash memory can be reliable. I suspect the problems with SSDs are more quality control, design and/or controller firmware bugs than inherent Flash reliability problems.

Benchmarks

The key thing to remember with all this is that CompactFlash is highly optimised for camera usage and these tests are really making it do stuff it was never designed for, hence the performance is very different to how it would behave in a camera.

Initially I benchmarked some cards: Lexar Professional 233x UDMA, Transcend 133x and a Duracell (made by Dane-Elec) 300x UDMA card - all 8GB.

Lexar, Transcend and Duracell CF cards tested

The main tests were done using a CF-IDE adaptor:

CF to IDE adaptor used

One thing to note is that some CF cards do not like driving any significant lengths of cable so an adaptor that goes directly into the IDE connector on the motherboard is a good choice. Also, beware of two-slot CF adaptors as some CF cards can't coexist on the same bus with a second device so the second slot may have to be left empty.

Seek tests were done with seeker.c, which gave some interesting results:

  • Lexar: 0.37 ms random access time
  • Transcend: 2.23 ms random access time
  • Duracell: 0.60 ms random access time

The curious thing here is the very slow random access time for the Transcend. This is Flash with no moving parts, yet it shows an unusually long seek time. That said, it's still much faster than virtually any mechanical disk.
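
If seeker.c isn't to hand, a very rough approximation of the same test can be put together with GNU dd doing single-sector direct reads at pseudo-random offsets (the device and the 15000000 sector range are just assumptions roughly matching an 8GB card, and $RANDOM isn't a great random source - seeker.c does the job properly):

# time for i in $(seq 1 1000); do dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=$((RANDOM*RANDOM%15000000)) iflag=direct 2>/dev/null; done

Dividing the elapsed time by 1000 gives a ballpark average access time, though it includes the overhead of forking dd for each read.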

In real use a degree of parallel access is always going to be happening, so I am testing with iozone using 4 threads and 4k read/write sizes to roughly match what iostat shows with real-world access.

The Achilles heel of CF is random write time and that's what matters for Cacti .rrd files. In this case I am testing with:

# iozone -T -t 4 -s 1m -r 4k -I -i0 -i1 -i2

This uses 4k operations to roughly match the request sizes shown by iostat, and 1M test files to avoid taking a really long time over testing. With the -I (direct IO) option caching/buffering is not going to skew results, so the file sizes are less important.

I repeated the tests with a USB reader as well for comparison.

The big surprise is just how dramatically the choice of filesystem and scheduler affects performance. I have compiled my tests into a spreadsheet along with results from a SATA disk and a Dane-Elec USB Flash drive.

Download: CF Performance tests

Filesystems and Schedulers for Compact Flash

There are two main IO schedulers for Linux: cfq and deadline, plus the seldom-used minimal noop scheduler. Some distros also have the anticipatory scheduler, though the deadline scheduler will almost always perform better under circumstances where anticipatory scheduling would be useful. Schedulers can be set by writing the scheduler name to the relevant node in /sys - eg. for /dev/sda you can set the scheduler with:

# echo deadline >/sys/block/sda/queue/scheduler
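
The scheduler currently in use can be checked by reading the same node, which lists the available schedulers with the active one in square brackets, something like:

# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]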

When you have decided on a scheduler and want it set at boot, the sysfsutils package provides a configuration file for this - add a line to /etc/sysfs.conf for each disk:

block/sda/queue/scheduler = deadline

Filesystem choice makes some huge differences in my tests. As I'm centring on small "random" writes for the Cacti updates, that is what I will look at.

Meet the real world

I started with the Duracell CF card as it has SMART, which is useful for knowing how it is wearing.

The same config (cfq & ext2) that was giving me random writes at nearly 1MB/s in the benchmarks (plenty for Cacti) was struggling to sustain 350KB/s when it was brought into operation with Cacti for real, and was reaching average wait times over 10 seconds. I then re-did testing using the time it took for Cacti to finish writing all the updates to the .rrd files. With cfq the results were:

  • ext2: 117s
  • ext3: 118s
  • ext4: 123s
  • ext4nj: 131s
  • jfs: 130s
  • xfs: 97s
  • xfslazy: 117s

This is quite a turn-up. While the lower overheads of ext2 had won in the benchmarks, in the real world it was convincingly beaten by straight xfs for this particular access pattern. The other interesting thing is that enabling the lazy option with xfs actually made it perform worse with the real-world data on CF.
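
For anyone wanting to repeat this kind of real-world timing rather than rely on synthetic benchmarks, Cacti's own log is a convenient starting point: each polling cycle is summarised in a SYSTEM STATS line which includes the total time taken. The log path below is an assumption for a Debian-style install, and the figure covers the whole cycle (polling as well as the .rrd updates), so it's only a rough proxy for the update time measured here:

# grep "SYSTEM STATS" /var/log/cacti/cacti.log | tail -5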

Further tests were done with xfs and schedulers:

  • xfs/cfq: 94-96s
  • xfs/deadline: 74-78s
  • xfs/noop: 103-130s

For comparison a SATA drive with jfs and deadline scheduling does the task in 32-45s. Interestingly, the SATA disk ended up doing a few high-speed bursts where it wrote tens of thousands of sectors/s, whereas the CF ended up with a backlog of queued requests and was often only doing about 200 sectors/s.

Although the Transcend card performed well enough for this application on the benchmarks, it also had very poor performance with the actual data and was significantly worse than the Duracell card.

The main thing learned here is that CF cards are designed for use in cameras, which drop relatively large single files at a time - this application runs completely against that. It's no surprise that the cards do not perform at all well under conditions they were not designed for.

Summary: Schedulers

This is not definitive, only my observations with the limited tests and tuning I have done for my own applications. Different configurations and applications may well benefit from different approaches, and the only way to be certain is to run your own trials, preferably with real access patterns. OS caching, buffering and network effects can also change the equation dramatically.

If overall bandwidth is the goal with single-process access then cfq appears to be a good choice in many cases. Total throughput is often slightly higher than with other schedulers in this scenario, but with slower storage and concurrent access, low-bandwidth processes are often starved in favour of high-bandwidth processes and overall throughput can suffer. This appears to be particularly true in the case of xfs, where severe imbalances occurred between the iozone threads with a significant impact on overall throughput. The combination of xfs with cfq appears to be a potential major problem where concurrent access is occurring, depending on the access pattern.

With storage that has slow seek times but is otherwise fast (eg. SATA drives), cfq can dramatically reduce the amount of seeking and be a significant advantage for overall throughput by allowing one process to dominate. The downside is that the other processes can become unresponsive. This is often seen on desktop machines as dramatic slow-downs, unresponsiveness and even brief freezes during times of high concurrent IO. This has been one of the complaints that comes up repeatedly in forums, and a combination of cfq with the filesystem and access patterns may well be the cause in many cases.

Deadline scheduling will often result in increased seeking when there is concurrent access, but this is not necessarily a problem with Flash storage, where "seek" times are good; with mechanical disks it may result in a significant reduction in total throughput.

This is highly dependent on the application, access pattern and device performance, and in some cases using the deadline scheduler may be advantageous even when it results in lower overall throughput, simply for the improved responsiveness where any sort of user interaction is involved (eg. desktops, web applications, database servers and more).

Summary: Filesystems

CF is very sensitive to the choice of filesystem and access patterns. The key thing that shows up here is that devices differ massively in behaviour, and sometimes relatively small differences in access patterns can have a massive impact on performance. As can be seen here, the benchmarks didn't match reality at all well: sometimes you will find there is no substitute for tuning with live data.

Again, these are just my observations based on my applications, and things may work very differently for you, your access patterns and your storage devices.

The surprise here is that for random write performance xfs may well turn out to be the best choice with CF, especially slower cards. For linear writes xfs seems to perform very badly with CF, but this is highly dependent on access patterns, so depending on your application results may be radically different.

Other filesystems may also show similarly unexpected behaviour but that is beyond the scope of this article.

As there seem to be no consistent rules here, the only thing I can suggest is testing with live data to determine the best filesystem for your particular situation.

As can be seen here, the popular belief that using the simplest filesystem (eg. ext2) gives the best performance with Flash may only apply in some cases with specific devices and access patterns. The device access pattern caused by the filesystem, interacting with different aspects of the device's performance, often has a far larger impact than the complexity and overall number of IO operations.

Winners & Losers

The bottom line here is that I decided to ditch the idea of using CF for areas of high random access. CF appears to be so highly optimised for camera use that access patterns like this simply do not perform in a predictable way that resembles the design performance of the card.

Far more effective has been leaving the Cacti .rrd files on a SATA disk (actually now a small RAID of SATA disks) but taking measures against fragmentation, as many of the .rrd files turned out to be highly fragmented. Just combating the fragmentation has made a huge difference and the performance is no longer a problem.
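
Spotting this kind of fragmentation is straightforward with filefrag (from e2fsprogs) - the path here is just an assumption for a typical Debian Cacti install:

# filefrag /var/lib/cacti/rra/*.rrd

Each file is reported with the number of extents it occupies; heavily fragmented files show up with large extent counts.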

Thoughts on Flash

The one thing that I think is becoming increasingly apparent, with the increasing use of Flash storage and the results here, is that we are force-fitting together parts of a system that simply do not work well together. I think this is often a symptom of transitions: during this "step along the way" things are a long way from optimum.

To a large extent the core problem is that all our current major filesystems and disk-access strategies are built around the way mechanical disk drives operate, and neither exploit the advantages of Flash nor avoid its disadvantages. Filesystems work hard to avoid fragmentation, but with Flash fragmentation is not an issue given the negligible seek times; meanwhile small writes scattered through different locations can result in many larger blocks being erased and re-written for each small write, with the consequent performance hit.

To a large extent what SSDs are doing is applying an abstraction layer to compensate for the behaviour of filesystems and access patterns optimised for mechanical disks, rather than the whole system being designed around taking advantage of Flash storage characteristics.

Hopefully the next few years will see operating systems responding to the changes and introducing filesystems (or maybe just tuning options) purposely designed to exploit Flash storage.