Linux – How to test the effect of ionice (against a device using the cfq scheduler)

ionicelinuxperformance-tuningschedulerstorage

I'm trying to construct an experiment to measure the effect of ionice. What I'd like to do is to (per another answer on serverfault) cause frequent enough I/O that a sufficiently "niced" process is starved of any I/O.

Based on another answer on serverfault, I think I need to cause at least one actual I/O operation to a common cfq-scheduled device every 250ms. My thought was to write a trivial program that does has a loop that

writes to a (configurable) file on the common device,
does an fsync() (to force a definite I/O operation),
uses usleep() to delay a configurable amount of time
periodically uses lseek() to truncate the file (so that I don't fill the file system)

I then start up one instance of the program using ionice -c3 (idle scheduling class) against one file on the common device. I simultaneously run various instances with the default (best-effort) scheduling class, specifying a different file on the common device (varying the delay values).

My hypothesis was that for delay values of 250ms or more on the "best-effort" process, I would see progress made by the "idle" process; for values less than 250ms, I would see little to no progress made by the "idle" process.

My observation was that there was no difference in performance of the two processes; they both made similar progress. Just to be sure (in case the wall-clock indications that "best-effort" process was performing I/O much faster than every 250ms), I started multiple simultaneous instances of the "best-effort" process, specifying no (zero) delay. Still, I saw no difference in performance between the processes in the two scheduling classes.

Just to be sure, I re-checked the scheduler class:

$ cat /sys/block/xvda/queue/scheduler
noop anticipatory deadline [cfq]

What is it that I'm missing about how the cfq scheduler works?

If it matters, this is on a 2.6.18 kernel.

Best Answer

I'd try measuring the effect by using a load generator like stress -i n or stress -d n, where "n" is the number of processes. Run that in one window. Try nmon or iostat in another and try a representative application process on the same block device. See how service time changes in iostat with various ionice settings (or test response from within your app).

As for cfq, there seemed to be changes throughout the RHEL5 lifecycle (on 2.6.18). It was noticeable enough on my application servers that I had to move to the noop and deadline elevators because of contention issues.

Related Solutions

HFSC Scheduling – Understanding HFSC Scheduling in Linux/BSD

Reading the papers on HFSC and its cousins is not a good way of understanding it. The primary goal of the HFSC paper is to provide a rigorous mathematical proof of its claims, not explaining how it works. Indeed you can't understand how it works from the HFSC paper alone, you have to read the papers it references as well. If you have some problem with the claim or their proofs then contacting the authors of the papers might be a good idea, otherwise I doubt they will be interested in hearing from you.

I have written a tutorial for HFSC. Read it if my explanations below aren't clear.

What for do I need a real-time curve at all? Assuming A1, A2, B1, B2 are all 128 kbit/s link-share (no real-time curve for either one), then each of those will get 128 kbit/s if the root has 512 kbit/s to distribute (and A and B are both 256 kbit/s of course), right? Why would I additionally give A1 and B1 a real-time curve with 128 kbit/s? What would this be good for? To give those two a higher priority? According to original paper I can give them a higher priority by using a curve, that's what HFSC is all about after all. By giving both classes a curve of [256kbit/s 20ms 128kbit/s] both have twice the priority than A2 and B2 automatically (still only getting 128 kbit/s on average)

The real time curve and the link sharing curve are evaluated in different ways. The real time curve takes into account what a class has done throughout it's entire history. It must do that to provide accurate bandwidth and latency allocation. The downside it isn't what most people intuitively think of as fair. Under real time, if a class borrows bandwidth when nobody else wants it, it is penalised when someone else wants it back later. This means under the real time guarantee a class can get no bandwidth for a long period because it used it earlier, when nobody wanted it.

The link sharing is fair, in that it doesn't penalise a class for using spare bandwidth. However this means it can't provide strong latency guarantees.

The separation of link sharing from providing latency guarantees is the new thing HFSC brings to the table, and the paper says as much in it's opening sentence: In this paper, we study hierarchical resource management models and algorithms that support both link-sharing and guaranteed real-time services with decoupled delay (priority) and bandwidth allocation. The key word in that sentence is decoupled. Translated, it means you are expected say how unused bandwidth is to be shared with ls, and specify what real time guarantees (aka latency guarantees) are needed with rt. The two are orthogonal.

Does the real-time bandwidth count towards the link-share bandwidth?

Yes. In the HFSC paper they define bandwidth in terms of what the class has sent since the class has become backlogged (ie has packets waiting to be sent). The only difference between rt and ls is rt looks forward from every time time class became backlogged and computes the lowest guaranteed bandwidth from that set, whereas link sharing only looks from the last time the class became backlogged. So rt takes more bytes into consideration than ls, but every byte ls considers is also considered by rt.

Is upper limit curve applied to real-time as well, only to link-share, or maybe to both?

The upper limit does not stop packets from being sent to satisfy the real time condition. Packets sent under the real time condition still count towards the upper limit, and thus may well delay a packet being sent under the link sharing condition in the future. This is another difference between real time and link share.

Assuming A2 and B2 are both 128 kbit/s, does it make any difference if A1 and B1 are 128 kbit/s link-share only, or 64 kbit/s real-time and 128 kbit/s link-share, and if so, what difference?

Yes, it does make a difference. As explained above if you use real time, the latencies are guaranteed but the link is not shared fairly (and in particular the class could suffer bandwidth starvation), and the upper limits are not enforced. If you use link share the latency is not guaranteed, but link sharing is fair, there is no risk of starvation, and upper limit is enforced. Real time is always checked before link share, however this doesn't necessary mean the link share will be ignored. That's because packets are counted differently. Since history is considered by real time a packet may not needed to meet the real time guarantee (because of one counted included from the history) but is needed to satisfy link share because it ignores the historical packet.

If I use the seperate real-time curve to increase priorities of classes, why would I need "curves" at all? Why is not real-time a flat value and link-share also a flat value? Why are both curves? The need for curves is clear in the original paper, because there is only one attribute of that kind per class. But now, having three attributes (real-time, link-share, and upper-limit) what for do I still need curves on each one? Why would I want the curves shape (not average bandwidth, but their slopes) to be different for real-time and link-share traffic?

The curve for real time controls allows you to trade tight latency for one particular traffic class (eg VOIP) for poor latency for other traffic classes (eg email). When deciding which packet to send under the real time constraint HFSC chooses the one that will complete sending first. However, it doesn't use the bandwidth of the link to compute this, it uses the bandwidth allocated by the real time curve. Thus if we have VOIP and email packets of the same length, and the VOIP packet has a convex curve that gives if a 10 fold boost over the concave curve for email, then 10 VOIP packets will be sent before the fist email packet. But this only happens for the burst time, which should be at most the time it takes to send a few packets - ie milli seconds. Thereafter the VOIP curve should flatten out, and VOIP will get no future boost unless it backs off for a while (which, given VOIP is a low bandwidth application, it should). The net result of all this is to ensure those first couple of VOIP packets are sent faster than anything else, thus giving VOIP low latency at the expense of email getting high latency.

As I said earlier, because link share does not look at history it can't give latency guarantees. A rock solid guarantee is what is needed for real time traffic like VOIP. However, on average a link shared convex curve will still provide a latency boost to its traffic. It's just not guaranteed. That's fine for most things, like web traffic over email.

According to the little documentation available, real-time curve values are totally ignored for inner classes (class A and B), they are only applied to leaf classes (A1, A2, B1, B2). If that is true, why does the ALTQ HFSC sample configuration (search for 3.3 Sample configuration) set real-time curves on inner classes and claims that those set the guaranteed rate of those inner classes? Isn't that completely pointless? (note: pshare sets the link-share curve in ALTQ and grate the real-time curve; you can see this in the paragraph above the sample configuration).

The documentation is correct. The hierarchy (and thus inner nodes) has no effect on the computation of real time whatsoever. I can't tell you why ALTQ evidently thinks it does.

Some tutorials say the sum of all real-time curves may not be higher than 80% of the line speed, others say it must not be higher than 70% of the line speed. Which one is right or are they maybe both wrong?

Both are wrong. If 70% or 80% of your traffic has hard latency requirements that must be specified with real time, the reality is you almost certainly you can't satisfy them with the link you are using. You need a faster link.

The only way someone could think specifying 80% of the traffic should be real time is if they were treading real time as a priority boost. Yes, in order to provide latency guarantees you are in effect boosting the priority of some packets. But it should only be for milliseconds. That's all a link can cope with and still provide the latency guarantees.

There is very little traffic that needs latency guarantees. VOIP is one, NTP another. The rest should all be done with link sharing. It you want web to be fast, you make it fast by allocating it most of the links capacity. That share is guaranteed over the long term. It you want it to be low latency (on average), give it a convex curve under link sharing.

One tutorial said you shall forget all the theory. No matter how things really work (schedulers and bandwidth distribution), imagine the three curves according to the following "simplified mind model": real-time is the guaranteed bandwidth that this class will always get. link-share is the bandwidth that this class wants to become fully satisfied, but satisfaction cannot be guaranteed. In case there is excess bandwidth, the class might even get offered more bandwidth than necessary to become satisfied, but it may never use more than upper-limit says. For all this to work, the sum of all real-time bandwidths may not be above xx% of the line speed (see question above, the percentage varies). Question: Is this more or less accurate or a total misunderstanding of HSFC?

It is a good description upper limit. While the link share description is strictly accurate, it is misleading. While it's true link share doesn't give the hard latency guarantees like real time does, it does a better job of giving the class its bandwidth its allocation than its competitors, like CBQ and HTB. So in saying link share "doesn't provide guarantees" it is holding it to a standard higher that any other scheduler can provide.

Real time's description manages to be both accurate, but so misleading I'd call it wrong. As the name implies real time's purpose is not to give guaranteed bandwidth. It is to provide guaranteed latency - ie the packet is sent NOW, not some random amount of time later depending on how the link is used. Most traffic does not need guaranteed latency.

That said, real time doesn't give perfect latency guarantees either. It could, if the link wasn't managed by link share as well, but most users want the additional flexibility of having both and it doesn't come for free. Real time can miss it's latency deadline by the time it takes to send one MTU packet. (If it happens, it will be because it was a MTU packet link share snuck in while real time keeping the link idle in case it was given packet with a short deadline that had to be sent immediately. This is yet another difference between link share and real time. In order to provide its guarantees real time may deliberately keep the line idle even though there are packets to send, thus provided less than 100% link utilisation. Link share always uses 100% of the links available capacity. Unlike real time, it can be said to be "work preserving".)

The reason real time can be said to offer hard latency guarantees is the delay is bounded. So if you are trying to offer a 20ms latency guarantee on a 1Mbit/sec link, and link share is sending MTU sized (1500 byte) packets, you know one of those packets will take 12ms to send. Thus if you tell real time you want a 8ms latency, you can still meet the 20ms deadline - with an absolute guarantee.

And if assumption above is really accurate, where is prioritization in that model? E.g. every class might have a real-time bandwidth (guaranteed), a link-share bandwidth (not guaranteed) and an maybe an upper-limit, but still some classes have higher priority needs than other classes. In that case I must still prioritize somehow, even among real-time traffic of those classes. Would I prioritize by the slope of the curves? And if so, which curve? The real-time curve? The link-share curve? The upper-limit curve? All of them? Would I give all of them the same slope or each a different one and how to find out the right slope?

There isn't a prioritisation model. Seriously. If you want to give traffic absolute priorities, use pfifo. That is what it is for. But also beware that if you give web traffic absolute priority over email traffic, that means letting web traffic saturate the link and thus no email packets getting through, at all. All your email connections then die.

In reality no one wants that sort of prioritisation. What they want is what HFSC provides. If you actually have real time traffic, HFSC provides that. But it will be stuff all. For the rest, HFSC allows you to say "on a congested link, allocate 90% to web and let email trickle along at 10%, and oh send the odd DNS packet quickly but don't let someone DOS me with it."

Linux Hardware RAID Controller Tuning – Real-World Guide

I've found that when I've had to tune for lower latency vs throughput, I've tuned nr_requests down from it's default (to as low as 32). The idea being smaller batches equals lower latency.

Also for read_ahead_kb I've found that for sequential reads/writes, increasing this value offers better throughput, but I've found that this option really depends on your workload and IO pattern. For example on a database system that I've recently tuned I changed this value to match a single db page size which helped to reduce read latency. Increasing or decreasing beyond this value proved to hurt performance in my case.

As for other options or settings for block device queues:

max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the max_hw_sectors_kb (RO) file in sysfs to see what's allowed)

nomerges = this lets you disable or adjust lookup logic for merging io requests. (turning this off can save you some cpu cycles, but I haven't seen any benefit when changing this for my systems, so I left it default)

rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs

If this option is '1', the block layer will migrate request completions to the cpu "group" that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion processing setting this option to '2' forces the completion to run on the requesting cpu (bypassing the "group" aggregation logic)"

scheduler = you said that you tried deadline and noop. I've tested both noop and deadline, but have found deadline win's out for the testing I've done most recently for a database server.

NOOP performed well, but for our database server I was still able to achieve better performance adjusting the deadline scheduler.

Options for deadline scheduler located under /sys/block/{sd,cciss,dm-}*/queue/iosched/ :

fifo_batch = kind of like nr_requests, but specific to the scheduler. Rule of thumb is tune this down for lower latency or up for throughput. Controls the batch size of read and write requests.

write_expire = sets the expire time for write batches default is 5000ms. Once again decrease this value decreases your write latency while increase the value increases throughput.

read_expire = sets the expire time for read batches default is 500ms. Same rules apply here.

front_merges = I tend to turn this off, and it's on by default. I don't see the need for the scheduler to waste cpu cycles trying to front merge IO requests.

writes_starved = since deadline is geared toward reads the default here is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.

Best Answer

Related Solutions

HFSC Scheduling – Understanding HFSC Scheduling in Linux/BSD

Linux Hardware RAID Controller Tuning – Real-World Guide

Related Topic