NVMe Queue Depth vs FIO Iodepth Argument Relation


This question is about how the fio (flexible I/O tester) utility manages I/O queues for NVMe storage (SSDs in particular) when using the libaio engine.

For testing I am using Ubuntu 14.04 and a commercial NVMe SSD.

I run the fio experiment with the following arguments:

direct=1
numjobs=256
iodepth=1
ioengine=libaio
group_reporting
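
For reference, the same arguments can be packaged into a complete job file roughly like the sketch below. The device node, block size, access pattern, and runtime are placeholders I have filled in for illustration; they are not part of the arguments above:

[global]
direct=1
ioengine=libaio
group_reporting
# assumed access pattern and block size
rw=randread
bs=4k
# assumed run length
runtime=60
time_based

[nvme-test]
# placeholder device node
filename=/dev/nvme0n1
numjobs=256
iodepth=1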

For this example, assume the NVMe device advertises/supports the creation of 10 I/O queues with a maximum queue depth of 64.
You may assume all 10 queue creations succeed as part of the initialization.
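
(As a side note, the advertised numbers can be confirmed with nvme-cli rather than assumed; /dev/nvme0 below is a placeholder and the exact output format depends on the nvme-cli version:)

# Number of Queues feature (FID 0x07): how many I/O submission/completion queue pairs the controller allows
nvme get-feature /dev/nvme0 -f 0x07 -H
# Controller registers: the CAP.MQES field gives the maximum entries per queue
nvme show-regs /dev/nvme0 -H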

Based on the above parameters and constraints, the question is: "How would fio use the queues for I/O commands?" Or, to narrow the scope: "Does the iodepth argument in fio directly equate to nvme_queue_depth for this test?"

I expect something like the scenarios below to be going on under the hood, but I do not have the right information.

example scenario 1:
Does fio generate 256 jobs/threads which try to submit I/O to the nvme_queues and try to keep at least 1 I/O command in each of the 10 nvme_queues at any point? If a job sees that a queue is full (i.e. one I/O command is already present in an nvme_queue), does it just try to submit to the other 9 nvme_queues, or perhaps round-robin until it finds an empty queue?

example scenario 2:
Do fio's 256 threads/jobs not really respect iodepth == nvme_queue_depth for this test and submit multiple I/Os anyway? For example, each thread just submits 1 I/O command to the nvme_queues without any check on the depth of commands already in the 10 nvme_queues. In other words, the 256 threads try to maintain roughly 25 or 26 I/Os pending/in flight in each of the 10 nvme_queues.

Link to the definition of iodepth in the fio documentation.

Is either scenario true? And is there a way to ascertain this with an experiment?

Going through the NVMe specification and the fio documentation, neither clearly states how this scenario is handled; both are vague on the point.

Update:
Below are the two scenarios in image format:
https://imgur.com/OCZtwgM (unable to embed) — the top one is scenario 1 and the bottom is scenario 2.

Update 2:
Sorry if the question is on the vague side, but I will attempt to improve on it by expanding a bit further.

My understanding is that many layers sit between fio and the device, so the question spans several layers of code and/or protocol that are at play here. To list them:
1. fio (the application) and libaio
2. Linux kernel/OS
3. NVMe driver
4. The storage device's SSD controller and its code

The two scenarios explained above are a vague attempt to answer my own question at a very high level, as I am by no means an expert on the layers mentioned above.

According to the answer below, it seems scenario 1 is loosely relevant. I wanted to know a tiny bit more regarding the general policy and predictability through all the layers. Partial explanations are OK, hopefully combining into a complete one.

So a third, naive rephrasing of the question would be: "How does fio issue traffic, and how does it really end up in the storage nvme_queues?"

Best Answer

TL;DR: the path from userspace I/O submission to the point where I/O leaves the kernel is described in the "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems" paper.

"How would fio use the queues, for I/O commands?" OR to narrow down the scope "Does the iodepth argument in fio directly equate to nvme_queue_depth for the test"

This thinking is a bit too woolly and I'd caution against it. "How would fio use the queues, for I/O commands?" - that's not just down to fio. "Does the iodepth argument in fio directly equate to nvme_queue_depth for the test" - maybe, but probably not? In your example fio submits I/O to the kernel, but the kernel may in turn choose to transform that I/O before it submits it to the disk (see this answer to "What does iodepth in fio tests really mean? Is it the queue depth?").

scenario 1: does fio generate 256 jobs/threads which try to submit i/o to nvme_queues and try to keep atleast [sic] 1 i/o cmd in the 10 nvme_queues at any point

Sort of (but s/jobs/processes/). Each job (which will likely map to a process if your example is complete) is submitting at most one I/O to the kernel. However, bear in mind that fio has zero knowledge of what your disk can do and is at the mercy of whatever your kernel chooses to do with regard to scheduling both its I/O and each individual fio process.
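
(The "likely" is because this is configurable: if the job file also set fio's thread option, each job would be created as a pthread rather than a forked process. A minimal sketch, reusing the example's job section:)

[nvme-test]
# with "thread" set, fio creates the 256 jobs as threads instead of forked processes
thread
numjobs=256
iodepth=1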

example scenario 2

I'm afraid I don't understand, as you didn't give a separate example. Does "iodepth" in this case refer to fio's iodepth parameter?

Is either scenario true?

I don't understand your scenario 2, and scenario 1 will have a lot of overhead; we also don't know what the kernel will do to the I/O, so I don't think this can be answered definitively (for example, I'm a bit suspicious that your NVMe disks are appearing with SCSI device nodes...).

I'd recommend looking at what iostat says while the fio jobs are running to get some idea, but doing one I/O per job is a VERY inefficient way to submit I/O and will incur a lot of overhead (which will likely prevent you from reaching the best speeds). Generally, you arrange for a job to submit as much I/O as it can, and ONLY then do you introduce additional jobs; see the sketch below.
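
As a rough illustration (the device node, block size, and depth below are placeholders, not a recommendation for your specific disk), you could compare the two approaches while watching the in-flight request count reported by iostat:

# in one terminal: extended statistics once per second (look at the queue-size/utilisation columns)
iostat -x 1
# in another terminal: a single job driving a deeper queue, instead of 256 jobs at iodepth=1
fio --name=deepqueue --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --numjobs=1 --iodepth=32 --runtime=60 --time_based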

Update

Going by the diagram you included in your update, we're more or less in what you think of as scenario 2 (256 processes each submitting one I/O, and the NVMe disk has multiple queues each with a separate depth rather than one giant queue), but imagine a funnel from userspace into the kernel and another (different) funnel out from the kernel to the queues of the "disk" (let's ignore HW RAID etc.), rather than some sort of one-to-one mapping from userspace process to disk queue. Unless you are using a userspace driver for your disk, you will have very little control over which queue a given I/O will end up in, as the kernel abstracts that decision away from you. For example, your processes may wander across CPUs (at the kernel's whim) and you may even have an unbalanced number of processes on one physical CPU (compared to the number on other CPUs), further muddying the waters.
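
If you want to reduce at least the process-migration part of that variability, fio can pin jobs to CPUs; a minimal sketch, assuming a 4-CPU machine (the CPU range, device node and depths are placeholders), would be something like:

# restrict the jobs to CPUs 0-3 and split those CPUs evenly between the jobs
fio --name=pinned --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --iodepth=8 --cpus_allowed=0-3 --cpus_allowed_policy=split --runtime=60 --time_based

Note this only controls where the submitting processes run; it still does not give you direct control over which NVMe hardware queue the kernel picks for a given I/O.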

And wanted to know tiny bit more regarding general policy and predictability through all the layers

You will likely have to trace your way through the kernel block layer to determine this, while being aware of the kernel version you are using. Additionally, some Ubuntu 14.04 kernels weren't multi-queue aware (see this LWN article for a brief summary of blk-mq) for NVMe disks (I think blk-mq support for NVMe devices arrived in the 3.19 kernel), and this harkens back to my "I'm a bit suspicious that your NVMe disks are appearing with SCSI device nodes" comment. There's a paper called "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems" that discusses the changes and benefits of blk-mq in more detail and covers areas 1-3. You are likely better off asking another question if you want details on area 4, but Coding for SSDs gives a starter summary.
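
A quick way to check what your particular kernel is doing (nvme0n1 is a placeholder, and the exact sysfs layout varies between kernel versions) is to look at how the device is exposed:

# kernel version, and whether the disk shows up as nvme* or as a SCSI-style sd* node
uname -r
ls /sys/block/
# on blk-mq-aware kernels there is one directory per hardware dispatch queue
ls /sys/block/nvme0n1/mq/
# blk-mq devices of that era typically report "none" here
cat /sys/block/nvme0n1/queue/scheduler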

How does fio issue traffic, and how does it really end up in the storage nvme_queues?

It depends on the I/O engine and the configuration you chose :-) However, for the example configuration you gave, figures 2 and 5 of the "Introducing Multi-queue SSD Access on Multi-core Systems" paper cover what happens below fio, and the "What does iodepth in fio tests really mean? Is it the queue depth?" answer covers what happens within fio itself.