[suse-oracle] Storage replication with DRBD - some numbers

From: Lars Marowsky-Bree (lmb@suse.de)
Date: Thu Sep 25 2003 - 15:08:29 CEST


Date: Thu, 25 Sep 2003 15:08:29 +0200
From: Lars Marowsky-Bree <lmb@suse.de>
Message-ID: <20030925130829.GA27825@marowsky-bree.de>
Subject: [suse-oracle] Storage replication with DRBD - some numbers

Hi Martin,

as promised I have gone and run some benchmarks on top of drbd to
measure the introduced latency and throughput losses, in order to
estimate the impact of using DRBD underneath Oracle.

Next on the radar will be a TPC like benchmark (OSDL DBT-2 dbt2/dbt3),
but I think these numbers already provide some good hints at the
expected results.

I will describe the test scenario, give the raw numbers, provide general
findings and finally an analysis of how this affects Oracle
workloads.

1. Test scenario description

For testing, I used two HP DL580s. Each of these has quad Xeons at 1.6GHz
(with HT enabled), 4GB of RAM and a local storage subsystem via the
CCISS driver in RAID1+0.

I set up a 9GB Logical Volume on each via LVM; the filesystem used on top
was ext2 with -T largefile4 to minimise fs impact. I believe that ext2 +
LVM on top of hardware RAID is a good approximation for a typical
install.

The test tool used was "tiobench", a tool for measuring throughput and
IO latency, with a working set size of 6GB and 4, 8 or 16 working
threads, performing 4096 random IO operations for reads or writes
respectively. Each test run was repeated three times and the numbers
averaged.

These testruns were run on the raw local disk as the reference. They
were repeated with drbd in disconnected mode (which measures just the
local overhead introduced by drbd and not the replication cost) and
protocol A, B and C.
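
As a quick sanity check on the size of this test matrix, the setup
described above can be enumerated in a few lines. This is only a sketch:
the configuration labels and the helper names build_matrix and
mean_of_runs are made up for illustration, and no tiobench options are
shown or implied.

```python
from itertools import product
from statistics import mean

# Illustrative labels mirroring the five setups described above.
CONFIGS = ["raw disk", "drbd disconnected",
           "protocol A", "protocol B", "protocol C"]
THREADS = [4, 8, 16]
REPEATS = 3  # each test run was repeated three times


def build_matrix():
    """Return every (config, threads, repeat) combination."""
    return list(product(CONFIGS, THREADS, range(1, REPEATS + 1)))


def mean_of_runs(samples):
    """Average the repeated measurements of one (config, threads) cell."""
    if len(samples) != REPEATS:
        raise ValueError(f"expected {REPEATS} samples, got {len(samples)}")
    return mean(samples)


if __name__ == "__main__":
    # 5 configurations * 3 thread counts * 3 repeats
    print(len(build_matrix()), "runs in total")
```

Each (config, threads) cell then yields one averaged figure per metric,
which is what the attachment reports.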

The network interconnect between the two nodes was a switched GigE
connection, with a raw avg ping RTT of ~0.13ms.

2. Raw numbers

See attachment.

3. General findings

As expected, the raw disk numbers and the disconnected numbers are
almost the same (within statistical accuracy on this scenario). This
implies that drbd doesn't introduce any local hotspots unrelated to
replication.

The same is true for all read numbers; as drbd hands all reads off to
the local disk directly, no replication cost is introduced at all.
Again, this is expected.

The 16 threads sequential read has one number which is somewhat off,
namely the no-drbd one; this seems to be a bad testrun out of the three,
but it's not so bad that the test would have needed to be repeated.

Now, to the writes via protocol A, B and C, which is where - as expected
- drbd does have an impact.

The first obvious finding: A & B fall out of the pattern completely for
sequential writes, which is interesting. To remind you, protocol A is
the one with the lowest integrity guarantees, signaling completion as
soon as the packet has left the local send buffers, and B as soon as
the recipient has acknowledged the receipt. Both should in theory be
faster than protocol C. That they are not is unexpected and warrants
further investigation by the drbd development group.

Protocol C, the one with the highest integrity guarantees (and the only
one of interest for transactional consistency) is OK. As it's the only
one Oracle customers should care about, the test is still good.
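
To make the difference between the three protocols concrete, here is a
deliberately idealised model of when a single write is signalled as
complete. It is a sketch under stated assumptions, not a description of
the drbd code: the local write and the replication are assumed to start
together and run in parallel, handing a packet to the send buffer is
treated as free, and protocol C is taken to complete once the peer has
written the block to its disk (which the text above does not spell out).
The function name write_latency_ms and all example values are
hypothetical.

```python
from enum import Enum


class Protocol(Enum):
    A = "A"  # complete once the packet has left the local send buffers
    B = "B"  # complete once the peer has acknowledged receipt
    C = "C"  # assumed: complete once the peer has written to its disk


def write_latency_ms(protocol, local_write_ms, rtt_ms, remote_write_ms):
    """Idealised completion time of one replicated write, in ms.

    Assumes the local write and the replication proceed in parallel, so
    the caller waits for whichever of the two finishes last.
    """
    if protocol is Protocol.A:
        replication_ms = 0.0  # send-buffer hand-off treated as free
    elif protocol is Protocol.B:
        replication_ms = rtt_ms
    elif protocol is Protocol.C:
        replication_ms = rtt_ms + remote_write_ms
    else:
        raise ValueError(f"unknown protocol: {protocol!r}")
    return max(local_write_ms, replication_ms)


if __name__ == "__main__":
    # Placeholder timings, not measurements from this benchmark.
    for proto in Protocol:
        print(proto.name, write_latency_ms(proto, 5.0, 0.13, 5.0))
```

In this model A can never be slower than B, and B never slower than C,
which is why the measured sequential write numbers for A and B are
surprising.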

Now, for the sequential writes, where one might naïvely expect the least
slowdown, we actually see the highest (compared to random writes, which
we'll get to in a second). However, throughput and avg latency are
not too far off from the raw numbers. The gap is ~5% for 4 worker
threads and up to ~10% for 16 working threads.

My explanation for this is that this is due to the drbd barrier analysis
working too well. On the primary, due to the activity from the worker
threads themselves, the kernel starts flushing buffers to disk earlier
and does not wait until the barrier / sync occurs; on the secondary, we
can cache more (due to lack of the worker threads themselves), and thus
have to flush more when the barrier arrives. This would also explain why
the effect gets stronger with more worker threads.

Now, maximum latency is indeed worse. But this is to be expected after
the former explanation: The normal writes go through just fine, but the
final write with the barrier incurs a higher latency due to the remote
buffer flush.

Overall, the IO throughput is still pretty good. As expected: We have a
theoretical limit of ~90MB/s on the wire, but the raw disk can only
deliver 25MB/s writes. That's more than a factor of three, so the wire
is not a likely bottleneck.
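
The headroom argument is simple arithmetic on the two figures quoted
above. The sketch below merely spells it out; the function names
headroom and bottleneck are hypothetical.

```python
WIRE_LIMIT_MB_S = 90.0  # theoretical GigE limit quoted above
DISK_WRITE_MB_S = 25.0  # raw disk write throughput quoted above


def headroom(wire_mb_s, disk_mb_s):
    """How many times faster the wire is than the disk."""
    if disk_mb_s <= 0:
        raise ValueError("disk throughput must be positive")
    return wire_mb_s / disk_mb_s


def bottleneck(wire_mb_s, disk_mb_s):
    """Name the slower of the two components."""
    return "network" if wire_mb_s < disk_mb_s else "disk"


if __name__ == "__main__":
    ratio = headroom(WIRE_LIMIT_MB_S, DISK_WRITE_MB_S)
    print(f"wire/disk headroom: {ratio:.1f}x")
    print("bottleneck:", bottleneck(WIRE_LIMIT_MB_S, DISK_WRITE_MB_S))
```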

Even for the raw numbers, it is interesting to see how latency gets
considerably worse with more worker threads. My explanation for this is
that the kernel combines writes quite successfully (the actual
throughput is pretty good and doesn't decline as much), but that more
continuous IO resources being available for each task also implies
higher latency in between.

Now to the random writes. Consistently, throughput is _way_ down from
sequential writes. Again, as expected: The kernel cannot combine random
writes together. The latency is way down at the same time, presumably
because of the same effect: Head seek time is more evenly distributed.

Again, we see a strong decline for IO latency of 4 vs 16 threads. The
avg latency is pretty low, but some really bad seeks seem to tip the
scale. Going from 4 to 8 threads totally kills maximum latency! I need
to go and count, but my guess for this would be that the storage
subsystem in question has effectively 4 heads, and going beyond that of
course implies a tremendous slowdown. ("More spindles are always
good")

Now, drbd protocol C is almost on par with the raw disks. Sometimes, the
numbers even suggest that it is better. How can that be; surely drbd does
incur some overhead?

Yes, but at the same time, the overhead introduced by drbd is orders of
magnitude smaller than the overhead introduced by physical disk head
seek times. Compare the latency of a disk head to an almost constant
0.13ms Round Trip Time and you see what I mean.
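
A rough way to see this is to compute how much of a single random
write's latency the network round trip accounts for. The 0.13ms RTT is
the measured figure from above; the seek times in the loop are assumed
values picked purely for illustration and were not measured in this
test. The function name replication_share is hypothetical.

```python
RTT_MS = 0.13  # measured average ping RTT between the two nodes


def replication_share(seek_ms, rtt_ms=RTT_MS):
    """Fraction of one write's latency spent on the network round trip.

    Simplification: total latency is modelled as seek time plus one RTT.
    """
    if seek_ms < 0 or rtt_ms < 0:
        raise ValueError("times must be non-negative")
    return rtt_ms / (seek_ms + rtt_ms)


if __name__ == "__main__":
    for seek_ms in (2.0, 5.0, 10.0):  # assumed seek times, not measured
        share = replication_share(seek_ms)
        print(f"seek {seek_ms:4.1f} ms -> network share {share:.1%}")
```

The longer the seek, the smaller the share of the round trip, which is
consistent with the differences disappearing in the noise.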

The variation we see here seems to be attributable to mostly statistical
noise.

4. Conclusions

First, drbd does not much affect reads or disconnected writes. So there
is no inherently slow path in drbd itself. Protocol A and B are
unusable (and justify further analysis), but of no interest for
database workloads anyway, which demand the integrity guarantees of
protocol C.

For a real workload, which consists of a mixture of reads and writes,
both sequential and random, and of course doesn't run them separately
like tiobench, the gap between DRBD and raw disk is likely not
statistically significant: I'd offer a 1-5% estimate.

The worst-case (sequential writes with a larger number of working
threads) is not all that common in practice, and would also likely not
be as severe as it was in this test setup; just some random IO in
between going to the same disk would (according to the explanation I
gave above, at least) reduce the gap tremendously. But of course, this
worst case needs to be improved.

A real workload will be totally dominated by the physical disk seek
latency though and not by the comparatively small network latency. This
holds true for scenarios where network bandwidth and latency exceed or
are close to the storage subsystem limits; this is likely the case for
'typical' smaller installations. A storage subsystem delivering >1GigE
isn't all that common there, yet, and a GigE card (or even two) is
cheap.

And keep in mind that typically, the system will be doing something else
in addition to raw disk IO too; if there's some actual computation
happening, this will even further diffuse the gap between raw disk and
DRBD.

As to Martin's "horribly slow in practice" I can only disagree again.
This may be true for replication over Fast Ethernet (with an order of
magnitude higher latency and an order of magnitude less bandwidth
compared to GigE), but it does not seem to affect current setups;
depending on your actual IO requirements, even FE may be enough though.

        So, I conclude with the finding that this benchmarking round
        suggests that DRBD is indeed a good (technical) choice for small
        to medium replication scenarios, even for databases.

Thanks again to Martin for motivating me to run these tests. I had been
meaning to for a long time, but nothing like a provocation to get one
going. ;-)

I encourage you to provide feedback on the findings and to repeat some
benchmark runs yourself.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering		ever tried. ever failed. no matter.
SuSE Labs				try again. fail again. fail better.
Research & Development, SUSE LINUX AG		-- Samuel Beckett






