Re: [suse-oracle] Linux-Clustering

From: Lars Marowsky-Bree (lmb@suse.de)
Date: Fri Sep 19 2003 - 12:25:05 CEST


Message-ID: <20030919102505.GJ3143@marowsky-bree.de>

On 2003-09-18T23:11:29,
   Martin Konold <martin.konold@erfrakon.de> said:

Martin's explanations about the other scenarios are correct. I'll focus
on the drbd details. As you can likely guess, I have to disagree with
some of his points, which seem to be either based on older experiences
or on conceptual misunderstandings of how data replication can work.

> Michael: Does Oracle provide service for DRBD setups?

This is being worked on.

> If your application does not allow for the simple replication mechanism then
> imho shared media is the most reliable solution.

Except that of course the shared media becomes the single point of
failure, unless you have a lot of money to throw at the problem.

> For small dual node active/passive or active/active clusters Heartbeat plus
> monitoring is definitely enough. Larger systems with many nodes are best
> handled by more complex software like FailSafe or LifeKeeper. Both Heartbeat
> and FailSafe are offered for SLES users.

FailSafe is not actively maintained for SLES8.

> DRBD makes several problematic assumptions:
>
> 1. DRBD assumes that journalling filesystems allow for a correct filesystem
> independent of the time when the switch off happens.

This is a generic assumption by _all_ failover scenarios, be it
implemented by SCSI fencing or by STONITH. If it does not hold, the
filesystem / application in question is simply buggy, and drbd cannot be
blamed for it.

drbd actually goes to quite some pain to preserve write ordering, as I
will explain below.

Notice that if this assumption is violated, a spontaneous reboot of a
single node (no cluster involved at all) would also lead to data loss.
This is NOT a drbd bug.

> Unfortunately this is not always the case. A crash while some block gets
> written to the disk does _not_ guarantee that the file is not corrupted.

True. Filesystem metadata journaling is not the same as application
data journaling.

However, again, this is a generic problem, and applications like mail
transfer agents and databases (in particular, Oracle ;) do ensure that
their files are appropriately synced to disk using the mechanisms
available (fsync, fdatasync, O_SYNC etc.). They have done so for a very
long time, and drbd does not violate these constraints.
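
To make the "synced to disk" point concrete, here is a minimal sketch in
Python of what such an application does before it reports a write as
committed. The function name durable_write and the file name used in the
usage note are hypothetical; only os.open, os.write, os.fsync and
os.close are real calls. It assumes a POSIX system where a directory can
be opened and fsync'ed.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync has succeeded.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also sync the containing directory so the new directory entry
    # itself survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would invoke durable_write("redo.log", record) and only then
tell the client that the transaction is committed. A replicating block
device has to keep that promise intact; it does not have to invent it.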

> Unfortunately for DRBD the linux kernel is allowed to reorder blocks when
> writing to disk.

You really should read up on how drbd is implemented.

First, the kernel is _NOT_ allowed to reorder writes across such
barriers as outlined above. That would be an ENORMOUS bug and break all
journaling applications, and you can be certain that you would have
heard about it before.

Second, drbd employs a dependency analysis algorithm (which was actually
Philipp's diploma thesis) to allow for _safe_ write reordering on the
replication target for higher performance, as explained here (though
it's in German):
http://www.complang.tuwien.ac.at/Diplomarbeiten/reisner00.ps.gz

This is in line with the guidelines Oracle provides for replicated
storage, see
http://otn.oracle.com/deploy/availability/htdocs/oscp_papers.html#Oracle_OSCP
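
The idea of safe reordering on the replication target can be sketched
roughly as follows. This is NOT the algorithm from the thesis; it is a
deliberately simplified stand-in that groups writes into epochs
separated by barriers and only reorders inside an epoch. The names
Write, BARRIER, split_into_epochs and apply_on_target are all
hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Write:
    sector: int
    data: bytes


# Hypothetical marker in the write stream: everything before it must be
# applied before anything after it.
BARRIER = None


def split_into_epochs(stream: List[Optional[Write]]) -> List[List[Write]]:
    """Group writes into epochs delimited by BARRIER markers."""
    epochs: List[List[Write]] = []
    current: List[Write] = []
    for item in stream:
        if item is BARRIER:
            if current:
                epochs.append(current)
                current = []
        else:
            current.append(item)
    if current:
        epochs.append(current)
    return epochs


def apply_on_target(stream: List[Optional[Write]],
                    disk: Dict[int, bytes]) -> List[int]:
    """Apply writes to a dict standing in for the disk.

    Writes are sorted by sector inside an epoch but never moved across
    a barrier. Returns the order in which sectors were written.
    """
    order: List[int] = []
    for epoch in split_into_epochs(stream):
        # sorted() is stable, so two writes to the same sector keep
        # their original relative order.
        for w in sorted(epoch, key=lambda w: w.sector):
            disk[w.sector] = w.data
            order.append(w.sector)
    return order
```

With the stream [Write(9, ...), Write(2, ...), BARRIER, Write(5, ...),
Write(1, ...)] the target writes sectors in the order 2, 9, 1, 5: sorted
within each epoch, with the barrier still respected.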

> 2.) DRBD assumes that stricter acknowledge semantics improve reliability
> while forgetting that stricter acknowledge semantics also cause tighter
> coupling of the nodes and consequently blocking/starvation of the cluster
> services.

Not more so than any raid1 / replication solution. But yes, drbd implies
a slight loss of performance: any replicated scenario will be slightly
slower than the same scenario without replication. There is absolutely
no doubt about it.

Now, whether that is a problem for a given scenario compared to the
benefit of replication is another story. If raw speed were all that
mattered, people would not run RAID at all, or at least only RAID1 and
never RAID5. But obviously, reliability benefits and (in the case of
RAID5 vs RAID1, for example) cost are also important to keep in mind.

> DRBD tries to use three different modes (a,b,c) in order to address the
> problem.

That is true, though only protocol C provides strict transactional
semantics, and no other protocol should be used for databases. Again,
this is outlined in the documentation, and any person configuring
anything but protocol C for database replication as we are discussing it
here has not done their homework at all.

You should not blame drbd for misconfigured systems; the other modes (A
& B) have their place in other scenarios (e.g., mode A can be used for
offsite near-realtime backups); nobody ever claimed they are relevant
for Oracle local failover scenarios.
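
The difference between the three modes comes down to which events must
have happened before the primary reports a write as complete. The
following Python sketch encodes that as a lookup table, following the
protocol definitions as the drbd documentation describes them. The event
names and the functions may_acknowledge and guaranteed_on_peer_disk are
hypothetical and only serve the illustration.

```python
from enum import Enum
from typing import Set


class Protocol(Enum):
    A = "A"
    B = "B"
    C = "C"


# Hypothetical event names for one replicated write.
# A: local write done and the packet handed to the local network stack.
# B: local write done and the packet received by the peer.
# C: local write done and the peer has written the block to its disk.
REQUIRED = {
    Protocol.A: {"local_disk", "handed_to_network"},
    Protocol.B: {"local_disk", "peer_received"},
    Protocol.C: {"local_disk", "peer_on_disk"},
}


def may_acknowledge(protocol: Protocol, completed: Set[str]) -> bool:
    """True once every event the protocol waits for has completed."""
    return REQUIRED[protocol] <= completed


def guaranteed_on_peer_disk(protocol: Protocol) -> bool:
    """True if an acknowledged write is known to be on the peer's disk."""
    return "peer_on_disk" in REQUIRED[protocol]
```

Only protocol C returns True from guaranteed_on_peer_disk, which is why
it is the one to use when a database's commit must survive the loss of
the primary.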

> c.) a block is only then considered to be really written to disk when both
> the active and the passive node made sure that the data is physically on the
> harddisk. This should, while being horribly slow in practice, in theory
> assure the transactional properties of the DBMS.

It is not horribly slow in practice. Please provide benchmark numbers to
back up your claims.

I've seen write speeds of 70MB/s (which is all my disks and GigE
interconnect could take) with protocol C. Remember latency is typically
very good with local GigE interconnects.

Sure, it's slower than direct disk access (as is any RAID); however,
whether it is "horribly slow" depends entirely on the target scenario,
i.e. the ratio of writes to reads, the ratio of barriers to writes, the
dependencies within those barriers on other barriers etc.
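
As a back-of-the-envelope illustration of why the read/write ratio
matters, here is a small Python model. It is a sketch under stated
assumptions, not a benchmark: reads are served locally, and a
synchronously replicated write completes when the slower of the local
write and the (round trip + remote write) path has finished. The
function name and all parameters are hypothetical, and barrier effects
are ignored entirely.

```python
def mean_io_latency_ms(read_fraction: float,
                       read_ms: float,
                       local_write_ms: float,
                       remote_write_ms: float,
                       rtt_ms: float) -> float:
    """Average latency per I/O request under the assumptions above."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    # Assumption: local and remote writes run in parallel, so the
    # acknowledgement waits for whichever path finishes last.
    write_ms = max(local_write_ms, rtt_ms + remote_write_ms)
    return read_fraction * read_ms + (1.0 - read_fraction) * write_ms
```

Feeding in measured values for one's own disks and interconnect shows
how little the replication path contributes to the average when the
workload is dominated by reads.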

No, you probably will not use drbd if you require extremely high
sustained data rates. But does the average database really see so many
writes? How does that tradeoff compare to the benefit of having a
completely physically independent (to the extent that is possible while
being in the same city and barring nuclear strikes ;), realtime and
transactionally consistent replica of the database available for
failover?

I will also not claim drbd is entirely bug free. Neither are Oracle,
reiserfs or ext3, and I recall that some large ISP was bitten quite
severely by the EMC RDF facility a year or two back. So, evaluation and
testing for a particular scenario is an absolute must.

However, an absolute claim like "horribly slow in practice and thus not
acceptable" is snake oil, just like it would be to claim that drbd is
the silver bullet for all replication scenarios.

> While this very slow mode c avoids the problems of too early acknowledgment
> of mode b it still suffers from the other drawback of mode b. Which means
> that a failure of the passive node stops the operation of the active node.

It only briefly delays the active node until the failure of the
secondary has been detected. Again, this is an inherent problem of all
online replication solutions, which requires _lots_ of money and quite
difficult algorithms to solve.
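
The "brief delay until the failure has been detected" can be pictured
with the following Python sketch. It is not how drbd implements failure
detection; it only shows the general shape: the primary waits for an
acknowledgement up to a timeout, and once the timeout expires it stops
waiting for the peer and carries on alone. The Primary class, its
attributes and the use of a queue as a stand-in for the network are all
hypothetical.

```python
import queue
from typing import List


class Primary:
    """Toy primary node that waits for peer acks with a timeout."""

    def __init__(self, acks: "queue.Queue[str]", timeout_s: float) -> None:
        self.acks = acks            # stand-in for acks arriving from the peer
        self.timeout_s = timeout_s  # how long to wait before giving up
        self.connected = True
        self.unreplicated: List[int] = []  # sectors to resend later

    def write(self, sector: int) -> str:
        if self.connected:
            try:
                # Block until the peer acknowledges or the timeout expires.
                self.acks.get(timeout=self.timeout_s)
                return "replicated"
            except queue.Empty:
                # Peer considered dead: only this one write pays the delay.
                self.connected = False
        # Degraded mode: complete locally and remember what the peer missed.
        self.unreplicated.append(sector)
        return "standalone"
```

Only the first write after the peer disappears blocks for timeout_s;
later writes complete immediately in degraded mode until the peer
returns and is resynchronised.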

I highly recommend "Distributed Algorithms" by Nancy Lynch or, in
general, any paper on "Agreement and Consistency in partially
synchronous distributed systems" as bedtime reading. In particular, the
"impossibility results" are quite worth reading. (And I'll admit I have
not understood >50% of the other proofs ;-)

Martin, please, I know you have the scientific background to argue
better than this - with less bias, better understanding and better fact
checking. Please.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering		ever tried. ever failed. no matter.
SuSE Labs				try again. fail again. fail better.
Research & Development, SuSE Linux AG		-- Samuel Beckett



