Re: [suse-oracle] Linux-Clustering

From: Martin Konold (martin.konold@erfrakon.de)
Date: Thu Sep 18 2003 - 23:11:29 CEST


From: Martin Konold <martin.konold@erfrakon.de>
Date: Thu, 18 Sep 2003 23:11:29 +0200
Message-Id: <200309182311.29485.martin.konold@erfrakon.de>
Subject: Re: [suse-oracle] Linux-Clustering


On Thursday 18 September 2003 09:23 am you wrote:

Hi Yves,

> Can you give some more examples of those HA features for SuSE Linux ?
> I know about "OpenMosix" !

OpenMosix clustering software is not typically used for HA setups but for
single-system-image HPC (high-performance computing) clusters. There are some
uncommon, non-DBMS scenarios where OpenMosix can improve availability.

> What other software can be used to cluster Linux Pc's ?

It pretty much depends on the kind of applications you want to make highly
available, the minimum downtime you can accept and the money/effort you want
to invest.

In a lot of cases HA can simply be gained by using application level
replication technologies e.g. the traditional database replication log
approach.

On the other hand there are shared media or shared nothing approaches.

Typical shared media technologies for Oracle are NFS (http://www.netapp.com/,
certified by Oracle), shared SCSI (can use either SCSI-IDE or SCSI-SCSI
enclosures, be careful about certification) or shared FC arrays. There are
plenty of Oracle-certified shared SCSI/FC-AL solutions available. (I most
often go for HP.)

Recently Oracle announced a cheap shared media solution based on Firewire:
http://otn.oracle.com/oramag/webcolumns/2003/techarticles/coekaertsfirewiresetup.html
According to my testing (we demoed this solution at CeBIT this year),
IEEE 1394 does not work reliably with two hosts on a single device in many
heavy-duty cases. AFAIK Oracle currently only uses this Firewire solution for
"experimental" RAC setups. I have no indication that Oracle actually
supports Firewire setups for production.

Depending on your requirements for data availability, it may be advisable to
have multiple paths to the shared storage so that there is no single point of
failure.

Last but not least there is the shared nothing approach, of which DRBD is the
most prominent implementation. In my opinion DRBD does not provide the
reliability and robustness required by an enterprise HA system.

Michael: Does Oracle provide support for DRBD setups?

If your application does not allow for the simple replication mechanism then
imho shared media is the most reliable solution.

For small dual node active/passive or active/active clusters Heartbeat plus
monitoring is definitely enough. Larger systems with many nodes are best
handled by more complex software like FailSafe or LifeKeeper. Both Heartbeat
and FailSafe are offered for SLES users.

Because DRBD was recommended on suse-oracle@suse.com I want to give some
background information on why DRBD is not enterprise ready.

DRBD makes several problematic assumptions:

1. DRBD assumes that a journalling filesystem guarantees a correct filesystem
regardless of the moment at which the node is switched off.

Unfortunately this is not always the case. A crash while some block is being
written to the disk does _not_ guarantee that the file is not corrupted. If
this file happens to be important you might be in trouble. DRBD does not know
about filesystems but only about blocks on a block oriented device.
Unfortunately for DRBD the Linux kernel is allowed to reorder blocks when
writing to disk.

Accessing the lower level block device is done with a temporary copy of the
buffer_head and a call to ll_rw_block. The Linux kernel is then free to
reorder the blocks (DRBD has no control over this).
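
To illustrate why reordering matters, here is a minimal toy model in Python.
It is only a sketch under my own assumptions, not DRBD or kernel code: a
transaction consists of one data block followed by one commit record, and the
block names and helper functions are hypothetical. It lists the write orders
for which some crash point leaves a commit record on disk without its data.

```python
from itertools import permutations

# Toy assumption: the data block must reach the disk before the commit record.
INTENDED_ORDER = ("data", "commit")

def consistent(on_disk):
    """A commit record without its data block counts as a corrupted state."""
    return not ("commit" in on_disk and "data" not in on_disk)

def crash_states(write_order):
    """All on-disk states reachable if the node dies after 0..n writes."""
    return [set(write_order[:k]) for k in range(len(write_order) + 1)]

def unsafe_orders():
    """Write orders for which at least one crash point is inconsistent."""
    return [order for order in permutations(INTENDED_ORDER)
            if not all(consistent(state) for state in crash_states(order))]

print(unsafe_orders())  # -> [('commit', 'data')]
```

In this model the intended order is safe at every crash point, while the
reordered one is not. A layer that only sees blocks cannot tell the two apart.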

2. DRBD assumes that stricter acknowledge semantics improve reliability,
while forgetting that stricter acknowledge semantics also cause tighter
coupling of the nodes and consequently blocking/starvation of the cluster
services.

DRBD tries to use three different modes (a, b, c) in order to address the
problem.

a) A block is considered to be on disk as soon as the data is on the local
disk of the active node and has been sent over the network to the passive
node. In mode a DRBD does not wait for confirmation that the block actually
arrived on the passive node and really was written to the harddisk of the
passive node.

The following scenario leads to loss of data and an inconsistent DB in the
case of transactions:

DRBD employs TCP for the transport layer. DRBD tells the DBMS on the active
node that the data arrived on the harddisk. The active/primary node fails
immediately afterwards. The primary node is then no longer capable of
completing the TCP transmission, and it is not guaranteed that the data
actually gets written to the harddisk of the passive/secondary node. When the
secondary node now takes over the DBMS, the consistency of the DB with
regard to transactions might be broken.

Conclusion: DRBD mode a is unacceptable for a DBMS.
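
The failure scenario for mode a can be sketched as a small simulation. This
is a toy model under my own assumptions, not the actual implementation: the
class ModeAPair, its fields and the block names are hypothetical, and "in
flight" simply stands for data handed to TCP but not yet delivered.

```python
from dataclasses import dataclass, field

@dataclass
class ModeAPair:
    primary_disk: list = field(default_factory=list)
    in_flight: list = field(default_factory=list)      # handed to TCP only
    secondary_disk: list = field(default_factory=list)
    acked: list = field(default_factory=list)          # what the DBMS was told

    def write(self, block):
        self.primary_disk.append(block)
        self.in_flight.append(block)
        self.acked.append(block)   # mode a: acknowledge without waiting for the peer

    def deliver(self):
        """The network delivers everything currently in flight."""
        self.secondary_disk.extend(self.in_flight)
        self.in_flight.clear()

    def primary_crash(self):
        """Undelivered data dies with the primary; return what the secondary lacks."""
        self.in_flight.clear()
        return [b for b in self.acked if b not in self.secondary_disk]

def demo():
    pair = ModeAPair()
    pair.write("txn-1")
    pair.deliver()
    pair.write("txn-2")          # acknowledged to the DBMS ...
    return pair.primary_crash()  # ... but it never reached the secondary

print(demo())  # -> ['txn-2']
```

The DBMS was told that txn-2 is safe, yet the node that takes over has never
seen it.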

b) This DRBD algorithm tries to keep the transactional properties of the DBMS
by delaying the acknowledgement to the DBMS until the passive node has sent
back an acknowledgement. This protocol slows down the primary server
significantly because it requires twice the network latency before a
transaction is finished. In addition the request queues of the Linux kernel
are limited, so the performance is really hurt.

A problem is that the passive node sends the acknowledgement before the data
is really written to the harddisk. In case of nearly simultaneous failures,
e.g. a lightning strike blows the fuse/UPS of the complete computer room and
takes down both nodes, there is some real danger of corrupting the
transactions of the DBMS.

Another drawback of this solution is that if the secondary node or the
intermediate network connection fails, the primary node comes to a halt. So
the failure of the passive node kills the operation of the otherwise fully
functional primary node. This can hardly be the goal of a redundant HA
cluster setup.

Conclusion: DRBD mode b is unacceptable for a DBMS.
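
The coupling problem of mode b can be sketched like this. Again this is only
a toy model with hypothetical names (ModeBPrimary, queue_limit), and the
queue bound is an illustrative number, not a real kernel value: every write
waits for the peer, the number of outstanding requests is bounded, and once
the peer stops answering the primary stalls.

```python
class ModeBPrimary:
    def __init__(self, queue_limit):
        self.queue_limit = queue_limit  # hypothetical bound on outstanding requests
        self.pending = []               # writes still waiting for the peer's ack
        self.completed = []             # writes acknowledged to the DBMS

    def submit(self, block):
        """Return True if the write is accepted, False if the primary must stall."""
        if len(self.pending) >= self.queue_limit:
            return False
        self.pending.append(block)
        return True

    def peer_ack(self):
        """The peer acknowledged the oldest pending write."""
        if self.pending:
            self.completed.append(self.pending.pop(0))

def writes_accepted_after_peer_failure(queue_limit, attempts):
    """The peer is down, so peer_ack() is never called."""
    primary = ModeBPrimary(queue_limit)
    return sum(primary.submit(f"block-{i}") for i in range(attempts))

print(writes_accepted_after_peer_failure(queue_limit=4, attempts=10))  # -> 4
```

With the peer gone, four writes are queued, none is ever completed, and all
further writes are refused although the primary itself is healthy.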

c) A block is only considered to be really written to disk when both the
active and the passive node have made sure that the data is physically on the
harddisk. In theory this should assure the transactional properties of the
DBMS, while being horribly slow in practice.

While this very slow mode c avoids the too-early acknowledgement problem of
mode b, it still suffers from the other drawback of mode b: a failure of the
passive node stops the operation of the active node.

Conclusion: DRBD mode c is unacceptable for a DBMS.
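
As a rough back-of-the-envelope comparison of the three modes, the time until
the DBMS gets its acknowledgement can be sketched as follows. The function
and its parameters are hypothetical, and I assume for simplicity that the
local write and the replication do not overlap; real numbers depend entirely
on the hardware and the network.

```python
def ack_latency_ms(mode, local_disk_ms, one_way_net_ms, remote_disk_ms):
    """Time until the DBMS sees the write acknowledged, in this toy model."""
    if mode == "a":
        # Acknowledge after the local write and the hand-off to the network.
        return local_disk_ms
    if mode == "b":
        # Wait for the peer's ack: one trip there, one trip back.
        return local_disk_ms + 2 * one_way_net_ms
    if mode == "c":
        # The peer only acks after its own disk write has finished.
        return local_disk_ms + 2 * one_way_net_ms + remote_disk_ms
    raise ValueError(f"unknown mode: {mode!r}")

for mode in "abc":
    print(mode, ack_latency_ms(mode, local_disk_ms=5, one_way_net_ms=1,
                               remote_disk_ms=5))
```

With these illustrative values the acknowledgement takes 5 ms in mode a, 7 ms
in mode b and 12 ms in mode c, and in modes b and c it never arrives at all
once the peer is down.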

Yours,
-- martin konold

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de
