I am trying to learn about high-reliability clusters (for IIS and Sql in my
case), but there is a fundamental thing I don't understand about the approach
that keeps me from swallowing the high-reliability kool-aid:
Currently, I have a "non-high-reliability" (low-reliability?) system --
i.e., one server that handles IIS, Sql, and the data store all at once. This
server uses RAID 1 and has redundant power supplies. Let's assume that the
probability of catastrophic failure of this server is P.
Now I move to a high-reliability solution. Microsoft suggests at least two
machines handling IIS in an active/active configuration, two machines
handling Sql in an active/passive configuration, and one SAN-style data
store running RAID 1 (at minimum) with redundant power supplies. For the
sake of argument, let's assume that all five of these new boxes have similar
complexity both to each other and to the server in my original
low-reliability system, and therefore that the probability of individual
failure of any of these servers is also P.
The weakness of my original low-reliability system is that there is a single
point of failure (the server) which may fail catastrophically with
probability P. But my new high-reliability system also has a single point of
failure (the storage disk array) which also may fail catastrophically with
probability P. What exactly have I gained?
Michael Carr
Most SAN storage arrays have multiple controllers, multiple SAN switches,
and multiple disks configured in a redundant RAID array. Multiple components
would therefore have to fail before you lost the array completely.
So, in your example, although the storage array may look like a single point
of failure, its internal redundancy makes it far less likely to fail
completely than any of your other components.
HTH
Lee
"Michael Carr" <mcarr@.umich.edu> wrote in message
news:uX2oCtewFHA.2792@.tk2msftngp13.phx.gbl...
>I am trying to learn about high-reliability clusters (for IIS and Sql in my
>case) but there is a fundamental thing I don't understand about the
>approach that keeps me from swallowing the high-reliability kool-aid:
> Currently, I have a "non-high-reliability" (low-reliability?) system --
> i.e., one server that handles IIS, Sql, and data store all at once. This
> server uses RAID 1 and has redundant power supplies. Let's assume that the
> probability of catastrophic failure of this server is P.
> Now I move to a high-reliability solution. Microsoft suggests at least two
> machines handling IIS in an active/active configuration, two machines
> handling Sql in an active/passive configuration, and one SAN-style data
> store running RAID 1 (at minimum) with redundant power supplies. For the
> sake of argument, let's assume that all five of these new boxes have
> similar complexity both to each other and to the server in my original
> low-reliability system, and therefore that the probability of individual
> failure of any of these servers is also P.
> The weakness of my original low-reliability system is that there is a
> single point of failure (the server) which may fail catastrophically with
> probability P. But my new high-reliability system also has a single point
> of failure (the storage disk array) which also may fail catastrophically
> with probability P. What exactly have I gained?
> Michael Carr
>
Let's focus on the storage array first. A standard SCSI enclosure runs on a
single backplane with a single controller. You can run a multi-channel
controller card to two enclosures set up with RAID 1+0 across the enclosures,
but you are still on a single controller. A mid-range SAN typically has two
internal controllers designed to check and supplement each other. You also
have multiple attachments (HBAs) from the host computer to the SAN, multiple
power supplies, and multiple enclosure racks. Basically, a well-designed
SAN implementation looks like a single array, but in reality the only
non-redundant part of the SAN is the sheet-metal enclosure. Those have a
very low failure rate once they are installed.

You can still drop your main production database through operator error, but
that is where processes and procedures either help or hurt your availability.
BTW, clustering IIS servers is a waste of time and money. Those are
typically set up as a pool of stateless servers where computers can be
dropped offline or brought online and the worst a user sees is a slow
response or maybe a retry. The idea of dedicating a server to each task is
for scalability, stability, and manageability.
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
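To put rough numbers on the redundancy Lee and Geoff describe, here is a
minimal back-of-the-envelope sketch in Python. Every failure probability in
it is a made-up illustrative value, not a vendor figure, and it assumes
component failures are independent, which real arrays only approximate:

# Back-of-the-envelope model of the SAN described above, built from
# redundant pairs of components. All numbers are illustrative assumptions
# over some fixed time window, and failures are assumed independent.

def whole_pair_fails(p_unit, copies=2):
    """Probability that every redundant copy of one component type fails."""
    return p_unit ** copies

# Hypothetical per-unit failure probabilities and redundancy counts.
components = {
    "controller":    (0.02, 2),
    "hba_path":      (0.02, 2),
    "power_supply":  (0.01, 2),
    "mirrored_disk": (0.03, 2),   # the RAID 1 pair
}

# The array is completely lost only if, for some component type, *all* of
# its redundant units fail.
p_survives = 1.0
for p_unit, copies in components.values():
    p_survives *= 1 - whole_pair_fails(p_unit, copies)
p_array_lost = 1 - p_survives

p_worst_unit = max(p for p, _ in components.values())

print(f"worst single component fails with probability ~ {p_worst_unit:.4f}")
print(f"whole redundant array lost with probability   ~ {p_array_lost:.4f}")

With these assumed numbers, losing the whole array is more than an order of
magnitude less likely than losing any one of its parts.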
"Michael Carr" <mcarr@.umich.edu> wrote in message
news:uX2oCtewFHA.2792@.tk2msftngp13.phx.gbl...
>I am trying to learn about high-reliability clusters (for IIS and Sql in my
>case) but there is a fundamental thing I don't understand about the
>approach that keeps me from swallowing the high-reliability kool-aid:
> Currently, I have a "non-high-reliability" (low-reliability?) system --
> i.e., one server that handles IIS, Sql, and data store all at once. This
> server uses RAID 1 and has redundant power supplies. Let's assume that the
> probability of catastrophic failure of this server is P.
> Now I move to a high-reliability solution. Microsoft suggests at least two
> machines handling IIS in an active/active configuration, two machines
> handling Sql in an active/passive configuration, and one SAN-style data
> store running RAID 1 (at minimum) with redundant power supplies. For the
> sake of argument, let's assume that all five of these new boxes have
> similar complexity both to each other and to the server in my original
> low-reliability system, and therefore that the probability of individual
> failure of any of these servers is also P.
> The weakness of my original low-reliability system is that there is a
> single point of failure (the server) which may fail catastrophically with
> probability P. But my new high-reliability system also has a single point
> of failure (the storage disk array) which also may fail catastrophically
> with probability P. What exactly have I gained?
> Michael Carr
>
|||"Michael Carr" <mcarr@.umich.edu> wrote in message
news:uX2oCtewFHA.2792@.tk2msftngp13.phx.gbl...
>I am trying to learn about high-reliability clusters (for IIS and Sql in my
>case) but there is a fundamental thing I don't understand about the
>approach that keeps me from swallowing the high-reliability kool-aid:
> Currently, I have a "non-high-reliability" (low-reliability?) system --
> i.e., one server that handles IIS, Sql, and data store all at once. This
> server uses RAID 1 and has redundant power supplies. Let's assume that the
> probability of catastrophic failure of this server is P.
> Now I move to a high-reliability solution. Microsoft suggests at least two
> machines handling IIS in an active/active configuration, two machines
> handling Sql in an active/passive configuration, and one SAN-style data
> store running RAID 1 (at minimum) with redundant power supplies. For the
> sake of argument, let's assume that all five of these new boxes have
> similar complexity both to each other and to the server in my original
> low-reliability system, and therefore that the probability of individual
> failure of any of these servers is also P.
> The weakness of my original low-reliability system is that there is a
> single point of failure (the server) which may fail catastrophically with
> probability P. But my new high-reliability system also has a single point
> of failure (the storage disk array) which also may fail catastrophically
> with probability P. What exactly have I gained?
P, in your case, is the probability of your non-HA environment failing. P',
in this case, is the probability of the SAN array failing.
P' is extremely low because all internal channels of a SAN have multiple
paths, all drives configured within the SAN are RAID-protected at one level
or another, and all power supplies and fans are highly redundant.
Basically, in a high-end SAN, it would take several failed components to
cause a failure of the SAN itself. Since the SAN is so vital, it will
normally have monitoring solutions keeping an eye on it and, in most cases,
a modem that dials the vendor to report any failure or pre-failure
conditions.
So P is much higher than P'.
Russ Kaufmann
MVP - Windows Server - Clustering
http://www.clusterhelp.com - Cluster Website
http://msmvps.com/clusterhelp - New Blog
http://spaces.msn.com/members/russkaufmann - Old Blog
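To spell out what that buys at the whole-system level, here is a companion
sketch under the same caveats (independent failures, made-up illustrative
numbers). The clustered design as a whole is lost only if both IIS nodes
fail, or both Sql nodes fail, or the SAN itself is lost, so its overall
failure probability is roughly P^2 + P^2 + P', which is far below P whenever
P is small and P' is much smaller than P:

# Rough comparison of the two architectures from the original post.
# P and P_prime are assumed values for illustration only.

P = 0.03         # assumed probability that one ordinary server fails
P_prime = 0.002  # assumed probability that the redundant SAN is lost

# Old system: one box does everything, so it is down whenever that box fails.
p_old = P

# New system: down only if BOTH IIS nodes fail, or BOTH Sql nodes fail,
# or the SAN is lost. The sum approximates the union because all three
# terms are small.
p_new = P**2 + P**2 + P_prime

print(f"old single-server system fails with probability ~ {p_old:.4f}")
print(f"clustered system fails with probability         ~ {p_new:.4f}")

That difference is what the extra boxes buy: the only remaining "single point
of failure" is a device whose own effective failure probability has already
been driven down by internal redundancy.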
|||"Geoff N. Hiten" <SRDBA@.Careerbuilder.com> wrote in message
news:OZ4pC$fwFHA.3756@.tk2msftngp13.phx.gbl...
> BTW, clustering IIS servers is a waste of time and money. Those are
> typically set up as a pool of stateless servers where computers can be
> dropped offline or brought online and the worst a user sees is a slow
> response or maybe a retry.
I may not agree with Geoff on this subject, depending on the definition of
the cluster referred to in the original post. If Geoff interpreted
Michael's post to mean IIS would be configured in a server cluster using
MSCS, then I agree with him. It is a waste of time and money.
However, if Michael meant NLB clustering, then I do not agree with Geoff on
this subject. IIS with NLB is a very good solution in that it:
1. Provides horizontal scaling of the application front end.
2. Provides high availability, in that users can fail over to a surviving
node in the NLB cluster in the event of a node failure.
Applications that do require state information often maintain that state in
the SQL database or in cookies on the client side.
Clustering IIS servers through NLB is not a waste of time or money if your
business depends on the availability of your web-based applications.
Russ Kaufmann
MVP - Windows Server - Clustering
http://www.clusterhelp.com - Cluster Website
http://msmvps.com/clusterhelp - New Blog
http://spaces.msn.com/members/russkaufmann - Old Blog
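As an aside on the state point above: the reason an NLB pool can lose members
without users noticing is that no individual node holds state that matters.
A minimal sketch of that pattern, using Python's standard-library sqlite3 as
a stand-in for the shared SQL database (the sessions table and the handler
name are hypothetical):

import sqlite3

# Any web node can run this handler, because session state lives in the
# shared database rather than in the memory of one particular server.

def handle_request(db_path, session_id):
    conn = sqlite3.connect(db_path)  # in production, the shared SQL back end
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, hits INTEGER)"
    )
    row = conn.execute(
        "SELECT hits FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    hits = (row[0] if row else 0) + 1
    conn.execute(
        "INSERT OR REPLACE INTO sessions (id, hits) VALUES (?, ?)",
        (session_id, hits),
    )
    conn.commit()
    conn.close()
    return f"session {session_id} has made {hits} requests"

# Two "different web servers" handling the same user see the same state.
print(handle_request("shared.db", "user-42"))
print(handle_request("shared.db", "user-42"))

Because nothing about the user lives in one node's memory, the load balancer
is free to send the next request to any surviving node after a failure.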
I was referring to clustering IIS using MSCS as a waste of time and money.
IIS scale-out should be done with NLB clustering or any other load-balancing
system. I am indifferent as to whether NLB or a third-party solution is
used. As always, which method is best for you will depend on your exact
situation and requirements. I have used NLB in the past and have found it
quite useful.
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
"Russ Kaufmann [MVP]" <russ@.exchangemct.com> wrote in message
news:eXxH1RswFHA.2656@.TK2MSFTNGP09.phx.gbl...
> "Geoff N. Hiten" <SRDBA@.Careerbuilder.com> wrote in message
> news:OZ4pC$fwFHA.3756@.tk2msftngp13.phx.gbl...
>
> I may not agree with Geoff on this subject depending on the definition of
> the cluster referred to in the original post. If Geoff interpretted
> Michael's post to mean IIS would be configured in a server cluster using
> MSCS, then I agree with him. It is a waste of time and money.
> However, if Michael meant NLB clustering, then I do not agree with Geoff
> on this subjet. IIS with NLB is a very good solution in that it does
> 1. Provide horizontal scaling of the application front end.
> 2. Provides high availability in that users can fail over to a surviving
> node in the NLB cluster in the event of a node failure.
> Applications that do require state information often maintain that state
> using the SQL database or cookies on the client side.
> Clustering IIS server through NLB is not a waste of time or money if your
> business depends on the availability of your web based applications.
>
> --
> Russ Kaufmann
> MVP - Windows Server - Clustering
> http://www.clusterhelp.com - Cluster Website
> http://msmvps.com/clusterhelp - New Blog
> http://spaces.msn.com/members/russkaufmann - Old Blog
>