Redundancy and Fault Tolerance

When moving a system from a Development to a Production environment, the issues of redundancy and fault tolerance need to be addressed. From now on, I will refer to both items as DR (Disaster Recovery).

Typical System

Depending on the complexity of the system, a typical computer system implementation can be thought of as

  • Servers
    • Services
    • Data
    • Protocols
  • Network

The usual solution to DR is simply to duplicate each system.

So that becomes

   Active      Passive

   Servers     Servers
   Services    Services
   Protocols   Protocols
   Network     Network

This is fine in theory. However, if the Active system has been running for 1 year, the data on the Passive system will be totally out of date (this may include people's passwords, accounts, etc.).

So one solution would be to have an "Active/Active" Solution.

Active / Active

This is the best solution in terms of availability, but it requires a full duplication of all the servers and resources, so your hardware and software costs are all doubled.

Additionally, you usually have to implement a two-phase commit type solution for all transactions (unless you are using a message queue system). Two-phase commit usually requires a more expensive system or license. Plus you need to "roll forward" a server (replay the transactions it missed) after a group of servers has been shut down.
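To make the two-phase commit idea concrete, here is a minimal in-memory sketch. It is only an illustration: the `Participant` class and the `two_phase_commit` function are hypothetical names invented for this example, not the API of any real database or transaction manager, and a real implementation also has to handle timeouts, crashes between the phases and durable logging.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """One of the active sites (hypothetical in-memory stand-in)."""
    name: str
    healthy: bool = True
    committed: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, txn_id, key, value):
        # Phase 1: stage the write and vote. An unhealthy site votes "no".
        if not self.healthy:
            return False
        self.staged[txn_id] = (key, value)
        return True

    def commit(self, txn_id):
        # Phase 2 (commit path): make the staged write visible.
        key, value = self.staged.pop(txn_id)
        self.committed[key] = value

    def rollback(self, txn_id):
        # Phase 2 (abort path): discard anything staged for this transaction.
        self.staged.pop(txn_id, None)


def two_phase_commit(sites, txn_id, key, value):
    """Apply the write on every site, or on none of them."""
    votes = [site.prepare(txn_id, key, value) for site in sites]
    if all(votes):
        for site in sites:
            site.commit(txn_id)
        return True
    for site in sites:
        site.rollback(txn_id)
    return False
```

With two healthy sites the write lands on both. If either site votes "no", neither site keeps the write, which is what stops the two active systems from drifting apart.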

Active / Passive (Non VM)

You have a full replacement system (maybe of a slightly lesser specification), and as long as you can restore the data in a timely manner, you can run in a reduced mode (fewer users, or longer access times).

The success of this solution requires

  • Both Systems to be kept 100% in sync
  • Recent Data copies to be available
  • Restore time of the data to be hours not days

This is cheaper than Active-Active, but carries more risk that the DR system will not be 100% functional.
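The three conditions above can be turned into a simple readiness check. This is a rough sketch under stated assumptions: `DrStatus` and `dr_problems` are hypothetical names, and the 24-hour limits are placeholder values I picked to stand for "recent" and "hours not days", so adjust them to your own requirements.

```python
from dataclasses import dataclass


@dataclass
class DrStatus:
    config_in_sync: bool      # passive system matches the active one
    backup_age_hours: float   # age of the newest usable data copy
    restore_hours: float      # measured time to restore that copy


def dr_problems(status, max_backup_age_hours=24.0, max_restore_hours=24.0):
    """Return the unmet conditions; an empty list means DR looks ready."""
    # Both limits are assumed defaults, not values from any standard.
    problems = []
    if not status.config_in_sync:
        problems.append("systems out of sync")
    if status.backup_age_hours > max_backup_age_hours:
        problems.append("data copy too old")
    if status.restore_hours > max_restore_hours:
        problems.append("restore too slow")
    return problems
```

For example, `dr_problems(DrStatus(True, 6, 4))` returns an empty list, while a passive system that is out of sync and has a three-day-old data copy is flagged on both counts.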

Active / Passive (VM)

Here we slightly change the hardware (although, to be fair, the previous solutions could also use this approach). Instead of storing the data on a disk inside the machine, we store it on a SAN/NAS.

We access this storage using Fibre Channel (FC): the SAN has FC ports and the servers have FC cards.

Furthermore, we store the server images (these are virtual images) on the NAS drive.

If the system has 10 physical servers, we provision hardware for 15 servers (5 spares, allowing for a massive 50% failure rate). Then, as a server fails, we can move it in real time to a spare slot on the VmServer.
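The arithmetic and the failover step can be sketched as follows. This is only an illustration: `spare_ratio` and `fail_over` are hypothetical helper names, the placement is a plain dictionary of VM name to host name, and no real hypervisor API is involved.

```python
def spare_ratio(required_hosts, provisioned_hosts):
    """Spare capacity as a fraction of the hosts actually needed."""
    return (provisioned_hosts - required_hosts) / required_hosts


def fail_over(placement, spare_hosts, failed_host):
    """Move every VM on failed_host to a spare host.

    Returns the new placement and the spare hosts that are still free.
    """
    new_placement = dict(placement)
    spares = list(spare_hosts)
    for vm, host in placement.items():
        if host == failed_host:
            if not spares:
                raise RuntimeError("no spare host left for " + vm)
            # The image lives on shared storage, so only the host changes.
            new_placement[vm] = spares.pop(0)
    return new_placement, spares
```

With 10 required hosts and 15 provisioned, `spare_ratio(10, 15)` gives 0.5. Each failure uses up one spare, so the system can absorb five host failures before it runs out of places to move a server.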

This has several advantages

  • Not all the HW is duplicated