Real-Time and Fault Tolerant Systems Essay



The Quest for Zero Downtime:

Since the dawn of the Internet, the need for application availability and reliability has steadily increased. This need is especially strong in the military, aerospace, and aircraft control industries, where any amount of downtime can have fatal consequences. The 1998 case study "NCAPS: Application High Availability in UNIX Computer Clusters," by Luiz A. Laranjeira, describes a specialized software system developed by Tandem Computers that runs on Unix computer clusters and provides a superior level of application availability. This essay offers a critique of the case study as well as of the software architecture and fault tolerance strategies used.

Design Goals

Application availability has improved immensely since the early days of the Internet. However, at the time of the case study, there was still a need to improve the recovery times of existing high availability solutions, especially for real-time critical applications. Recovery was slow, expensive, and unreliable, taking anywhere between one minute and an hour. The key design goal of the NCAPS system was therefore to provide continuous availability of real-time critical systems in the event of hardware, software, or operating system faults. By significantly shrinking the recovery times of large-scale applications, the NCAPS design could not only keep these vital systems up and running, but also help reduce the hefty costs associated with downtime.

System Architecture

NCAPS provides specialized system software that runs on a Unix computer cluster with two or more nodes. Two nodes is generally regarded as the minimum for a high availability cluster, since at least one node must remain available to take over when another fails. Additionally, the system can provide more rapid failover because it is based on a primary/backup scheme, where two instances of an application are running at the same time.

As described in the case study, the NCAPS software architecture includes the Node Status Monitor (NSM), the Keepalive (KpA), the Process Pairs Manager (PPM), the Open Fault Tolerance Library (OftLib), and the Command Line Interface (CLI). The NSM, KpA, and PPM are replicated on both nodes and interact through continuous monitoring and message communication. The NSM monitors the state of the two nodes. The KpA watches registered processes and uses a script to restart them in the case of failure.

Most importantly, the PPM is the core of the NCAPS system: it starts, monitors, and manages application processes using a process pairs paradigm. In addition, the PPM state model can be configured by the user, which is a key competitive advantage over other high availability software vendors.

Fault Tolerance Strategies Used

Redundancy

Redundancy has been defined as the duplication of critical components of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe (Answers.com, 2010). The two nodes of the NCAPS system offer redundancy because they mirror each other, always providing one node in primary status and the other in backup status. If one fails, the other is available to take over.

Redundancy in the NCAPS system can also be found in the NSM, where "heartbeats" are exchanged between the two NSMs: "When one NSM does not receive a configurable number of heartbeats from the other within a configurable period of time, it sends a node-down message to its subscribers (the PPM only). When the other node is restarted and the two NSMs resume exchanging heartbeats, the NSM sends a node-up message to its subscribers" (Laranjeira, 1998, p. 442).
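
The heartbeat exchange quoted above can be illustrated with a minimal Python sketch. It is not taken from the case study: the class name HeartbeatMonitor, the parameters interval and max_missed, and the injected timestamps are all assumptions made for illustration, and the silence window is approximated as interval times max_missed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HeartbeatMonitor:
    """Toy stand-in for one NSM watching its peer (all names are hypothetical)."""
    interval: float      # assumed seconds expected between heartbeats
    max_missed: int      # configurable number of missed heartbeats tolerated
    last_seen: float = 0.0
    peer_up: bool = True
    subscribers: List[Callable[[str], None]] = field(default_factory=list)

    def _notify(self, message: str) -> None:
        # Deliver the message to every subscriber (the PPM, in the case study).
        for callback in self.subscribers:
            callback(message)

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat; announce node-up if the peer had been down."""
        self.last_seen = now
        if not self.peer_up:
            self.peer_up = True
            self._notify("node-up")

    def check(self, now: float) -> None:
        """Announce node-down once the silence exceeds the allowed window."""
        window = self.interval * self.max_missed
        if self.peer_up and now - self.last_seen > window:
            self.peer_up = False
            self._notify("node-down")


events = []
nsm = HeartbeatMonitor(interval=1.0, max_missed=3, subscribers=[events.append])
nsm.on_heartbeat(now=0.0)
nsm.check(now=2.0)         # still inside the window, nothing is sent
nsm.check(now=3.5)         # more than three intervals of silence
nsm.on_heartbeat(now=5.0)  # peer is back
print(events)
```

Passing the current time in as an argument, rather than reading a real clock, keeps the sketch deterministic; a real monitor would run the check on a timer.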

Always ready to switch from a backup to a primary state, the PPM provides redundancy as well: "One instance of the PPM and of the watched application run in each of two nodes of a cluster. In one node an instance of the application is in a primary state and is providing service. In the other node another instance of the application is in a backup state. A backup application is not providing service, but it is initialized and ready to take over in case of a failure of the primary application or of its node" (Laranjeira, 1998, p. 442).
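
A rough sketch of the primary/backup arrangement described in this passage follows. The Role enum, the ProcessPair class, and the node names are hypothetical; the rule that a restarted instance rejoins as backup is an assumption, not something the case study excerpt states.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"
    FAILED = "failed"


class ProcessPair:
    """Two application instances, one per node (node names are hypothetical)."""

    def __init__(self, node_a: str, node_b: str) -> None:
        self.roles = {node_a: Role.PRIMARY, node_b: Role.BACKUP}

    def primary(self):
        """Return the node currently providing service, or None."""
        for node, role in self.roles.items():
            if role is Role.PRIMARY:
                return node
        return None

    def fail(self, node: str) -> None:
        """Mark a node's instance as failed; promote the backup if needed."""
        was_primary = self.roles[node] is Role.PRIMARY
        self.roles[node] = Role.FAILED
        if was_primary:
            for other, role in self.roles.items():
                if role is Role.BACKUP:
                    self.roles[other] = Role.PRIMARY
                    break

    def restart(self, node: str) -> None:
        """Assumed rule: rejoin as backup, or as primary if nobody is serving."""
        self.roles[node] = Role.BACKUP if self.primary() else Role.PRIMARY


pair = ProcessPair("node1", "node2")
pair.fail("node1")       # the backup on node2 takes over
print(pair.primary())
pair.restart("node1")    # node1 comes back as the new backup
print(pair.roles["node1"])
```

Because the backup is already initialized, failover in this model is only a role change, which is the reason the primary/backup scheme can recover faster than a cold restart.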

Described as a highly available service implemented as a primary and shadow instance, the Keepalive component offers further redundancy: "These two instances send heartbeats to each other and share information through a memory mapped file. If the shadow instance dies, the primary restarts it. If the primary instance dies, the shadow instance becomes primary, takes control of the memory mapped file, and spawns another shadow instance" (Laranjeira, 1998, p. 442).
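
The primary/shadow behavior of Keepalive can be sketched in the same spirit. In this illustration a plain dictionary stands in for the memory mapped file, and the class name, instance identifiers, and registry contents are invented for the example.

```python
class KeepaliveService:
    """Toy model of the Keepalive primary/shadow pair (names are hypothetical)."""

    def __init__(self) -> None:
        self.shared_registry = {}  # stands in for the memory mapped file
        self.primary_id = 1
        self.shadow_id = 2
        self._next_id = 3

    def _spawn(self) -> int:
        # Hand out a fresh identifier for a newly started instance.
        new_id = self._next_id
        self._next_id += 1
        return new_id

    def instance_died(self, instance_id: int) -> None:
        if instance_id == self.shadow_id:
            # The primary restarts the shadow.
            self.shadow_id = self._spawn()
        elif instance_id == self.primary_id:
            # The shadow becomes primary and spawns another shadow.
            self.primary_id = self.shadow_id
            self.shadow_id = self._spawn()


svc = KeepaliveService()
svc.shared_registry["billing-app"] = "restart_billing.sh"  # hypothetical entry
svc.instance_died(1)   # primary dies: instance 2 is promoted, 3 is spawned
svc.instance_died(3)   # shadow dies: instance 4 replaces it
print(svc.primary_id, svc.shadow_id, svc.shared_registry)
```

The registry survives both failures because it lives outside either instance, which is the role the shared file plays in the quoted description.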

Fault/Error Isolation/Containment

Information on fault/error isolation and containment in the NCAPS system was not clearly disclosed in the case study. Regarding the PPM and application processes specifically, however, the following was stated: "In failure situations, the PPM executes a cleanup script and restarts the application to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted" (Laranjeira, 1998, p. 443). In addition, as discussed under Fault/Error Detection, the Hang Detection Service will "kill" an offending process if a hang is detected. Otherwise, almost no information was provided on how faults or errors are contained.

Fault/Error Detection

The case study gives little to no definition of the types of faults the system detects, whether transient, permanent, or intermittent. Based on background course material, this is a concern: "In an ultra-reliable system, it is essential to have error detection and recovery mechanisms designed to handle transient faults. These mechanisms must be able to distinguish transient faults from permanent or intermittent faults, so that when a transient fault is detected in a unit the unit is not discarded" (Course Objectives). This does not mean that the NCAPS system is unable to distinguish between the different types of faults; it simply raises questions, because this essential information was not provided in the case study.

Fault detection functionality in the NCAPS system can be found in the PPM: "When an application process fails, the PPM detects it and restarts it up to a maximum configurable number of times. After this threshold is exceeded, the next failure of that process will imply in a failure of the application" (Laranjeira, 1998, p. 443). Also, the Application Administration (AAD), a key component of the PPM, provides fault detection by mediating the interactions between the Application State Model (ASM) and the application. The AAD detects an application event, such as a failure, and directs it into the ASM. After a state change takes place, an ASM action triggers the AAD to send a state change command message to the application processes (Laranjeira, 1998, p. 446).
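
The restart threshold quoted above amounts to a small counter. The following sketch assumes a hypothetical RestartPolicy class and invented return strings; it only shows how a process failure escalates to an application failure once the configured number of restarts has been used up.

```python
class RestartPolicy:
    """Counts restarts of one process against a configurable maximum."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts
        self.restarts = 0

    def on_process_failure(self) -> str:
        """Return the action to take for this failure (labels are hypothetical)."""
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return "restart-process"
        # Threshold already used up: this failure counts against the application.
        return "application-failure"


policy = RestartPolicy(max_restarts=2)
outcomes = [policy.on_process_failure() for _ in range(3)]
print(outcomes)
```

With a maximum of two restarts, the first two failures are absorbed locally and the third is reported as an application failure, at which point a failover to the backup instance could be triggered.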

Part of the functionality provided by the Open Fault Tolerance Library (OftLib) is the Hang Detection Service (HDS), which detects faults that cause a process to "hang." Using heartbeats at specified time intervals, the HDS detects when a heartbeat is not received when expected and responds by simply ending the offending process. Keepalive then detects that the process no longer exists, and the appropriate recovery mechanisms are triggered (Laranjeira, 1998, p. 444).
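
The hang detection flow can be approximated as follows. This is a sketch under assumptions: the HangDetector class, the per-process deadline table, the process identifiers, and the use of a set of "alive" processes in place of real operating system processes are all invented for illustration.

```python
class HangDetector:
    """Toy HDS: kills any process whose heartbeat is overdue (names hypothetical)."""

    def __init__(self, timeout: float, kill) -> None:
        self.timeout = timeout  # assumed seconds allowed between heartbeats
        self.kill = kill        # callback that ends a process
        self.deadlines = {}     # pid -> time by which the next heartbeat is due

    def heartbeat(self, pid: int, now: float) -> None:
        self.deadlines[pid] = now + self.timeout

    def sweep(self, now: float) -> list:
        """Kill and forget every process that missed its deadline."""
        hung = [pid for pid, due in self.deadlines.items() if now > due]
        for pid in hung:
            self.kill(pid)
            del self.deadlines[pid]
        return hung


registered = {101, 102}        # processes Keepalive is watching
alive = set(registered)        # stand-in for the operating system's view
hds = HangDetector(timeout=2.0, kill=alive.discard)

hds.heartbeat(101, now=0.0)
hds.heartbeat(102, now=0.0)
hds.heartbeat(101, now=1.5)    # 101 keeps reporting, 102 goes silent
killed = hds.sweep(now=2.5)

missing = registered - alive   # what Keepalive would notice and recover
print(killed, missing)
```

The division of labor mirrors the description: the detector only ends the hung process, and recovery is left to whichever component notices that the process is gone.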

System Reconfiguration

Relatively little information was provided on system reconfiguration in the NCAPS system. Regarding the PPM, some functionality is configurable and allows users to define and execute their own scripts during a state change. For specific state changes, the user may therefore determine which actions should be applied, such as transferring resources in the event of application failover or triggering an alarm on a specified state change (Laranjeira, 1998, p. 443).
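
One way to picture user-defined actions on state changes is a table keyed by transition. The sketch below is an assumption about how such a mechanism could look, not the NCAPS implementation: the StateModel class, the state names, and the logged strings are hypothetical, and Python callables stand in for the user scripts.

```python
class StateModel:
    """Toy state model that runs user-registered actions on transitions."""

    def __init__(self, initial: str) -> None:
        self.state = initial
        self.actions = {}  # (old_state, new_state) -> list of callables

    def on_transition(self, old: str, new: str, action) -> None:
        """Register a user action for one specific state change."""
        self.actions.setdefault((old, new), []).append(action)

    def change_state(self, new: str) -> None:
        """Switch state and run whatever the user configured for this change."""
        old, self.state = self.state, new
        for action in self.actions.get((old, new), []):
            action()


log = []
model = StateModel(initial="backup")
model.on_transition("backup", "primary", lambda: log.append("transfer resources"))
model.on_transition("backup", "primary", lambda: log.append("raise alarm"))
model.on_transition("primary", "down", lambda: log.append("run cleanup"))

model.change_state("primary")  # failover: both configured actions run
print(model.state, log)
```

Keeping the actions in a lookup table means that new behavior can be attached to a state change without modifying the state model itself, which is the kind of configurability the essay credits to the PPM.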

System Recovery

System recovery within the NCAPS system is typically handled by the PPM or the Command Line Interface (CLI). With regard to the PPM, "In failure situations, the PPM executes a cleanup script and restarts the application up to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted" (Laranjeira, 1998, p. 443). In addition, the CLI gives system administrators the control they need to perform a range of operations, including the ability "to query the application's state or to manually cause the application to failover, become primary, reinitialize, inhibit the failover function, (when the application is in the backup state), un-inhibit the failover function, startup or shutdown" (Laranjeira, 1998, p. 444).

Conclusion

Overall, it appears that the NCAPS system is a highly effective solution that is built on a solid, logical architecture. From the primary/backup design to the PPM, NSM, and Keepalive components, redundancy is prevalent throughout the system and helps to provide high availability, resiliency, and security. Also, multiple components help monitor the system, applications, and application processes as well as allow communication between the various components. In addition, the PPM, AAD, and HDS can detect faults by monitoring system heartbeats, errors, and potential failures. System reconfiguration can be defined by the user and system recovery can be handled…


Cite This Essay:

APA Format

Real-Time and Fault Tolerant Systems.  (2010, April 25).  Retrieved January 27, 2020, from https://www.essaytown.com/subjects/paper/real-time-fault-tolerant-systems/458254

MLA Format

"Real-Time and Fault Tolerant Systems."  25 April 2010.  Web.  27 January 2020. <https://www.essaytown.com/subjects/paper/real-time-fault-tolerant-systems/458254>.

Chicago Format

"Real-Time and Fault Tolerant Systems."  Essaytown.com.  April 25, 2010.  Accessed January 27, 2020.
https://www.essaytown.com/subjects/paper/real-time-fault-tolerant-systems/458254.