Fault tolerance

Today there is demand for highly secure virtual networks in which any resource from any cluster can be shared, even when part of the system has failed. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it targets large-scale systems that cross organizational boundaries. In addition to the challenges of managing and programming grid applications, reliability issues arise from the unreliable nature of the network infrastructure. A failure can be caused by a link failure, a resource failure, or some other fault, and it must be tolerated for the system to operate smoothly and accurately. Such faults can be detected and repaired with a range of techniques. An appropriate fault detector can avoid losses due to system crashes, and a reliable fault tolerance technique can protect the system from failures. Fault tolerance is an important property for achieving reliability, availability, and QoS.
The fault tolerance mechanism used here sets job checkpoints based on the failure rate of the resources. If a resource fails, the job is restarted from its last successful state, using a checkpoint file, on another resource in the grid. Selecting optimal checkpoint intervals is important for minimizing application execution time in the presence of failures. When a resource fails, the failure index-based rescheduling algorithm moves the job from the failed resource to another available resource with the minimum failure index value and resumes the job from the last saved checkpoint. This ensures that the job completes within its deadline with higher throughput, and it helps make the grid environment reliable.
Grid computing is a term that refers to the combination of computing resources from multiple administrative domains to achieve a common goal.
The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. While a grid may be dedicated to a specialized application, it is more common for a single grid to be used for a variety of purposes. Grids are often built with the help of general-purpose grid software libraries known as middleware. A grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations. Managing these resources is an important part of the infrastructure in a grid computing environment.
To exploit the potential of computational grids, fault tolerance is of paramount importance because resources are geographically distributed. Furthermore, the probability of failure is much higher than in traditional parallel computing, and a resource failure can be fatal to job execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, and it makes the system more reliable. Fault tolerance services are essential for meeting QoS requirements in grid computing and for dealing with the various types of resource failures, which include process failures and network failures.
One of the important parameters in a checkpointing system that provides fault tolerance is the checkpointing interval, that is, the period between checkpoints of the application state. Smaller checkpoint intervals increase the cost of running the application because of checkpoint overhead, while larger checkpoint intervals increase the time needed to recover from failures. The optimal checkpoint interval, the one that leads to the minimum application execution time in the presence of failures, should therefore be determined.
PROBLEMS:
1. If a failure occurs on one grid resource, the job is rescheduled on another resource, with the end result that the user's QoS requirements, i.e. the deadline, are not met. The reason is simple: when the job is rerun, it takes longer.
2. In computational grid environments, there are resources that satisfy the deadline constraint but have a tendency to fail. In such a scenario, the grid scheduler still selects the same resource, for the simple reason that the resource promises to satisfy the requirements of the users' grid jobs. This ultimately compromises the user's QoS parameters for completing the job.
3. A running task should complete within its deadline even if there is a fault in the system. The deadline is the main concern in a real-time system, because a task that does not finish before its deadline has no value.
4. Availability: real-time distributed systems must keep end-to-end services available and be able to withstand systematic failures or attacks without impacting customers or operations.
5. Scalability: the system must be able to handle an increasing amount of work and to increase its total throughput under a greater load as resources are added.
In this scenario, the adaptive checkpoint fault tolerance approach is used to overcome the drawbacks mentioned above. In this approach, information about failure occurrences is maintained for each resource. When a failure occurs, the failure occurrence information for that resource is updated. This information is then used when deciding which resources to assign to a job.
Checkpointing is one of the most popular techniques for providing fault tolerance on unreliable systems. A checkpoint is a recorded snapshot of the entire system state that is used to restart the application after a failure occurs. The checkpoint can be stored in temporary or stable storage. However, the efficiency of the mechanism strongly depends on the length of the checkpoint interval. Frequent checkpointing can increase overhead, while infrequent checkpointing can lead to the loss of a significant amount of computation.
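The trade-off between checkpoint overhead and lost computation can be illustrated with a small calculation. The sketch below uses Young's first-order approximation, a well-known estimate from the checkpointing literature that this essay does not itself name; it is swapped in here purely for illustration. The function names, and the choice of mean time between failures (MTBF) and per-checkpoint cost as inputs, are assumptions.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval, in seconds.

    checkpoint_cost_s: time needed to write one checkpoint (assumed input).
    mtbf_s: mean time between failures of the resource (assumed input).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead_fraction(interval_s: float,
                               checkpoint_cost_s: float,
                               mtbf_s: float) -> float:
    """First-order fraction of run time lost to fault tolerance.

    The first term is the cost of writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values, not measurements
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, expected_overhead_fraction(interval, cost, mtbf))
```

Under this simplified model, intervals shorter than the estimate pay more for checkpoint writes and longer ones pay more for rework, which matches the qualitative argument above. A real system would need measured checkpoint costs and failure statistics.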
Therefore, deciding on the checkpoint interval and the checkpointing technique is a complicated task that should be based on knowledge of both the application and the system. Checkpoint recovery depends on the MTTR of the system. Checkpointing periodically saves the application state to a stable storage device, usually a hard drive. After a crash, the application restarts from the last checkpoint instead of from the beginning.
There are three checkpointing strategies: coordinated checkpointing, uncoordinated checkpointing, and communication-induced checkpointing. In coordinated checkpointing, processes synchronize their checkpoints to ensure that the saved states are consistent with each other, so that the overall combined saved state is also consistent. In contrast, in uncoordinated checkpointing, processes schedule checkpoints independently at different times and do not take messages into account. Communication-induced checkpointing attempts to coordinate only selected critical checkpoints.
Comparative analysis of existing techniques:
A grid resource is a member of a grid and offers computing services to the users of the grid. Grid users register with the grid's Grid Information Server (GIS) by specifying QoS requirements such as the deadline for completing execution, the number of processors, the operating system type, and so on. The components used in the architecture are described below.
Scheduler: The scheduler is an important entity of a grid. It receives jobs from grid users, selects feasible resources for those jobs based on the information acquired from the GIS, and then generates mappings between jobs and resources. When the schedule manager receives a grid job from a user, it gets the details of the available grid resources from the GIS. It then passes the list of available resources to the entities in the MTTR scheduling strategy. The Matchmaker entity matches resources against job requirements.
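The matchmaking step can be sketched as a simple filter over the resources reported by the GIS. This is a minimal illustration, not the essay's implementation: the `Resource` and `Job` types, their fields, and the `matchmake` function are hypothetical names, and only two of the QoS requirements mentioned above (number of processors and operating system type) are checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resource:
    """Hypothetical view of one grid resource as reported by the GIS."""
    name: str
    processors: int
    os_type: str


@dataclass(frozen=True)
class Job:
    """Hypothetical job description carrying the user's requirements."""
    job_id: str
    processors_needed: int
    os_type: str


def matchmake(job: Job, resources: List[Resource]) -> List[Resource]:
    """Return the resources that satisfy the job's requirements.

    A resource matches when it has at least the requested number of
    processors and runs the requested operating system type.
    """
    return [
        r for r in resources
        if r.processors >= job.processors_needed and r.os_type == job.os_type
    ]


if __name__ == "__main__":
    available = [
        Resource("r1", processors=4, os_type="linux"),
        Resource("r2", processors=16, os_type="linux"),
        Resource("r3", processors=32, os_type="windows"),
    ]
    job = Job("j1", processors_needed=8, os_type="linux")
    print([r.name for r in matchmake(job, available)])
```

Returning a list rather than a single resource keeps matchmaking separate from the later selection step, so other criteria can be applied to the matching resources afterwards.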
The ResponseTime Estimator entity estimates the response time of the job on each matching resource based on the transfer time, the queue wait time, and the service time of the job. The resource selector selects the resource with the minimum response time. A job dispatcher sends jobs one by one to the checkpoint manager.
GIS: The GIS contains information about all available grid resources. It maintains resource details such as processor speed, available memory, load, and so on. All grid resources that join and leave the grid are tracked by the GIS. Whenever a scheduler has jobs to execute, it consults the GIS to obtain information about the available grid resources.
Checkpoint Manager: The checkpoint manager receives the scheduled job from the scheduler and sets checkpoints based on the failure rate of the resource on which the job is scheduled. It then sends the job to the resource. The checkpoint manager receives either a job completion message or a job failure message from the grid resource and responds accordingly. If a failure occurs during execution, the job is rescheduled from the last checkpoint instead of being run from scratch. The checkpoint manager implements the algorithm that sets job checkpoints.
Checkpoint Server: At each checkpoint set by the checkpoint manager, the state of the job is reported to the checkpoint server. The checkpoint server saves the job state and returns it on request, i.e. during a job or resource failure. For a particular job, the checkpoint server discards the previous checkpoint result when a new checkpoint result is received.
Fault Index Manager: The fault index manager maintains the fault index value of each resource, which indicates the failure rate of the resource. The fault index of a grid resource is incremented whenever the resource does not complete its assigned job within the deadline and also whenever the resource fails. The fault index of a resource is decremented each time the resource completes its assigned job by the deadline.
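The fault index bookkeeping, together with the rule stated earlier that a failed job is moved to the available resource with the minimum fault index, can be sketched as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the step size of 1 and the floor at 0 are illustrative choices that the essay does not specify, and ties are broken by list order.

```python
from typing import Dict, Iterable, Optional


class FaultIndexManager:
    """Hypothetical tracker of a per-resource fault index."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def fault_index(self, resource: str) -> int:
        """Return the current fault index (0 for an unseen resource)."""
        return self._index.get(resource, 0)

    def record_outcome(self, resource: str, completed_by_deadline: bool) -> None:
        """Update the index after a job finishes, fails, or misses its deadline.

        Assumption: the index moves in steps of 1 and never drops below 0.
        """
        current = self.fault_index(resource)
        if completed_by_deadline:
            self._index[resource] = max(0, current - 1)
        else:
            # A resource failure and a missed deadline are treated the same.
            self._index[resource] = current + 1

    def pick_for_reschedule(self, available: Iterable[str],
                            failed: str) -> Optional[str]:
        """Choose the available resource with the lowest fault index.

        The failed resource is excluded; None means nothing else is available.
        """
        candidates = [r for r in available if r != failed]
        if not candidates:
            return None
        return min(candidates, key=self.fault_index)


if __name__ == "__main__":
    manager = FaultIndexManager()
    manager.record_outcome("r1", completed_by_deadline=False)
    manager.record_outcome("r1", completed_by_deadline=False)
    manager.record_outcome("r2", completed_by_deadline=False)
    manager.record_outcome("r3", completed_by_deadline=True)
    print(manager.pick_for_reschedule(["r1", "r2", "r3"], failed="r1"))
```

In a fuller design the rescheduled job would then be resumed from the state held by the checkpoint server; that step is left out here to keep the sketch focused on the index itself.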
The fault index manager updates the fault index of a grid resource using the fault index update algorithm.
Checkpoint Replication Server: When a new checkpoint is created, the Checkpoint Replication Server (CRS) replicates the newly created checkpoints on remote resources by applying RRSA. Once a checkpoint has been replicated, the details are stored in the Checkpoint Server. To obtain information about all checkpoint files, the Replication Server queries the Checkpoint Server. Throughout the application's run time, the CRS monitors the Checkpoint Server for the latest checkpoint versions. Information about available resources, hardware, memory, and bandwidth is obtained from the GIS. The NWS and Ganglia tool is used for.