In essence, Application Checkpointing is a rollback recovery method that saves a program's state at regular points during execution. By preserving the state of a process during failure-free execution, it allows the system to be restored to a consistent state after a failure, thereby achieving fault tolerance.
High-Performance Computing (HPC) systems use fault-tolerance mechanisms provided by Application Checkpointing while performing computations. Checkpointing is also used in Grid systems with heterogeneous resources to make them more efficient and reliable. With the changing nature of computational infrastructure, now is a critical time to embrace Application Checkpointing as a necessary component of long-running computational projects.
In this guide, we will cover the basics you need to know about Checkpointing and Application Checkpointing: concepts, benefits, types, and uses. We'll go through how it works and how its variants compare to one another.
What Is Checkpointing?
As you read a book, you use a bookmark to mark your progress and save your place for the next time you pick it up. The bookmark becomes your checkpoint: a known position you can return to.
Similar to bookmarks, Checkpointing is the process of saving the computational state of a program external to it, i.e., to an external storage disk. This checkpoint acts as a stamp for the application to resume computation if stopped, or recover in the event of an error. This brings in minimal loss of computing work and allows the computation to be restarted from its preserved state.
At every level of the system, a checkpoint can be created using a variety of approaches, ranging from leveraging unique hardware/architectural checkpoint features to modifying the user’s source code. There is a spectrum of techniques for users to introduce checkpoints, and more will be discussed later in this article.
Checkpointing is classified into four categories:
- Hardware-level, where additional hardware is placed into the processor to preserve its state.
- Kernel-level, where the Operating System (OS) is largely held responsible for checkpointing running programs.
- User-level, where a checkpointing library is linked to a program and checkpoints the application without the programmer's intervention.
- Application-level, where the application saves and restores its own state, typically with programmer or compiler assistance.
What Is Application Checkpointing?
Application Checkpointing is the process of saving a snapshot of your application's state. It is especially useful for applications that run for extended periods of time, like protein folding codes using ab initio methods in computational biology or phase space solutions in applied mathematics. These applications require weeks or months of computing, even on the fastest available computers.
In such long-running processes, producing error-free results is a must; you wouldn't want your machine to run for months only to produce the wrong output. This is where Application Checkpointing helps: it periodically pauses the calculation, copies all the required data from memory to external storage, and then resumes execution, a process also called Checkpoint and Restart (CPR).
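As a rough illustration of the CPR cycle (the file name, function names, and toy computation here are ours, not any particular library's API), a long-running loop might checkpoint itself like this:

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # illustrative checkpoint file name

def load_checkpoint():
    """Resume from a saved state if a checkpoint exists, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}

def save_checkpoint(state):
    """Copy the required in-memory state to external storage."""
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)  # atomic swap: no torn checkpoint

def run(n_steps, checkpoint_every=100):
    state = load_checkpoint()             # a restart picks up where we left off
    while state["i"] < n_steps:
        state["total"] += state["i"]      # the long-running "computation"
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(state)        # pause, copy, resume
    return state["total"]
```

Killing this process at any point and rerunning it resumes from the last checkpoint instead of starting from scratch.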
Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 100+ Data Sources straight into your Data Warehouse or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!
Get Started with Hevo for Free
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
What Are the Benefits of Application Checkpointing?
Sometimes, the program's execution time might exceed the hardware's mean time between failures. At that point, it is logical to assume that if the application can't be resumed from some intermediate point in the processing, it'll never finish. Application Checkpointing ensures resilience in the event of a failure by creating checkpoints from which the application can resume, instead of having to start from scratch.
System-Level Checkpoints (SLC) (Kernel-level), which are essentially core-dump-style snapshots of the computational state of the machine, have a major disadvantage: they are machine-specific and can't be restored on a platform different from the one they were created on.
On the other hand, in Application Checkpointing (Application-level), applications can save and restore their own state periodically. They can Self-Checkpoint and Self-Restart, which eliminates their excessive dependence on SLC. Since these are created by applications themselves and aren’t specific to a machine or operating system, they can be restarted on different platforms.
Not only this, Application-Level Checkpoints are generally much smaller than System-Level Checkpoints. In one example, a protein folding Application-Level Checkpoint on the IBM Blue Gene machine occupied a few megabytes, whereas a full System-Level Checkpoint occupied a few terabytes.
To conclude, if you work with a lot of data involving a lot of computation, Application Checkpointing is the way to go.
Types of Application-Level Checkpointing
In the following sections, we describe three different types of Application-Level Checkpointing techniques: Single-Threaded, Shared-Memory (Multi-Threaded), and Grid/Cluster-Enabled. Each of these categories has its own set of benefits and challenges, with each one building upon the previous.
Single-Threaded Application-Level Checkpointing
Single-threaded applications execute one task at a time. Compared to multi-threaded applications, they add less checkpointing overhead but deliver lower application performance.
Various techniques are used to store the state of variables in these applications. One example is the use of pre-processor-generated macros to store global variables. To manage stack variables, the pre-processor encapsulates all local (stack) variables within a particular scope into a structure.
The application itself makes calls to additional functions whenever they are needed. To rebuild the execution stack, a data structure called the "state stack" is used. The state stack contains three fields, one of which is a pointer to a function that saves the state of local variables.
To restore the application, the technique from Application-Level Checkpointing Techniques for Parallel Programs is employed: before each function call (except checkpoints), a label is inserted that allows a restored application to skip unnecessary computation and jump directly to the next function call.
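The state-stack idea can be illustrated with a toy Python sketch. All names here (the frame layout, `checkpoint`, `long_computation`) are our own inventions for illustration; the real machinery is generated by the pre-processor:

```python
# Toy model of the "state stack": each frame records the label of the current
# call site plus the saved local variables (the job of the saver function the
# pre-processor generates). On restart, the frame rebuilds the locals and the
# label lets execution jump past work that is already done.

state_stack = []  # one frame per checkpoint, in program order

def checkpoint(label, local_vars):
    """Stand-in for the generated saver: record call-site label and locals."""
    state_stack.append({"label": label, "locals": dict(local_vars)})

def long_computation(resume_frame=None):
    if resume_frame is not None:
        total = resume_frame["locals"]["total"]  # restore saved locals
        start = resume_frame["label"] + 1        # jump to the next call site
    else:
        total, start = 0, 0
    for label in range(start, 5):                # labels 0..4 mark call sites
        total += label * label                   # the actual work
        checkpoint(label, {"total": total})
    return total

full = long_computation()                        # uninterrupted run
# Pretend we crashed after the checkpoint at label 2 (total was 0+1+4 = 5):
resumed = long_computation({"label": 2, "locals": {"total": 5}})
```

The restored run redoes only the remaining phases yet arrives at the same result as the uninterrupted one.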
A common problem with Application Checkpointing, especially Heterogeneous Checkpointing, is handling pointers and other dynamically assigned data. To handle such matters, Karablieh, Bazzi, and Hicks have proposed a solution using a memory refractor in their paper—Compiler-assisted heterogeneous checkpointing. A memory refractor is essentially an array of data structures containing global variables, stack, and heap variables, that can effectively abstract the specific memory location from the particular variable by introducing an index.
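A minimal Python model of the memory-refractor idea (class and method names are ours, not from the paper): "pointers" become indices into an indexed table of variables, so they stay meaningful when the checkpoint is restored on different hardware.

```python
# Toy model of the memory refractor: variables live in an indexed table, and
# "pointers" are stored as indices rather than machine addresses, abstracting
# the specific memory location away from the particular variable.

class MemoryRefractor:
    def __init__(self):
        self.cells = []              # holds global, stack, and heap variables

    def alloc(self, value):
        self.cells.append(value)
        return len(self.cells) - 1   # a portable "pointer": just an index

    def deref(self, index):
        return self.cells[index]

    def store(self, index, value):
        self.cells[index] = value

refractor = MemoryRefractor()
p = refractor.alloc(42)              # p is an index, not an address
q = p                                # aliasing still works across "pointers"
refractor.store(q, 43)
```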
There's one more aspect of Single-Threaded Application-Level Checkpointing called fast-forwarding, which was introduced by Szwed et al. Fast-forwarding is somewhat comparable to checkpoint/restart: it is a technique in which users may "fast-forward" to more interesting parts of the code by using native execution. This method makes use of the Cornell Checkpoint Compiler (C3), which uses a "position stack" that works very similarly to a "state stack." C3 has its own versions of the standard C memory functions that restore variables and pointers uniquely.
Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo's automated, No-code platform empowers you with everything you need for a smooth data replication experience.
Check out what makes Hevo amazing:
- Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
- Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Shared-Memory Application-Level Checkpointing
Shared-Memory Application-Level Checkpointing targets shared-memory applications and can be performed using either of two techniques: a variation of the C3 scheme, or MigThread.
The C3 variation uses OpenMP multi-threading compiler directives. This method stores variables in a manner similar to that described for Single-Threaded Application-Level Checkpointing. In both techniques, C3 uses a memory pool to restore pointer variables to their original locations.
In the other technique, MigThread, Application Checkpoints are inserted by the programmer. It is the programmer's responsibility to ensure that the program remains correct when inserting Application Checkpoints.
MigThread's distinguishing features are its heterogeneity and its ability to checkpoint/restart individual threads on remote (possibly heterogeneous) machines. The key to that heterogeneity is a data conversion technique called Coarse-Grain Tagged Receiver Makes Right (CGT-RMR). This technique allows the sender (or checkpoint initiator) to save the checkpoint data in its own native format, a capability not present in the C3 system.
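A toy Python sketch of the "receiver makes right" idea (function names and the integer payload are ours; the real technique operates on tagged C data structures):

```python
# The checkpoint initiator writes data in its own native byte order and tags
# the checkpoint with that format; the restoring machine decodes using the
# tag, paying the conversion cost only when the two formats differ.
import struct
import sys

def save_native(values):
    """Pack integers in the sender's native byte order, tagged with that order."""
    tag = "<" if sys.byteorder == "little" else ">"
    return tag, b"".join(struct.pack(tag + "i", v) for v in values)

def restore(tag, payload):
    """The receiver 'makes right': decode using the sender's tag."""
    n = len(payload) // 4
    return list(struct.unpack(tag + str(n) + "i", payload))
```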
MigThread also supports Single Thread Checkpoints. This is done by calling the application on the target/recovery machine and specifying that a particular thread needs to be recovered. The thread specified in the configuration file is restored independently of other threads. This is especially useful in load balancing applications.
Grid/Cluster Application-Level Checkpointing
Grid technologies are the next generation of Distributed Computing, which provide several advantages like resource sharing, resource utilization, and improved computational speed to execute jobs requiring high processing power. A Grid comprises heterogeneous resources in the form of CPU, memory, storage, devices, instruments, and software applications that are connected to each other through various types of networks and platforms.
Despite the many advantages Grid systems offer, fault tolerance in Grid environments is a big challenge. The heterogeneous nature of resources, platforms, and networks can create faults in the Grid environment. Any interruption in job execution may require a complete restart if not checkpointed, wasting both time and resources, since Grid resource usage is typically charged.
In Grid/Cluster Application-Level Checkpointing, there are two typical coordinated checkpointing algorithms: the first utilizes a variation of the C3 system, and the second is XCAT3.
The first protocol targets message-passing systems and utilizes the C3 system. Here, a Coordinated Checkpointing Protocol is established that checkpoints all user processes together, using an initiator responsible for coordinating the protocol. It is the initiator's responsibility to begin the checkpoint when necessary.
A common problem faced when using these checkpointing message-passing systems is the arrival of late and early messages. A late message is one that crosses the checkpoint line in the forward direction: it is sent before the line but received after it. An early message is one that crosses the checkpoint line in the backward direction: it is sent after the line but received before it. Such messages create inconsistency in the global state, since on restart they can be lost or duplicated.
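The late/early distinction can be modeled by tagging each message with the sender's checkpoint count. A hypothetical classifier (names and the epoch convention are ours):

```python
# Each process counts its checkpoints ("epochs"); a message carries the
# sender's epoch at send time, and the receiver compares it with its own
# epoch at delivery to detect whether, and in which direction, the message
# crossed the checkpoint line.

def classify(sender_epoch, receiver_epoch):
    if sender_epoch < receiver_epoch:
        return "late"    # sent before the line, received after: crosses forward
    if sender_epoch > receiver_epoch:
        return "early"   # sent after the line, received before: crosses backward
    return "normal"      # sent and received within the same epoch
```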
To solve this problem, a 4-phase protocol for C3 is used, where the initiator sends a pleasecheckpoint message to all processes, indicating that a checkpoint should take place when each process reaches its next checkpoint location.
The second protocol, XCAT3, takes a somewhat different approach to Grid-based Application Checkpointing. At a high level, it consists of three steps:
- First, communication channels are emptied.
- Second, individual processes checkpoint their local state.
- Third, computation resumes.
This Checkpointing Protocol requires XCAT3 architecture along with some modifications, particularly in implementing the Application Coordinator.
The Application Coordinator essentially serves the same function as the initiator and adds Grid-specific functionality. In particular, the Coordinator provides references to the Grid File Storage Service. This is especially important in a Grid environment because a global file system like NFS isn't always available, and storing a checkpoint solely on local storage would be useless. As a result, a resilient storage mechanism is needed; this is provided via a "master storage service" in the XCAT3 system.
In the XCAT3 system, a checkpoint is initiated by a user who instructs the Application Coordinator to take a checkpoint. The Application Coordinator then sends each remote component a reference to an individual storage service on the master storage service as part of a storeComponentState message. Upon receiving the storeComponentState message, each remote component stores its complete computation state at the location given by the individual storage service.
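This flow can be sketched in a few lines of Python. Only the storeComponentState message name comes from the description above; the classes and other method names are our own, and channels are assumed to have been drained already:

```python
# Hypothetical sketch of the XCAT3-style flow: the coordinator hands each
# remote component a reference to its own slot on the master storage service,
# then asks it to store its complete state there.

class MasterStorageService:
    def __init__(self):
        self.slots = {}

    def individual_service(self, component_id):
        return component_id                     # a reference; here, just a key

    def store(self, ref, state):
        self.slots[ref] = state

class Component:
    def __init__(self, cid, state):
        self.cid, self.state = cid, state

    def store_component_state(self, storage, ref):
        # Save the complete computation state at the given location.
        storage.store(ref, dict(self.state))

class ApplicationCoordinator:
    """Plays the initiator's role, plus Grid-specific storage references."""
    def __init__(self, storage, components):
        self.storage, self.components = storage, components

    def checkpoint(self):
        for c in self.components:
            ref = self.storage.individual_service(c.cid)
            c.store_component_state(self.storage, ref)  # "storeComponentState"

storage = MasterStorageService()
components = [Component("a", {"x": 1}), Component("b", {"x": 2})]
ApplicationCoordinator(storage, components).checkpoint()
```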
This concludes our guide. We hope the ideas and explanations were clear and thorough, and that they helped you understand Application Checkpointing fully.
Application Checkpointing continues to be the focus of Checkpointing for the near future, whether it is Grid systems or High-Performance Computing Systems (HPC). This blog talked extensively about Cornell Checkpoint Compiler (C3) and its variations in Multi-Threaded Application-Level Checkpointing and Grid/Cluster Application-Level Checkpointing. We also discussed MigThread and XCAT3 systems specific to Multithreaded Applications and Grid Systems.
Hevo Data is an excellent ETL tool that is fast, scalable, reliable, and 100% accurate. It can transfer, transform and integrate data from 100+ sources (with 40+ free connectors) and can readily integrate your data from frequently used SaaS applications and databases to your chosen data warehouses like Google BigQuery, Snowflake, Amazon Redshift, or Firebolt, for analysis within minutes.
Visit our Website to Explore Hevo
Having a beneficial tool like Hevo that establishes a single source of truth for all of your company’s data will undoubtedly speed up your Data Analysis and Data Integration process and give a boost to your organization.
Why not give Hevo a try? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also check our unbeatable pricing and make a decision on your best-suited plan.
Post your comments on learning about Application Checkpointing, its benefits, uses, and types. We’d like to hear your thoughts and ideas.