Quoting from Evergrid's official announcement:
Evergrid, Inc., a provider of advanced quality of application service management for next generation datacenters, today announced its entry into the high performance computing market, with patent pending high availability and resource management software that lets massively parallelized distributed applications run at near 100 percent reliability on high performance computing (HPC) clusters.
…
The Evergrid software sits between the operating system and the applications, and captures the collective state of the application and its IO across all processors. By recording the state of the application, Evergrid is able to checkpoint and recover from failures rapidly with minimal overhead. The software also allows data centers to do preemptive scheduling of lower priority applications in favor of running higher priority applications, with little or no data lost. The software installs on Linux systems and requires no modifications to either the OS or application. It is scalable up to thousands of nodes at a time, with less than five percent performance overhead.
…
Evergrid’s new fault tolerant software prevents downtime by automating the checkpointing, migration and recovery of applications, thus offering automatic failover across multiple nodes and tiers. With Evergrid, even failure of multiple processors does not stop an application from functioning continuously. In addition, Evergrid’s efficient and robust management software provisions servers from bare metal up through the application and allows preemptive allocation of resources to high priority applications…
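To make the checkpoint-and-recover idea in the announcement a little more concrete, here is a minimal sketch in Python. It is not Evergrid's implementation: the announcement describes software that works transparently beneath unmodified applications across many nodes, whereas this toy does explicit, application-level checkpointing in a single process. All names (`save_checkpoint`, `load_checkpoint`, `run_sum`, the `fail_at` parameter) are hypothetical and exist only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path.

    The rename means a crash during the write never leaves a half-written
    checkpoint at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, fail_at=None, every=10):
    """Sum 0..n-1, saving a checkpoint every `every` steps.

    If `fail_at` is set, raise at that step to simulate a node failure.
    On the next call the work resumes from the last checkpoint.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum(100, path, fail_at=37)` raises partway through. A second call, `run_sum(100, path)`, resumes from the checkpoint taken at step 30 and repeats only the seven steps done since then, rather than starting from zero. The checkpoint interval (`every`) is the usual trade-off: more frequent checkpoints lose less work on failure but add more overhead.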