Using a Data Access Layer in HPC

Posted by Scott Jeschonek on Fri, Feb 03, 2017 @ 10:55 AM

High-performance computing (HPC) in the cloud is becoming more and more common as companies realize its potential to extend their HPC environments without significant additional infrastructure costs. HPC clusters bring hundreds or sometimes even thousands of cores online quickly and efficiently, a perfect fit for the massive compute farms managed by cloud service providers (CSPs) like Google Cloud Platform and Amazon Web Services. Very large clusters can be created on demand. You aren’t restricted to one cluster either; you can create any number of clusters, empowering different teams without causing resource contention.

In addition, CSPs optimize networking within their compute environments, which minimizes latency between nodes and storage. Google, for example, runs a software-defined networking environment that offers very low latency between virtual machines and disks. This kind of networking delivers compute performance that is difficult to match in many private data center environments. That's another big win for HPC in the cloud.

When it comes to cost, the cloud may at times seem more expensive, but you have to factor in two critical facts: you can expand without buildout costs, and your operational effort does not scale linearly with the amount of cloud resources you use. Using APIs, you can build tools that create 10 or 10,000 compute nodes, fully networked and attached to disks. Often that tool is a simple Python script. You can then delete those same compute nodes just as easily.
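
To make the "simple script" idea concrete, here is a minimal sketch that builds the `gcloud compute instances create` and `delete` command lines for a fleet of workers. It only builds the commands and does not run them. The name prefix, zone, machine type, and batch size of 100 are assumptions chosen for illustration, and a real tool would also handle disks, networks, quotas, and errors.

```python
def node_names(prefix, count):
    """Return zero-padded instance names such as 'hpc-worker-00007'."""
    return [f"{prefix}-{i:05d}" for i in range(count)]


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_commands(names, zone, machine_type, preemptible=True, batch_size=100):
    """Build one 'gcloud compute instances create' argument list per batch."""
    commands = []
    for batch in chunks(names, batch_size):
        cmd = ["gcloud", "compute", "instances", "create", *batch,
               "--zone", zone, "--machine-type", machine_type]
        if preemptible:
            cmd.append("--preemptible")
        commands.append(cmd)
    return commands


def delete_commands(names, zone, batch_size=100):
    """Build the matching delete commands; --quiet skips the confirmation prompt."""
    return [["gcloud", "compute", "instances", "delete", *batch,
             "--zone", zone, "--quiet"]
            for batch in chunks(names, batch_size)]


# Hypothetical fleet: the prefix, zone, and machine type are illustrative only.
names = node_names("hpc-worker", 10_000)
creates = create_commands(names, "us-central1-f", "n1-standard-8")
deletes = delete_commands(names, "us-central1-f")
print(len(creates), "create commands and", len(deletes), "delete commands")
```

Each argument list could be handed to `subprocess.run` once you are ready to spend money. Changing 10,000 to 10 is the only edit needed to shrink the fleet, which is the point about operations not scaling linearly.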

Further, cloud providers hate when capacity goes unused, so they’ll offer you a discount on spare capacity. In Google Compute Engine, these are referred to as preemptible instances. Preemptible instances cost 80% less than regular virtual machines (VMs). The pricing is flat, and the instances offer the same performance as regular VMs. In Amazon EC2, the equivalent is called spot instances, which work on a bidding/auction model. The flat, predictable pricing offered by Google makes these virtual machines ideal for batch, grid, and fault-tolerant workloads like those found in HPC environments: Spark, Monte Carlo simulations in financial analysis, GATK for genomics, and rendering with tools like Autodesk Maya and Houdini.
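
A quick back-of-the-envelope calculation shows what that discount means for a batch run. In the sketch below, the hourly price, node count, run length, and the share of work redone after preemptions are all hypothetical numbers I made up for illustration; only the 80% discount comes from the paragraph above.

```python
def batch_cost(node_count, hours, hourly_price, discount=0.0, rerun_fraction=0.0):
    """Cost of a batch run.

    rerun_fraction is the (hypothetical) share of work repeated because
    nodes were preempted mid-task; 0.1 means 10% of the work runs twice.
    """
    billed_hours = node_count * hours * (1.0 + rerun_fraction)
    return billed_hours * hourly_price * (1.0 - discount)


REGULAR_PRICE = 0.40          # hypothetical $/VM-hour, not a published price
PREEMPTIBLE_DISCOUNT = 0.80   # the 80% figure quoted above

regular = batch_cost(1000, 6, REGULAR_PRICE)
preemptible = batch_cost(1000, 6, REGULAR_PRICE,
                         discount=PREEMPTIBLE_DISCOUNT, rerun_fraction=0.10)

print(f"regular VMs:     ${regular:,.2f}")
print(f"preemptible VMs: ${preemptible:,.2f}")
```

Even with a tenth of the work repeated, the preemptible run in this made-up scenario costs a fraction of the regular one, which is why fault-tolerant workloads are such a good match.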

Several options exist for putting HPC workloads into the cloud. You could do a “Pure” Cloud HPC approach where jobs are run completely within cloud compute services and utilize cloud storage. However, the reality is that most organizations have significant investments in on-premises servers and storage, and they have fine-tuned these infrastructures for their workloads. Hybrid HPC offers the best of both worlds — the ability to manage existing environments and expand infrastructure into the cloud as needed.

Data Access Challenges

Regardless of which model you choose, data latency will prove to be a challenge. This latency is actually amplified in the cloud because:

  • It can be difficult to locate all of the needed data next to the worker nodes, whether that data is in cloud storage or in your NetApp, EMC, or ZFS boxes.
  • Moving data to solve for this is time consuming. Do you copy everything to the cloud? What if your application only uses specific ranges of blocks of a rather large file?
  • Pipelines may require multiple writes of results, which either introduce consistency risk in local storage or introduce latency as they reach back to on-prem NAS.
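
The "specific ranges of blocks" point is easy to see in code. The sketch below reads two small byte ranges out of a file with `seek` and `read` instead of copying the whole thing. The 1 MiB temporary file is just a stand-in for a large data set, and the offsets are arbitrary.

```python
import os
import tempfile


def read_range(path, offset, length):
    """Return up to `length` bytes starting at `offset`, without reading the rest."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Throwaway 1 MiB file standing in for a much larger data set.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    demo_path = tmp.name

try:
    wanted = [(0, 16), (512 * 1024, 16)]   # (offset, length) pairs, arbitrary
    touched = sum(len(read_range(demo_path, off, n)) for off, n in wanted)
    total = os.path.getsize(demo_path)
    print(f"application touched {touched} of {total} bytes")
finally:
    os.remove(demo_path)
```

If the job only ever touches a few ranges like this, copying the entire file to the cloud first means paying transfer time and storage for data nobody reads.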

Your organization may have data policies which prohibit the permanent storage of data in the cloud, which further complicates access.

How Can You Solve for Data Latency in Cloud HPC?

Adding a data access layer that caches active data closest to cloud compute resources minimizes data latency between file-based storage and compute nodes.

In this model, the caching algorithms read ahead and start populating the cache as more and more nodes come up to support the HPC workload. Because of this cache, only the active data is placed on expensive block storage instead of entire data sets. When workloads like those mentioned above access the same data hundreds of times, a data caching layer makes sense.
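
To illustrate the idea (not Avere's actual algorithm, which this post doesn't describe), here is a minimal sketch of a read-through block cache with read-ahead and least-recently-used eviction. The class name, the `fetch_block` callback, and all of the sizes are assumptions for the example.

```python
from collections import OrderedDict


class ReadAheadCache:
    """Read-through block cache with LRU eviction and sequential read-ahead."""

    def __init__(self, fetch_block, capacity, num_blocks, read_ahead=2):
        self.fetch_block = fetch_block   # slow path: block_id -> bytes
        self.capacity = capacity         # max number of cached blocks
        self.num_blocks = num_blocks     # blocks in the backing file
        self.read_ahead = read_ahead     # extra blocks to prefetch on a miss
        self.blocks = OrderedDict()      # block_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def _load(self, block_id):
        """Fetch a block from the backing store and cache it, evicting if full."""
        data = self.fetch_block(block_id)
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # drop least recently used
        return data

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        self.misses += 1
        data = self._load(block_id)
        # Prefetch the next few blocks, staying inside the file.
        last = min(block_id + 1 + self.read_ahead, self.num_blocks)
        for nxt in range(block_id + 1, last):
            if nxt not in self.blocks:
                self._load(nxt)
        return data


backend_reads = []

def slow_fetch(block_id):
    """Stand-in for a trip back to on-premises storage."""
    backend_reads.append(block_id)
    return f"block-{block_id}".encode()

cache = ReadAheadCache(slow_fetch, capacity=8, num_blocks=6, read_ahead=2)
for worker in range(100):            # 100 workers read the same 6 blocks
    for block_id in range(6):
        cache.read(block_id)

print(f"{cache.hits + cache.misses} reads, {len(backend_reads)} backend fetches")
```

In this toy run, one hundred workers issue 600 reads between them, but each block travels from the backing store only once. That ratio is the whole argument for putting a caching layer next to the compute nodes.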

Orchestration tools, such as Cycle Computing’s CycleCloud offering, can help prime data into the access layer and ensure that the data is available before too many of the cluster nodes are launched, which makes the HPC cluster more efficient and less costly.

Advantages of a Data Caching Layer

Running Cloud HPC with a data access layer offers several advantages.

  1. Data can remain in on-premises storage. Data is in the cloud only while compute nodes are working on it, and only the data that was actually required is there. This overcomes many security objections and simplifies the use of the cloud.
  2. Further, if the storage is object based, Avere also offers AES-256 encryption for data at rest as well as SSL for data in transit. Key management for encryption is available via shared key or KMIP services integration.
  3. Cloud compute performance improves. File system caching puts most of the data in RAM, close to the nodes, avoiding ingest latencies and slashing transit latencies after the first read.
  4. Scale-out is simplified. The solution supports file system connections from tens of thousands of cores, making scale-out possible.

In many situations, the data access layer can improve the user experience of cloud HPC. The Avere vFXT is a proven, reliable file system that creates a scalable data access layer to support grid computing in the cloud. In a recent webinar, I explain the concept further, joined by representatives from Google Cloud Platform and Cycle Computing. Watching should give you a solid, comprehensive foundation for building and managing a cloud HPC environment.


Watch the entire webinar on-demand: Solving Enterprise Business Challenges with Scale-Out Storage & Big Compute

Topics: HPC, Data Center Management, Technology Community