For a half century, supercomputers have performed the tasks that excite people’s imaginations – crunching enough data to simulate nuclear tests, map the human genome, and target the precise locations to drill new oil wells. And they’re taking on even bigger roles today. These high-performance computing (HPC) systems are powering a new wave of data-intensive applications that rely on artificial intelligence (AI), machine learning (ML), GPU-accelerated 3-D imaging, and the Internet of Things (IoT).
HPC systems are big and strong, but sometimes things can go wrong. They perform quadrillions of calculations per second, and HPC clusters usually consist of thousands of compute servers that are networked together. If they go down or mis-map connections to key stores of data, they can delay important projects. In addition, they require intensive maintenance, updates, and system checks.
Given HPC’s vast importance – as the foundation for technical and societal advancements – it’s critical that these systems perform at the highest levels. Many organizations give their HPC infrastructure the attention it needs. However, some choose to invest more in hardware and software than in system optimization and support.
This happens for a variety of reasons. HPC users at some large organizations have been known to value the cachet that comes with using one of the most powerful computers in the world. When they want to go faster, they invest in “horsepower” – additional nodes and accelerators. Some IT leaders running HPC systems feel that having multiple high-powered systems with so much redundancy makes it less necessary to invest in support. Others are reluctant to update long-running HPC systems regularly because they feel changes could introduce unnecessary complexity and risk.
Savvy organizations realize that HPC needs to be nurtured, improved, updated, and continuously optimized to get the most out of their investments. This can be done in house, if they have the expertise, or by contracting with an outside vendor that specializes in third-party support if they don’t have the resources or prefer to dedicate them to activities that are more strategic for their business. Here are a few ways organizations can optimize their HPC systems and increase ROI from their HPC environments.
Perform regular health checks
Basic oversight often falls low on organizations’ list of priorities. Make sure to complete health checks at least twice per year – every quarter, if possible. Check the software and firmware versions. Check interdependencies to avoid introducing any incompatibilities that could percolate throughout the environment and impact uptime and performance. Proactive maintenance of the liquid cooling system can also save a lot of trouble. If it goes down, the system could heat up and damage CPUs and other components.
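As one illustration of the version check described above, the following Python sketch compares each node's reported software and firmware versions against an approved baseline and lists any drift. It is a minimal sketch under stated assumptions, not a vendor tool: the node names, component names, version strings, and the find_version_drift function are all hypothetical, and a real site would load the inventory from its own cluster manager or asset database.

```python
from collections import defaultdict

# Hypothetical inventory: node name -> {component: reported version}.
# In practice this would be exported from the site's own tooling.
INVENTORY = {
    "node001": {"bios": "2.4.1", "bmc": "1.10", "nic_fw": "16.32"},
    "node002": {"bios": "2.4.1", "bmc": "1.10", "nic_fw": "16.32"},
    "node003": {"bios": "2.3.0", "bmc": "1.10", "nic_fw": "16.32"},
    "node004": {"bios": "2.4.1", "bmc": "1.09", "nic_fw": "16.32"},
}

# Hypothetical approved baseline for this cluster.
BASELINE = {"bios": "2.4.1", "bmc": "1.10", "nic_fw": "16.32"}


def find_version_drift(inventory, baseline):
    """Return {node: {component: (found, expected)}} for every mismatch."""
    drift = defaultdict(dict)
    for node, components in inventory.items():
        for component, expected in baseline.items():
            found = components.get(component)  # None if not reported at all
            if found != expected:
                drift[node][component] = (found, expected)
    return dict(drift)


if __name__ == "__main__":
    report = find_version_drift(INVENTORY, BASELINE)
    for node, issues in sorted(report.items()):
        for component, (found, expected) in sorted(issues.items()):
            print(f"{node}: {component} is {found}, baseline is {expected}")
```

Nodes that match the baseline are left out of the report, so an empty result means no drift was found among the components listed in the baseline.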
Be proactive
Regular health checks make sure the system is running the way it’s supposed to today. But what about the future? Is the organization looking at different kinds of projects down the line that might require new applications, new workloads, and new configurations? Rather than wait for a project to get introduced, look at new versions of hardware, software, storage, and networking that could create more flexibility when the time comes, and keep the compute performance at its optimum level.
Follow best practices
In terms of performance, organizations should look at best practices in the industry based on their use of the system. Enterprises should examine the use cases they’re working on, how they’re configuring their systems, what applications they’re running, and which system architecture the apps are running on. If they change a set of workloads, the old configuration might not be as efficient anymore. If the data they had at the beginning is on premises and they want to include more data coming from the edge, they might need to adjust their system architecture.
Keep critical spares on hand
Organizations looking for short response times for mission-critical issues can arrange with a support vendor to keep important spare parts, such as compute blades, on site. If a node malfunctions for a few hours, jobs can be distributed to other nodes. But the admin servers that host the cluster manager software are critical – if one goes down, it impacts the whole system.
Don’t forget network switching. If systems don’t have the right connection between their compute blades and data storage, they lose performance. Enterprises need to maintain the communication and data flows between where their data is stored and where it needs to go. Any issues with networking parts and switches will degrade system performance, which could delay the introduction of a new product.
Let a third-party expert do the troubleshooting
A reliable HPC services expert can help organizations dedicate fewer resources to system maintenance and optimization and keep their focus on higher-value activities. HPC systems may incorporate IP from a lot of different vendors. From a procurement perspective, an organization can handle all the interactions with outside vendors – or it could contract with a support vendor to manage the process for it. The third party can bring technology issues to resolution or, if this is the preferred option, just identify the issue and let the client deal directly with the vendor. Additionally, onsite Customer Engineers can act as an extension of customer staff and help meet performance SLAs through proactive maintenance.
Consider HPC as a service
Buying and maintaining HPC environments require a huge capital expense (capex) up front and ongoing operational costs (opex) to cover energy expenses, human resources, and maintenance and repair costs. Shifting to an opex model, where the organization moves to the cloud and pays on a month-by-month basis, takes away the initial hit that many smaller organizations can’t afford. Larger organizations can maximize their flexibility through this model as well. For example, a car manufacturer that has budget for one data center but needs to double its compute power to accelerate the development of autonomous driving can take advantage of cloud models – and cloud-like models delivering services on-premises – to keep research on track.
“[In] the last five or six years, there has been a steady shift of HPC to the cloud,” says Srini Chari, managing partner of Cabot Partners Group, an IT analyst firm in Connecticut. “It’s the fastest growing part of the HPC market. What’s happening here is that a lot of companies are realizing it’s quite a headache for them to actually manage infrastructure because of the rate and pace of change in technology and the skills needed to operate on-premises HPC. So, instead of buying technology, they are looking to use it as a cloud service.”
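To make the capex-versus-opex trade-off concrete, the short Python sketch below finds the month at which owning a system stops being more expensive than paying a monthly service fee. It is only an illustrative calculation: the function names are hypothetical, the dollar figures in the usage note are invented for the example, and a real comparison would also account for factors such as depreciation, utilization, and staffing.

```python
def cumulative_owned_cost(upfront, monthly_running, months):
    """Total cost of owning: capex up front plus monthly running costs."""
    return upfront + monthly_running * months


def cumulative_service_cost(monthly_fee, months):
    """Total cost of an as-a-service model billed month by month."""
    return monthly_fee * months


def break_even_month(upfront, monthly_running, monthly_fee, horizon_months=120):
    """Return the first month at which owning costs no more than the service.

    Returns None if owning never catches up within the horizon, for example
    when the monthly fee is lower than the monthly running cost of owning.
    """
    for month in range(1, horizon_months + 1):
        owned = cumulative_owned_cost(upfront, monthly_running, month)
        service = cumulative_service_cost(monthly_fee, month)
        if owned <= service:
            return month
    return None
```

With hypothetical figures of 1,200,000 up front, 20,000 per month to run, and a 60,000 monthly service fee, break_even_month(1_200_000, 20_000, 60_000) returns 30: before month 30 the service model has cost less in total, which is the period in which it spares smaller organizations the initial outlay.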
Conclusion
HPC systems are some of the most impactful drivers of technology today. They continue to run complex simulations, and they’ll be relied upon to run the AI applications that enterprises will use to develop their businesses in the future. To get the most out of AI, organizations will need to optimize their HPC environments. Making sure they’re operating smoothly and steering clear of risk will help organizations generate business value.