AI for Science – Early Lessons from NERSC’s Perlmutter Supercomputer

Roughly a year ago the National Energy Research Scientific Computing Center (NERSC) launched Perlmutter, hailed at the time as the "world's fastest AI supercomputer" by Nvidia, whose GPUs provide much of Perlmutter's power. Since then, NERSC has been aggressively ramping up its mixed AI-HPC workload capability – software, early science applications, AI tools, training, and so on. What has been learned so far?
At this week's AI Systems Summit, Wahid Bhimji, group lead and big data architect in the data and analytics services group at NERSC, provided a fast-moving tour of NERSC/Perlmutter's leap into the AI-for-science world.
"We see at NERSC that AI for science has matured beyond proofs of concept and actually into production. But it's only on the verge of having a transformative impact," said Bhimji. "To do that can require using supercomputing scale and also coupling to existing scientific software, large-scale scientific simulations, and also large scientific datasets. That's a role for centers like NERSC, but work is needed across model development and applications, as well as in deploying suitable computing and the tools, technologies, and methods to use this computing."

Named for the Nobel Prize-winning cosmologist Saul Perlmutter, NERSC's latest system (specs in the slide above) is still in "an early science phase where we're exploiting this system for particular codes to shake it out, but not charging the normal way for hours," said Bhimji. Perlmutter comprises 12 cabinets with over 6,000 GPUs, and also features an all-flash Lustre file system. "Phase two of this system is coming soon and includes CPU-only cabinets for [when] we run other science codes that can't necessarily make use of GPUs. It will also include an upgrade to the whole system's networking, making use of HPE/Cray's new Slingshot Ethernet-based high-performance network."
In line with expectations, NERSC has seen a jump in AI workflow submissions, noted Bhimji.
"We know this through instrumentation, which guides what we can deploy and ask. For example, we've instrumented a large fraction of the Python imports on the system, whether or not they use our Python software. We have a link [taken from the slide below] to the paper that shows how we do this. Through this, we can learn a number of lessons – for example, the large growth in users of PyTorch and TensorFlow, the overall number tripling from 2018 to 2020, and then doubling again in '21."
NERSC also runs a regular user survey. "We can see that we have deep learning users [and] machine learning users across different science disciplines. [We've also seen] that there's a need for computing scale, in that people's models often take days or even weeks on a single GPU, or on single resources without using distributed ones."
Bhimji roughed out some of the lessons learned from early deployments and the survey. Broadly, the insights NERSC is gleaning now will help it prepare its various systems, including Perlmutter, for broader use by scientists seeking to apply AI. The lessons are also useful, he hopes, for other institutions.
"The first is that we see a demand for installations where functionality and performance are sort of guaranteed by us. From this survey we could see, perhaps surprisingly, that the majority of people actually use the modules we provide. But also, people need to be able to customize and install their own packages alongside the software. So we need to support recipes for cloning an environment and building on top of it. For Perlmutter, we decided to explore and currently provide both our own compiled software but also make use of the NGC containers that Nvidia provides," he said.
"Now, not all HPC centers support containerization, but we have [supported it] for a while through a technology called Shifter, which makes performant, secure use of Docker containers, works well with the NGC containers, and allows you to pull them in directly. This was really important in the deployment phase of Perlmutter to ensure a stable and performant software environment despite changes in the underlying system software stack. That said, we did have some deployment issues [that], thanks to a close collaboration with Nvidia, we were able to resolve – this includes things [like] differences between the Docker and Shifter container stacks."
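The Shifter workflow Bhimji describes can be sketched as a short job-script fragment. This is a hypothetical illustration, not a script from the talk: the NGC image tag, account name, and `train.py` are placeholders, with the commands modeled on Shifter's documented pull-and-run pattern.

```shell
# Pull an NGC image into the center's Shifter image gateway
# (image tag is illustrative):
shifterimg pull nvcr.io/nvidia/pytorch:21.05-py3

# Minimal Slurm batch script using that image on a GPU node:
#   #!/bin/bash
#   #SBATCH --constraint=gpu
#   #SBATCH --image=nvcr.io/nvidia/pytorch:21.05-py3
#   srun shifter python train.py
```

The point of the pattern is the one Bhimji makes: the container pins the deep learning stack, so user environments stay stable even as the system software underneath changes.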

Given AI's relative newness, it's important to give scientists the flexibility to try different AI approaches, said Bhimji.
"Scientists need the ability to experiment, particularly in this phase, where they're exploring lots of different AI models for their science. To do this they need interactivity, and one way to offer that is through JupyterHub. We have a very popular JupyterHub service at NERSC that has well over 2,000 users. It's also a favorite way for people to develop machine learning code. So, all of these bars here in the survey (slide below) are people using Jupyter either at NERSC or elsewhere. At NERSC you can use Jupyter with either shared resources or even dedicated GPU nodes. You can get four GPUs just for your workload, or even multiple GPU nodes. It's possible to start services that wait on the batch system but can then get quite large distributed resources. We also provide kernels for deep learning software with optimized versions, but people can also build their own kernels."
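The "four GPUs just for your workload" path Bhimji mentions corresponds to an interactive batch allocation. The fragment below is a hypothetical sketch modeled on NERSC's documented Slurm usage; the account name is a placeholder, and exact QOS names and limits vary by system.

```shell
# Hypothetical interactive request for one GPU node with four GPUs
# (account "m0000" is a placeholder):
salloc --nodes 1 --qos interactive --time 01:00:00 \
       --constraint gpu --gpus 4 --account m0000
```

A Jupyter kernel launched inside such an allocation sees the GPUs directly, which is what makes the notebook-driven experimentation he describes practical.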
Automation is another interesting element to offer, said Bhimji.
"A particular area for this is hyperparameter optimization. Model selection and tuning are still quite important in these deep learning models for getting performance, and this is computationally expensive, which means a need for HPC again. But many different tools exist for performing this, and we have to support using quite a number of these at NERSC. We've also seen that some [of these] can need adaptation to work well with our batch systems, our batch-system policies, and the multi-user environment we're in. So we have some work – for example, this blog post describes some work with Ray Tune to really enable these tools to work well on our systems," he said.
As seen at the bottom right of the third slide below, Ray Tune was used to dramatically cut runtime on graph neural network models used in a NERSC catalyst deep learning project.
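The hyperparameter search Bhimji describes can be sketched in a few lines. This is not NERSC's actual Ray Tune setup – just a minimal random-search loop over a hypothetical objective – but it shows the pattern that tools like Ray Tune scale out across GPU nodes with scheduling, early stopping, and distributed trial execution.

```python
import random

def train_model(lr, hidden_dim):
    # Stand-in for an expensive training run that returns a validation loss;
    # the real workloads Bhimji describes take hours per trial on GPUs.
    return (lr - 0.01) ** 2 + 0.001 * abs(hidden_dim - 256)

def random_search(n_trials, seed=0):
    # Minimal random-search loop; frameworks like Ray Tune layer scheduling,
    # early stopping, and distributed execution on top of this idea.
    rng = random.Random(seed)
    best_loss, best_cfg = float("inf"), None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-4, -1),            # log-uniform learning rate
            "hidden_dim": rng.choice([64, 128, 256, 512]),
        }
        loss = train_model(**cfg)
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_loss, best_cfg

best_loss, best_cfg = random_search(50)
```

Each trial is independent, which is why this maps so naturally onto many GPUs at once – and why adapting such tools to batch-system policies, as Bhimji notes, is mostly a scheduling problem.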

Aside from tools to support individual researchers, it is also important to tune the overall system for AI workflow requirements. Perhaps surprisingly, part of how NERSC has done this is by participating in the MLPerf exercise, including work on creating the MLPerf HPC benchmark (HPC v0.7), which debuted at SC20 and was run again at SC21 (HPC v1.0). The latest version included a new benchmark, OpenCatalyst, and also separated out strong scaling and weak scaling. The list of participating systems was impressive: Fugaku, Piz Daint (CSCS), Theta (ANL), Perlmutter (NERSC), JUWELS Booster (Jülich SC), HAL cluster (NCSA), Selene (Nvidia) and Frontera (TACC).
"We've been heavily involved from the start of this HPC working group within the MLPerf community, and it aims to look at not only training in general but particularly training for scientific applications, and particularly [use of] HPC resources," said Bhimji. He noted that the new weak-scaling metric "really allows you to fill a big system with lots of models. So, they got submissions from various large HPC centers around the world. The [year-to-year] results are improved, which shows some progress here, with large-scale submissions both for the strong-scaling time-to-train benchmark of a single model, and also these weak-scaling submissions at large scale on Perlmutter and on the world's number-one system, Fugaku."
"So what does this mean for Perlmutter? We were able to run this very early in Perlmutter's deployment, which was really helpful for shaking out the system and ensuring the scale of deep learning we wanted to do on the machine. We got some reasonable results – you'd really want to compare with the full results – but I can tell you that we had the leading time-to-train results for the OpenCatalyst benchmark and a close second place to a well-tuned system, Nvidia's Selene supercomputer. We also had the largest-scale GPU run, making use of a large fraction of the Perlmutter machine," he said.
Bhimji noted, "It was okay coming second place to Selene because it allowed us to do some in-depth profiling afterwards to understand why we had any difference. From that, we could see that the dominant bottleneck was actually the network. And so I don't expect you to read this profile, but the all-reduce phase was actually quite a bit slower than on Selene. But actually this is good news, because we know that Perlmutter is having its network upgraded, and we expect a potentially 4x improvement just from the hardware. We also even understood the smaller one-node differences that we saw as coming from unoptimized kernels in this particular implementation, which is in MXNet. Those kernels will probably be improved, but for the moment are memory-bandwidth bound, since Selene was using the A100 GPU's larger (80GB) memory."

Bhimji also presented brief summaries of early science work that illustrated AI's capacity to speed analysis (astronomy), improve simulation (weather/climate), and automate labeling (catalyst dataset). The following slides summarize all three projects.
