Acun’s Focus on HPC Energy Efficiency Leads to Dissertation Recognition from SIGHPC
Acun’s dissertation has been recognized with an honorable mention in the inaugural Doctoral Dissertation Award competition of the Association for Computing Machinery Special Interest Group on High Performance Computing (SIGHPC).
Acun (PhD CS ’17) said the recognition of her dissertation, “Mitigating Variability in HPC Systems and Applications for Performance and Power Efficiency,” is a gratifying reflection of the increasing importance of efficiency in high-performance computing.
“It is great to see a work on energy efficiency being recognized with this award,” she said. “The HPC community is traditionally solely focused on performance, but metrics beyond execution time are becoming increasingly important, including energy and power, and the award signifies that importance to me.”
Improving energy efficiency is important to Acun, she said, because of concerns about climate change and the growing need for energy to power high-performance computers. As noted in her dissertation, a petaflop supercomputer requires millions of dollars’ worth of machine and cooling power each year.
“(And) the size of the data centers and supercomputers keeps growing,” she said.
Acun, who is now a researcher at IBM’s Thomas J. Watson Research Center in New York, began working on processor variability as a PhD student following a suggestion by a fellow student, Phil Miller (PhD ’16 and now the owner of a computing consultancy firm). Miller and Acun were both part of Professor Laxmikant “Sanjay” Kale’s Parallel Programming Laboratory.
In her dissertation, Acun analyzed frequency, temperature, power, and application-level variations in large-scale HPC systems, identifying their causes and proposing solutions:
- Frequency variations, she found, resulted from chip-to-chip differences in power efficiency introduced by the manufacturing process. She proposed speed-aware dynamic load balancing strategies to mitigate those variations.
- The manufacturing process was also the culprit behind power variations, leading Acun to propose variation-aware node-assembly methods.
- Acun pointed to inefficiencies in fan-based cooling systems as the cause of temperature variations, and settled on decoupled fan-control mechanisms with a learning-based temperature prediction model as a solution.
- Finally, she pinpointed characteristics of the applications themselves (such as different kernel types) as the source of variance at the application level. Her solution was a fine-grained, runtime-based technique to mitigate those variations.
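To give a rough sense of what "speed-aware" load balancing means, here is a minimal sketch in Python. It is not taken from Acun's dissertation or from any runtime system; the function name `speed_aware_assign`, its parameters, and the greedy strategy are all assumptions chosen for illustration. The idea is simply that work is handed out in proportion to how fast each core is actually running, rather than evenly.

```python
def speed_aware_assign(task_costs, core_speeds):
    """Illustrative greedy balancer (hypothetical, not the dissertation's method).

    task_costs:  work per task, in arbitrary units
    core_speeds: measured relative speed of each core (higher is faster)
    Returns (assignment, finish), where assignment[c] lists the task ids
    placed on core c and finish[c] is that core's estimated completion time.
    """
    finish = [0.0] * len(core_speeds)
    assignment = [[] for _ in core_speeds]

    # Place the largest tasks first, each on the core that would finish it earliest.
    order = sorted(range(len(task_costs)), key=lambda t: -task_costs[t])
    for task_id in order:
        best = min(
            range(len(core_speeds)),
            key=lambda c: finish[c] + task_costs[task_id] / core_speeds[c],
        )
        finish[best] += task_costs[task_id] / core_speeds[best]
        assignment[best].append(task_id)
    return assignment, finish


# Six equal tasks on two cores, one running twice as fast as the other.
assignment, finish = speed_aware_assign([4, 4, 4, 4, 4, 4], [2.0, 1.0])
print(assignment, finish)
```

In this toy case the faster core receives four of the six tasks and both cores finish at the same time. An even three-and-three split would leave the faster core idle while the slower one catches up.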
“Energy and performance efficiency improvements are historically enabled by hardware improvements; however, I see runtime systems as a big opportunity for computer scientists to help,” she said.
At IBM Research, Acun continues to work on improving both the power efficiency and the performance of heterogeneous systems.
For inspiration she draws on the scientific applications that current large-scale computers and future exascale systems could make possible – new ways to treat cancer, for instance, and giving doctors a better understanding of the brain and heart through simulations.
Acun says she finds inspiration in her experience in the Parallel Programming Laboratory, where Kale – her PhD advisor – fostered a culture of collaboration and respect.
“Professor Kale has a wealth of technical knowledge in parallel computing and HPC area,” she said. “He allowed me to grow as an independent researcher and work on the topics that I cared about.”