Anne Benoit & Yves Robert - Joint Lecture
Variable Capacity Scheduling
Abstract: We formalize the problem of scheduling jobs on a set of machines, each having a fixed number of resources and a probability of being alive, computed according to some probability distribution, e.g., a random walk. The goal is to maximize utilization. Several heuristics are designed to schedule jobs efficiently by deciding which machines to use, trading off the load against the probability that a machine will not survive until job completion. We will present simulation results, based on real traces, to compare the performance of the various heuristics.
Checkpointing à la Young/Daly
Abstract: The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a large-scale platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications. It has recently been extended to other application and resilience frameworks, such as workflows, silent errors, and imprecise knowledge of key parameters (MTBF and checkpoint duration). We provide some background and survey various scenarios to assess the usefulness and limitations of the formula.
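For readers unfamiliar with the formula, its commonly cited first-order form gives the checkpoint period as sqrt(2 * MTBF * C), where C is the checkpoint duration. The sketch below is only an illustration of that form under fail-stop errors; the function name, parameter names, and sample values are assumptions made for this example and are not taken from the lecture.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly approximation of the checkpoint period.

    mtbf: platform mean time between failures (hypothetical parameter name).
    checkpoint_cost: time needed to take one checkpoint, in the same unit.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


if __name__ == "__main__":
    # Illustrative values only: a one-day platform MTBF and a 10-minute checkpoint.
    period = young_daly_period(mtbf=86_400.0, checkpoint_cost=600.0)
    print(f"approximate checkpoint period: {period:.0f} s")
```

The approximation is generally considered reasonable only when the checkpoint duration is small compared with the MTBF, which is one of the limitations the abstract alludes to.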
Speakers
Anne Benoit
Anne Benoit is currently an Associate Professor in the Computer Science Laboratory LIP at ENS Lyon, France, and a Senior Member of Institut Universitaire de France. She is Editor-in-Chief of the journal Parallel Computing, Chair of the IEEE CS Technical Committee on Parallel Processing (TCPP), and a senior member of the IEEE. She has chaired the program committee of several major conferences in her field, in particular SC, IPDPS, ICPP and HiPC. Her research interests include algorithm design and scheduling techniques for parallel and distributed platforms, with a focus on energy awareness and resilience. See bit.ly/abenoit for further information.
Yves Robert
Yves Robert is a Full Professor at ENS Lyon, a Fellow of the IEEE and a former Senior Member of Institut Universitaire de France. He received the 2014 IEEE TCSC Award for Excellence in Scalable Computing, the 2016 IEEE TCPP Outstanding Service Award, and the 2020 IEEE CS Charles Babbage Award. He has held a Visiting Scientist position at the Innovative Computing Laboratory, University of Tennessee Knoxville, since 2011. His main research interests are scheduling techniques, parallel algorithms and resilient approaches for large-scale platforms. See http://graal.ens-lyon.fr/~yrobert/ for further information.