Senior HPC Performance Engineer
Job description
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters
- Study how our libraries interact with all HW (GPU, CPU, networking) and SW components in the stack
- Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available
- Triage and root-cause performance issues reported by our customers
- Collect large volumes of performance data; build tools and infrastructure to visualize and analyze that data
- Collaborate with a very dynamic team across multiple time zones
Requirements
- M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field, with relevant performance engineering and HPC experience
- 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Experience conducting performance benchmarking and triage on large-scale HPC clusters
- Good understanding of computer system architecture, HW/SW interactions, and operating system principles (i.e., systems software fundamentals)
- Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required
- Ability to debug performance issues across the entire HW/SW stack
- Proficiency in a scripting language, preferably Python
- Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)
- Adaptability and a passion for learning new areas and tools; flexibility to work and communicate effectively across different teams and time zones
Ways to stand out from the crowd:
- Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, and congestion control
- Experience debugging network issues in large-scale deployments
- Familiarity with CUDA programming and/or GPUs
- Experience with deep learning frameworks such as PyTorch or TensorFlow
Benefits & conditions
NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. Our teams are composed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.