Do IT Now is looking for a Senior HPC Engineer to join an established technical group delivering HPC and AI infrastructure services worldwide to customers across research, automotive, life sciences and manufacturing.
The role is responsible for the design, deployment and operational excellence of HPC and AI clusters across customer sites and managed environments.
Cluster Design and Deployment: Lead the architectural design and on-site or remote deployment of HPC and AI clusters, covering GPU systems, high-speed interconnects, parallel file systems and management infrastructure.
Workload Managers: Configure, tune and maintain schedulers such as Slurm, PBS and Grid Engine.
Fabric Engineering: Deploy and operate high-speed network fabrics (InfiniBand NDR, RoCE), including topology design, subnet manager (SM) configuration, validation and testing.
Customer Interface: Serve as the technical counterpart for customer and partner architects and operations leads.
Technical Writing and Training: Author runbooks, design documents, acceptance documents and handover materials. Provide technical training sessions for customers and mentor junior engineers.
Essential skills
Preferred requirements