Deloitte is offering AI-based infrastructure as a service (IaaS): a single-stop, end-to-end managed service for customers doing autonomous vehicle (AV) development, test, and simulation. We will offer this service based on NVIDIA's DGX A100 SuperPOD reference design in on-premises or colocation configurations.
A key part of building this practice will be setting up an internal DGX SuperPOD reference and training environment as a first prototype in our Deloitte data center, with all automation services built in, so we can offer it as a consumption-based (pay-by-the-drip) service to automotive customers.
6+ years' experience with DGX SuperPOD and DGX A100 compute nodes, storage and compute fabrics, management networks, and software (DeepOps), including the key system software for optimizing GPU communication, I/O, and application performance
4+ years' experience establishing storage management guidelines for internal storage (RAM/NVMe) and external high-speed storage (e.g., DDN, NetApp) allocation to optimize the performance and cost of running varying datasets and workloads
4+ years' experience in the design, deployment, and operation of production-grade HPC environments leveraging both SLURM and Kubernetes clusters
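To illustrate the SLURM side of this experience, a minimal batch script for a multi-node GPU job might look like the sketch below; the job name, partition, and training script are hypothetical placeholders, not names from this environment:

```bash
#!/bin/bash
#SBATCH --job-name=train-demo     # hypothetical job name
#SBATCH --partition=a100          # hypothetical partition for A100 nodes
#SBATCH --nodes=2                 # request two DGX nodes
#SBATCH --gres=gpu:8              # 8 GPUs per node
#SBATCH --time=04:00:00           # wall-clock limit

# Launch one task per GPU across the allocation.
srun --ntasks-per-node=8 python train.py
```

Submitted with `sbatch`, a script like this lets the scheduler place the job across the pod's compute fabric rather than pinning it to specific nodes by hand.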
Deep understanding of scale-out compute, networking, and external storage architectures for optimizing the performance and acceleration of AI/HPC workloads
Management servers: infrastructure design and setup to enable user logins, provisioning (OS images and other internal infrastructure services for the pod), workload management (resource management and scheduling/orchestration), container management, and system monitoring/logging
Operations and runtime optimization of A100 compute resources (MIG partitions) for varying workloads
Working experience with git, conda, pip, yum, apt, zypper, julia, npm, and a multitude of other installation frameworks
Development of Docker containers to process AI/ML/DL workloads in HPC environments.
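As a sketch of this kind of containerization, a minimal Dockerfile for a GPU training workload might build on an NVIDIA NGC base image; the image tag, requirements file, and script name here are illustrative assumptions:

```dockerfile
# Illustrative base image; choose an NGC tag matching your CUDA/driver stack.
FROM nvcr.io/nvidia/pytorch:23.10-py3

WORKDIR /workspace

# Install workload-specific Python dependencies.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Add the training entrypoint.
COPY train.py .
ENTRYPOINT ["python", "train.py"]
```

Building on an NGC image keeps CUDA, cuDNN, and framework versions mutually compatible, which is the usual pain point when containerizing DL workloads for HPC clusters.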
Debugging code at all levels using gdb, strace, tcpdump, Wireshark, and other tools to find the root cause of issues.
Familiarity with deep learning frameworks and libraries such as PyTorch, TensorFlow, and cuDNN, and with integrating them with MPI libraries such as Open MPI and MVAPICH2.
BE in Computer Science or Engineering, Master's degree, or equivalent experience in Computer Architecture, Computer Science, Electrical Engineering, or a related field.
Travel up to 30% of the time (Monday - Thursday/Friday). (While 30% of travel is a requirement of the role, due to COVID-19, non-essential travel has been suspended until further notice.)
Limited Immigration sponsorship may be available
Proven experience deploying, upgrading, migrating, and driving user adoption of sophisticated enterprise-scale systems.
Creating custom Python-based metrics and analytics solutions to profile HPC and Hadoop workloads
Creating custom reporting dashboards in Grafana from Prometheus Kubernetes metrics.
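A minimal sketch of the kind of custom Python metrics work described above: a helper that rolls per-node GPU utilization readings up into per-node summaries suitable for export to a dashboard. The function name, node names, and sample values are all hypothetical.

```python
from statistics import mean

def summarize_gpu_utilization(samples):
    """Aggregate per-node GPU utilization readings (percent) into
    simple summary metrics for dashboard export.

    `samples` maps node name -> list of utilization readings.
    Returns node -> {"avg": ..., "peak": ..., "samples": ...};
    nodes with no readings are skipped.
    """
    report = {}
    for node, readings in samples.items():
        if not readings:
            continue
        report[node] = {
            "avg": round(mean(readings), 1),
            "peak": max(readings),
            "samples": len(readings),
        }
    return report

# Hypothetical readings from two DGX nodes.
raw = {
    "dgx-01": [88, 92, 96, 90],
    "dgx-02": [40, 35, 55, 60],
}
print(summarize_gpu_utilization(raw))
```

In practice a collector like this would read from a Prometheus exporter or `nvidia-smi` output rather than a hardcoded dict, but the aggregation shape is the same.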
Programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies.
Well-versed in Agile methodology.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability or protected veteran status, or any other legally protected basis, in accordance with applicable law.