Autonomous Vehicle Infrastructure Systems Lead, Manager - Managed AI
Company: Deloitte
Location: Austin
Posted on: June 25, 2022
Job Description:
Autonomous Vehicle Infrastructure Systems Lead, Manager - Managed AI
The Team
The Deloitte Connected and Autonomous Vehicle (CAV) team is
catalyzing and shaping the Autonomous Vehicle (AV) market through a
suite of turnkey, as-a-service solutions that deliver improved
performance and lower total cost of ownership. These solutions will
empower Automotive customers to realize their autonomy ambitions as
efficiently as possible.
High-Level Role
We are looking for a seasoned, "hands-on" HPC/AI infrastructure
systems leader who will drive the scope, detailed design, and
deployment of AV infrastructure across on-prem, cloud, and hybrid
environments. The key success measure for this prototype will be the
delivery of Deloitte's offering in POD configurations, as a service
for our customers, with guaranteed service-level agreements (SLAs)
and TCO targets.
Specifics:
- Establish the detailed specification of the DGX A100 that
reflects a representative customer's planning, deployment, and
ongoing operations optimization requirements for TCO, throughput,
scalability, and flexibility across their varied workloads
- Set up the DGX SuperPOD reference environment, including DGX
A100 compute nodes, fabrics (storage/compute), management networks
and software (DeepOps), key system software for optimizing GPU
communication, I/O, and application performance, and user runtime
tooling for containers under SLURM and Kubernetes (a sample
containerized SLURM job is sketched after this list)
- Design and document the most efficient setup to meet success
metrics (TCO, performance, scale). Specific areas of focus:
- Network switch and fabric considerations for non-blocking,
scalable bandwidth that delivers the best performance across
varying dataset sizes and locations
- Storage and caching hierarchy implementations based on training
vs. inference workloads. Establish storage management guidelines
for allocating RAM/NVMe (internal storage) and external high-speed
storage (DDN, NetApp, etc.) to optimize the performance and cost of
running varying datasets and workloads. Establish rules for when to
trigger the GPUDirect Storage (GDS) feature for lower latency and
faster I/O (an illustrative trigger rule is sketched after this
list)
- Management servers: infrastructure design and setup for enabling
user logins, provisioning (OS images and other internal
infrastructure services for the POD), workload management (resource
management and scheduling/orchestration), container management, and
system monitoring/logging
- Operations/runtime optimization of A100 compute resources
(Multi-Instance GPU, or MIG, partitions) for varying workloads, to
maximize the utilization and throughput of jobs scheduled in a given
node cluster (a MIG inventory sketch follows this list)
- Validate the commercial model with the MVP operational run/playbook
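As one illustration of the runtime tooling referenced above, the
following is a minimal sketch of submitting a containerized job
through SLURM, assuming the pyxis/enroot container plugin that
DeepOps typically deploys; the partition name, container image tag,
training script, and resource counts are placeholders rather than a
prescribed configuration.

    import subprocess
    import textwrap

    # Hypothetical batch script for a containerized job on a DGX A100 node.
    # Partition, image, script, and resource counts are placeholders.
    batch_script = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH --job-name=av-train-demo
        #SBATCH --partition=a100
        #SBATCH --nodes=1
        #SBATCH --gpus-per-node=8
        #SBATCH --time=02:00:00
        # --container-image is provided by the pyxis SLURM plugin
        srun --container-image=nvcr.io/nvidia/pytorch:22.05-py3 python train.py --epochs 1
        """)

    with open("av_train_demo.sbatch", "w") as f:
        f.write(batch_script)

    # Submit via the SLURM client available on a login node.
    subprocess.run(["sbatch", "av_train_demo.sbatch"], check=True)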
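The GDS trigger rules mentioned above would be grounded in measured
I/O profiles for the POD; the sketch below is only a hypothetical
policy helper showing the shape such a rule might take. The
WorkloadIOProfile type and the thresholds are illustrative
assumptions, not benchmarked values.

    from dataclasses import dataclass

    @dataclass
    class WorkloadIOProfile:
        dataset_size_gb: float       # working-set size of the job's input data
        read_throughput_gbps: float  # sustained read rate the job needs
        fits_in_page_cache: bool     # host page cache can hold the hot set

    # Placeholder thresholds -- in practice derived from POD benchmarks.
    GDS_MIN_DATASET_GB = 500.0
    GDS_MIN_THROUGHPUT_GBPS = 10.0

    def should_enable_gds(profile: WorkloadIOProfile) -> bool:
        """Return True if GPUDirect Storage is likely to pay off."""
        if profile.fits_in_page_cache:
            # Small, cache-friendly datasets rarely benefit from a CPU bypass.
            return False
        return (profile.dataset_size_gb >= GDS_MIN_DATASET_GB
                and profile.read_throughput_gbps >= GDS_MIN_THROUGHPUT_GBPS)

    print(should_enable_gds(WorkloadIOProfile(2000.0, 25.0, False)))  # True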
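For the MIG utilization work above, a natural starting point is
simply inventorying the MIG instances on each A100. This sketch
assumes the nvidia-ml-py (pynvml) bindings and MIG-capable GPUs, and
only lists instances and their memory; it does not prescribe a
partitioning scheme.

    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
            if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
                print(f"GPU {i}: MIG disabled")
                continue
            # Walk the MIG instances carved out of this GPU.
            for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                except pynvml.NVMLError:
                    continue  # slot not populated
                mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
                print(f"GPU {i} / MIG {j}: "
                      f"{mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
    finally:
        pynvml.nvmlShutdown()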
Minimum Qualifications:
- Bachelor's degree or equivalent experience in Computer
Architecture, Computer Science, Electrical Engineering, or a related
field; advanced degree preferred
- 6+ years of proven experience in the design, deployment, and
operation of production-grade HPC environments leveraging both
SLURM and Kubernetes clusters
- Deep understanding of scale-out compute, networking, and
external storage architectures for optimizing the performance and
acceleration of AI/HPC workloads
- Proven experience deploying, upgrading, migrating, and driving
user adoption of sophisticated, enterprise-scale systems
- Prior software and solutions development background, with a
proven ability to demonstrate complex new technologies
- Programming skills to build distributed storage and compute
systems, backend services, microservices, and web technologies
- Well-versed in agile methodology
- Comfortable in a customer-focused, fast-paced environment
- Ability to travel up to 50% on average, based on the work you
do and the clients and industries/sectors you serve
- Limited immigration sponsorship may be available
AI&DE23