AI Agent Infrastructure Engineers
Job description
- Design, build, and optimize infrastructure for training, deploying, and scaling AI agents across distributed systems.
- Develop robust backend services, APIs, and orchestration frameworks that support multi-agent workflows and high-performance compute environments.
- Collaborate closely with research and product teams to integrate model-serving pipelines, memory systems, and reasoning components.
- Implement monitoring, observability, and failover mechanisms to ensure high system reliability and fault tolerance.
- Evaluate and refine infrastructure performance, identifying bottlenecks and improving efficiency across data, compute, and model layers.
- Participate in synchronous collaboration sessions (4-hour windows, 2-3 times per week) to review architecture decisions, troubleshoot distributed systems, and iterate on design improvements.
Requirements
Screening questions: Do you have experience in Rust (programming language)? Do you have a Master's degree?
- Strong background in Computer Science, Software Engineering, or Systems Design, with a focus on large-scale distributed infrastructure.
- Experience with cloud computing (AWS, GCP, or Azure) and containerization/orchestration tools such as Docker and Kubernetes.
- Proficiency in backend programming languages such as Go, Rust, Python, or C++.
- Familiarity with LLM inference pipelines, multi-agent architectures, or reinforcement learning environments is a strong plus.
- Knowledge of network optimization, data streaming, and caching architectures is preferred.
- Excellent collaboration and communication skills.
- Ability to commit 20-30 hours per week, including required synchronous collaboration sessions.