Senior Cloud Operations Engineer, PyTorch
Linux Foundation | USA – Remote | Full-time
Linux Foundation Overview
The Linux Foundation is a driving force in fostering open source collaboration and supporting communities across a range of projects, including PyTorch. We are dedicated to enhancing and expanding our infrastructure to meet the growing demands of PyTorch and related AI projects. We are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.
Role Overview
The Senior Cloud Operations Engineer will play a pivotal role in the PyTorch Foundation, leading cloud infrastructure and DevOps initiatives. This position is crucial for maintaining and optimizing the technical operations that support PyTorch, one of the world's leading open source machine learning frameworks. The ideal candidate will blend expertise in cloud technologies, DevOps practices, and open source collaboration to ensure PyTorch's infrastructure remains robust, secure, and efficient.
Responsibilities
Cloud Infrastructure Management
Manage multi-cloud environments, primarily focusing on AWS services (EKS, EC2, S3, IAM, ELB)
Contribute to architectural exercises with the open source community and technical leads to validate new cloud infrastructure
Implement and maintain infrastructure-as-code using Terraform via pytorch/ci-infra and pytorch/test-infra
Optimize cloud resource utilization and implement FinOps practices for cost management and reporting
CI/CD and DevOps
Design, implement, and maintain CI/CD pipelines using GitHub Actions and Actions Runner Controller (ARC), including runner configurations and other elements of the CI ecosystem
Debug and triage issues in build and test pipelines, drawing on experience with unit testing
Develop monitoring and alerting solutions for CI/CD workflows and critical infrastructure
Performance Optimization and Security
Manage and optimize Cloudflare CDN deployments for PyTorch assets (R2/S3)
Implement best practices for CDN and overall infrastructure security
Monitoring and Incident Response
Develop comprehensive monitoring and observability solutions using Datadog, AWS CloudWatch, and other telemetry data collection and processing tools
Review and recommend monitoring solutions as project and community needs evolve
Participate in on-call rotations supporting operations and incident response using incident.io
Establish and maintain escalation procedures and resolution processes
Community Collaboration and Project Management
Participate in ci-infra and multi-cloud working groups and support architecture decisions
Collaborate with external contributors and promote DevOps best practices
Manage GitHub repositories, including user onboarding and access control
Attend and contribute to technical meetings, including Infrastructure, CI Workflow, and Technical Advisory Council sessions
Documentation and Best Practices
Develop and maintain technical documentation for infrastructure and processes
Provide guidance on developer best practices and tooling
Create and update runbooks for common operational tasks and incident response
Qualifications and Experience
Required
Ability to work with communities of industry specialists and to collaborate with people outside the Linux Foundation
Bachelor's degree in Computer Science, Engineering, or a related field
7+ years of experience in cloud operations with significant AWS expertise
Strong knowledge of infrastructure-as-code principles and tools, particularly Terraform
Proficiency in scripting languages (Python, TypeScript, Bash) and containerization technologies (Docker, Kubernetes)
Experience with Cloudflare CDN management and optimization
Expertise in implementing and managing monitoring solutions, specifically Datadog and AWS CloudWatch
Familiarity with incident management tools and processes, particularly incident.io
Demonstrated experience in CI/CD pipeline design and implementation
Strong problem-solving skills and ability to troubleshoot complex systems
Excellent communication skills and experience collaborating with open source communities
Preferred
Experience with PyTorch or other open source communities
Multi-cloud expertise across AWS, GCP, and Azure
GitHub ARC experience
Knowledge of FinOps principles and cloud cost optimization strategies
Contributions to open source projects, especially in infrastructure management roles
Familiarity with the Linux Foundation or similar open source foundations
Experience mentoring other engineers and fostering a collaborative team environment
Additional Information
Salary: $95,000 - $133,000 USD
About Us
The Linux Foundation maintains a predominantly remote workforce and is committed to hiring top-notch talent. We are as passionate about providing a flexible and supportive work culture as we are about open source software. Collaboration is embedded in our DNA, and we take pride in our ability to work closely together while not being confined to a traditional office space.
The Linux Foundation is unable to provide visa sponsorship for this position. Candidates must be authorized to work in their country of residence without employer sponsorship, now or in the future.
Department: IT
Locations: Remote
About The Linux Foundation
The Linux Foundation is a neutral, trusted hub for developers and organizations to code, manage, and scale open technology projects and ecosystems.