About this role

Job Summary: In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required — your domain knowledge is what matters.

Skills

Linux, Kubernetes, Prometheus

Key responsibilities

Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus.
Monitor system health, analyze performance metrics, and proactively address bottlenecks or potential failures.
Automate operational processes to minimize manual intervention and increase system reliability.
Respond swiftly to incidents, conduct root cause analysis, and drive continuous improvements in incident response procedures.
Collaborate closely with development and operations teams to deliver seamless deployments and high system availability.
Create comprehensive documentation and clear runbooks for operational excellence and knowledge sharing.
Champion best practices in SRE, security, and compliance across the customer's ecosystem.

Required skills & qualifications

Expert-level hands-on experience with Linux system administration and troubleshooting.
Advanced proficiency with Kubernetes, including cluster deployment, operations, and management.
Deep knowledge of Prometheus for monitoring, metrics collection, and alerting.
Strong scripting abilities (Bash, Python, or similar) for automation and tooling.
Excellent written and verbal communication skills, with the ability to document and share knowledge effectively.
Proven track record in site reliability engineering or similar roles in high-availability environments.
Demonstrated commitment to proactive problem-solving and collaborative teamwork.

Preferred qualifications

Experience with other cloud-native tools (e.g., Grafana, Helm, Istio, or similar).
Certifications in Kubernetes, Linux, or cloud platforms.
Background in high-growth or large-scale production environments.

Apply on micro1 →

This role is posted on our partner platform. When you click Apply, you'll go to the posting, where the application, interview, skill validation, and onboarding all happen. lehico is an independent site that surfaces these opportunities — we don't process applications or guarantee acceptance.

Site Reliability Engineer
