
SM - SRE/Kubernetes Engineer
- Remote
- Amsterdam, Noord-Holland, Netherlands
- Team SMART
Job description
Profile description: SRE/Kubernetes Engineer
Developer profile: This role is a mix of software engineering and systems administration, where you will use your coding skills to automate operational tasks, manage infrastructure as code, and reduce manual toil. You will contribute to the reliability, scalability, and performance of our systems, with a specific focus on our GCP infrastructure.
Key Responsibilities:
System Reliability and Performance: Design, build, and maintain scalable, highly available, and secure systems on GCP. Define and track key metrics such as Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the error budget to balance reliability and innovation.
Automation and Toil Reduction: This is a core function of the role. You will be responsible for creating sustainable systems and services through automation. You will develop and implement automation for routine operational tasks, infrastructure management, and CI/CD pipelines to improve efficiency and reduce human error.
AI Integration: Leverage AI and machine learning tools to enhance SRE practices. This includes using AI-powered observability to analyze vast amounts of data and detect subtle anomalies and patterns that would be invisible to human observers, as well as using AI for predictive analytics to anticipate potential issues before they impact users. You will also use AI tools to automate root cause analysis and streamline incident response.
Cloud Infrastructure Management: Use Infrastructure as Code (IaC) principles to manage our GCP environment, including container orchestration platforms like Kubernetes.
Incident Management and Response: Participate in the on-call rotation to respond to and resolve critical incidents. Conduct blameless post-mortems and implement long-term fixes to prevent recurrence.
Observability: Implement and maintain comprehensive monitoring, alerting, and logging solutions to provide real-time visibility into system health and performance. You will build monitoring that alerts on user-facing symptoms early, before they escalate into outages.
Required Skills and Experience:
Experience: A minimum of 4 years of professional experience in a senior technical SRE role is required, with 6+ years preferred.
Cloud Platforms: Proven experience with Google Cloud Platform (GCP) is essential, including services like GKE, Cloud Run, and BigQuery.
Programming and Scripting: Strong proficiency in Python or Node.js is required for automation, tool development, and general scripting.
Databases: Experience with both relational and NoSQL databases, such as MariaDB, MySQL, and MongoDB, is a key asset.
Containerization and Orchestration: Hands-on experience with Docker and Kubernetes is a must.
CI/CD: Experience with CI/CD tools and building automated pipelines.
Operational Mindset: A proactive approach to problem-solving, with a focus on preventing issues before they occur.
Communication: The ability to articulate complex technical issues and solutions to both technical and non-technical stakeholders.
Job requirements
Excellent troubleshooting and debugging skills.
Strong communication and collaboration skills, able to work in a distributed team.
Ability to work with software vendors to evaluate new tools and technologies.
If this sounds like you, we’d love to meet you!

