Roles & responsibilities:
- Designing, developing, and maintaining scalable and robust platform solutions to ensure the reliability and efficiency of our systems, with a focus on tracking Service Level Indicators (SLIs) and meeting Service Level Objectives (SLOs).
- Collaborating with cross-functional teams to define, design, and ship new features, ensuring seamless integration and deployment while maintaining high uptime and reliability.
- Ensuring the performance, quality, and responsiveness of applications by continuously monitoring and optimizing system performance and staying within error budgets.
- Identifying and correcting bottlenecks and fixing bugs to maintain system stability and reliability, with a focus on minimizing Mean Time to Recovery (MTTR) and Mean Time to Detect (MTTD).
- Maintaining code quality, organization, and automation to streamline development processes and enhance productivity, ensuring that automated processes support high availability and reliability.
- Developing and maintaining observability solutions to monitor system health and performance.
- Mentoring and guiding junior engineers in best practices and technical skills.
- Leveraging Kubernetes for container orchestration and management in managed or self-hosted clusters.
Essential skills:
- Extensive knowledge and experience in Platform Engineering.
- Good exposure to Network Engineering and Software Development.
- Proficiency in programming languages such as Python, Go, or Java.
- Experience with microservice architecture and API design.
- Strong understanding of AWS and of data structures. Experience working with private and public cloud-based applications is a must.
- Experience working with high-volume, high-frequency streaming pipelines built on technologies such as Apache Kafka, Amazon Kinesis, and Apache Flink.
- Familiarity with tools such as Jira, Confluence, Docker, and Kubernetes/EKS.
- Experience with source control and CI/CD tooling such as GitHub, GitHub Actions, and Artifactory.
- Knowledge of monitoring and security scanning tools such as Prometheus, Qualys, Grafana, AppDynamics, Observe, and Splunk.
- Experience with Site Reliability Engineering (SRE) practices.
- Strong skills in observability, including monitoring, logging, and tracing.
Educational qualification: Bachelor’s or Master’s degree in Engineering (Computer Science or Information Technology).