
Noon
Site Reliability Engineer Jobs in Dubai, UAE
Job Description
As a Site Reliability Engineer (SRE) at noon payments, you will play a crucial role in maintaining and enhancing the reliability, availability, and performance of our cloud-based infrastructure and services.
You will be responsible for automating deployments, optimizing systems, and ensuring seamless performance across our platforms. This position requires a strong foundation in cloud infrastructure management, particularly with Azure (AKS) and GCP (GKE), alongside hands-on experience with Azure DevOps and monitoring tools like Datadog.
You will:
- Cloud Infrastructure Management: Manage and optimize cloud environments across Azure and GCP, including AKS and GKE, ensuring efficient resource utilization, high system availability, and scalability.
- Infrastructure as Code: Utilize Terraform for infrastructure provisioning, ensuring consistent and scalable deployments, and managing infrastructure via Azure DevOps pipelines.
- Configuration Management: Implement and manage system configurations using Ansible to ensure consistency and streamline updates across different environments.
- Continuous Integration/Continuous Deployment (CI/CD): Develop, maintain, and optimize CI/CD pipelines within Azure DevOps to automate testing and deployment processes, reducing time from development to production.
- Monitoring and Observability: Set up and maintain comprehensive monitoring and observability solutions using Datadog to track system health and performance and to detect issues proactively.
- Container Orchestration: Deploy, manage, and optimize Kubernetes clusters to support scalable and resilient application deployments.
- Incident Management: Participate in a 24/7 on-call or roster-based team to respond to incidents, conduct root cause analysis, and implement solutions to minimize downtime and ensure system reliability.
- Performance Tuning: Continuously monitor system performance, identify bottlenecks, and implement optimizations to improve efficiency and response times.
- Capacity Planning: Plan and manage system capacity to ensure resources meet current and future demands, enabling seamless service delivery.
- Collaboration: Work closely with Network Operations Center (NOC) and DevOps teams to troubleshoot issues, optimize deployment processes, and drive continuous improvement.
- Documentation: Create and maintain detailed documentation for system configurations, deployment processes, and incident reports.
Skill Requirements
- Bachelor’s degree in Computer Science, Information Technology, or a related discipline, or equivalent experience.
- Cloud, ITIL, and CKA certifications are a plus.
- 6+ years of directly related or relevant experience, preferably in information security.
- Extensive experience with cloud platforms such as Azure, GCP, and Huawei Cloud.
- Proficiency with Terraform for infrastructure automation and Ansible for configuration management.
- Hands-on experience with Kubernetes for container orchestration, mainly AKS and GKE.
- Expertise in monitoring and observability tools such as Datadog.
- Familiarity with Azure VMSS and GCP MIG for virtual machine scaling and management.
- Experience in a 24/7 on-call or roster-based team environment, focusing on system uptime and incident response.
- Strong understanding of SRE processes and best practices for system reliability, availability, and performance.
- Excellent problem-solving skills and the ability to handle complex technical issues under pressure.
- Effective communication skills and a collaborative approach to working with diverse teams.
- Experience with payment gateway projects or similar high-transaction systems is preferred.
- Additional knowledge in advanced monitoring techniques, performance tuning, and capacity planning is a plus.
To apply for this job, please visit www.linkedin.com.