Cloud DevOps / Site Reliability Engineer

Synergia - Egypt - Cairo

Experience: 3 to 6 years

Salary: Confidential

Employment: Full Time

Education: Not Specified

We are looking for a DevOps / Site Reliability Engineer (SRE) to join the team of one of our clients, an AI-driven SaaS startup developing a collaborative research platform. You will play a key role in ensuring system availability, performance, and cost-effectiveness while improving our CI/CD pipelines and cloud infrastructure.

Responsibilities:

Automation & Deployment: Develop and maintain CI/CD pipelines and automate software testing and delivery to ensure fast, reliable, and consistent deployments. Streamline updates, reduce downtime, and enhance system stability.
Infrastructure & Cloud Management: Deploy, maintain, and optimize cloud resources (Azure, AWS, or GCP), ensuring cost efficiency and security compliance.
Monitoring & Observability: Build and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, New Relic, Datadog, etc.) to track performance, identify bottlenecks, and enhance system reliability.
Incident Response & Troubleshooting: Analyze system failures, manage incident response, and implement long-term solutions to improve uptime and resiliency.
Security & Compliance: Rotate secrets, upgrade components, and follow cloud vendor security best practices.
Collaboration & Best Practices: Work with developers, QA, and product teams to ensure seamless releases, performance optimizations, and operational excellence.
Disaster Recovery & Scalability: Develop recovery plans and auto-scaling rules to maintain system performance under varying loads.

Required Skills & Experience:

At least 3 years of experience in DevOps/Site Reliability Engineering.
Strong experience with CI/CD tools (Azure Pipelines, Jenkins, CircleCI, etc.).
Proficiency in containerization and orchestration (Docker, Kubernetes, Helm).
Experience with cloud platforms (Azure, AWS, or GCP) and cost optimization strategies.
Strong knowledge of monitoring and logging tools (Prometheus, Grafana, ELK, Datadog, New Relic).
Solid scripting/programming skills in Python, Bash, or Go for automation and troubleshooting.
Experience with incident management, scaling strategies, and disaster recovery planning.
Strong problem-solving skills and the ability to work in a collaborative, fast-paced environment.
Bachelor's degree in computer science or a related field.

Operational & Security Knowledge:

Cloud Security Best Practices: Secret management, vulnerability patching.
Cost Optimization Strategies: Resource scaling, rightsizing, avoiding over-provisioning.
Disaster Recovery & High Availability: Backup strategies, failover mechanisms.
Service Level Management: SLAs, SLOs, SLI monitoring.

Experience with GitOps and service meshes (Istio, Linkerd, Consul).
Knowledge of serverless architectures and performance tuning.

