Site Reliability Engineer
About the Role
We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our
team. The ideal candidate will have strong expertise in managing large-scale distributed
systems, driving automation, and ensuring system reliability, performance, and scalability. You
will play a critical role in monitoring, diagnosing, and improving our infrastructure while
collaborating with cross-functional teams to deliver robust and resilient solutions.
Responsibilities
• Design, implement, and manage reliable, scalable, and secure distributed systems in
production.
• Monitor system performance, diagnose complex issues, and ensure quick recovery
during outages.
• Drive observability practices by building dashboards, alerts, and monitoring solutions
using Splunk, Dynatrace (or equivalent APM tools), and other monitoring platforms.
• Collaborate with development, QA, and operations teams to improve system
architecture, CI/CD pipelines, and automation.
• Lead root cause analysis (RCA) of incidents and recommend improvements to enhance
resilience and performance.
• Implement best practices for disaster recovery, graceful degradation, and capacity
planning.
• Conduct performance, load, and redline testing to proactively identify bottlenecks and
scalability issues.
• Contribute to infrastructure and configuration management processes and support
version control best practices (Git).
• Act as a subject matter expert for SRE methodologies, tools, and industry best
practices.
Technical Skills & Experience
• Experience: Minimum 7 years managing, diagnosing, and debugging large-scale
distributed systems in production.
• Core Systems Expertise: Web servers, relational and non-relational databases,
caching, Pub/Sub systems, containers (Docker/Kubernetes), resiliency & disaster
recovery mechanisms.
• Monitoring & Observability:
o At least 3 years with Splunk
o At least 3 years with Dynatrace (or equivalent APM tools such as
AppDynamics, New Relic, Datadog)
o Strong in RCA, system health monitoring, and end-to-end observability.
• Automation & DevOps Tools:
o Proficiency with Cloud Foundry CLI, Jenkins, Splunk, Dynatrace
o Strong experience with CI/CD automation pipelines (Jenkins)
o Knowledge of infrastructure & configuration management and version control
(Git).
• Performance Engineering: Hands-on with load, performance, and capacity testing
(including redline testing).
• Strong ability to communicate methodologies, tools, and analysis techniques
confidently to both technical and business stakeholders.
Qualifications
• Bachelor’s or Master’s degree in Computer Science, Information Technology, or related
field (or equivalent practical experience).
• Proven track record in SRE, DevOps, or Infrastructure Engineering roles in
large-scale distributed environments.
• Strong problem-solving, analytical, and communication skills.
• Ability to work in a fast-paced, collaborative environment and manage competing
priorities.
Nice to Have
• Experience with other cloud platforms (AWS, Azure, GCP).
• Familiarity with container orchestration tools (Kubernetes, OpenShift).
• Exposure to Infrastructure as Code (Terraform, Ansible, or similar).
• Prior experience in financial services or other highly regulated industries.
Job Details
Company: Moofwd, Inc.
Location: Lisbon, Portugal
Published: 25 September 2025