Site Reliability Engineer - Lisboa

Lisboa, Lisboa, Portugal

About the Role

We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our team. The ideal candidate will have strong expertise in managing large-scale distributed systems, driving automation, and ensuring system reliability, performance, and scalability. You will play a critical role in monitoring, diagnosing, and improving our infrastructure while collaborating with cross-functional teams to deliver robust and resilient solutions.

Responsibilities

• Design, implement, and manage reliable, scalable, and secure distributed systems in production.

• Monitor system performance, diagnose complex issues, and ensure quick recovery during outages.

• Drive observability practices by building dashboards, alerts, and monitoring solutions using Splunk, Dynatrace (or equivalent APM tools), and other monitoring platforms.

• Collaborate with development, QA, and operations teams to improve system architecture, CI/CD pipelines, and automation.

• Lead root cause analysis (RCA) of incidents and recommend improvements to enhance resilience and performance.

• Implement best practices for disaster recovery, graceful degradation, and capacity planning.

• Conduct performance, load, and redline testing to proactively identify bottlenecks and scalability issues.

• Contribute to infrastructure and configuration management processes and support version control best practices (Git).

• Act as a subject matter expert for SRE methodologies, tools, and industry best practices.

Technical Skills & Experience

• Experience: At least 7 years managing, diagnosing, and debugging large-scale distributed systems in production.

• Core Systems Expertise: Web servers, relational and non-relational databases, caching, Pub/Sub systems, containers (Docker/Kubernetes), resiliency & disaster recovery mechanisms.

• Monitoring & Observability:

o At least 3 years with Splunk

o At least 3 years with Dynatrace (or equivalent APM tools such as AppDynamics, New Relic, Datadog)

o Strong in RCA, system health monitoring, and end-to-end observability

• Automation & DevOps Tools:

o Proficiency with Cloud Foundry CLI, Jenkins, Splunk, Dynatrace

o Strong experience with CI/CD automation pipelines (Jenkins)

o Knowledge of infrastructure & configuration management and version control (Git)

• Performance Engineering: Hands-on with load, performance, and capacity testing (including redline testing).

• Strong ability to communicate methodologies, tools, and analysis techniques confidently to both technical and business stakeholders.

Qualifications

• Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (or equivalent practical experience).

• Proven track record in SRE, DevOps, or Infrastructure Engineering roles in large-scale distributed environments.

• Strong problem-solving, analytical, and communication skills.

• Ability to work in a fast-paced, collaborative environment and manage competing priorities.

Nice to Have

• Experience with other cloud platforms (AWS, Azure, GCP).

• Familiarity with container orchestration tools (Kubernetes, OpenShift).

• Exposure to Infrastructure as Code (Terraform, Ansible, or similar).

• Prior experience in financial services or other highly regulated industries.

Job Posting Details

Company:	Moofwd, Inc.
Location:	Lisboa, Lisboa, Portugal
Published:	25 September 2025 (current opening)
