Lead Site Reliability Engineer - DevOps
We are looking for a Lead Site Reliability Engineer to enhance a global execution platform, delivering robust solutions to trading desks and clients.
You will collaborate with expert teams, advancing your expertise in system administration, monitoring, and low‑latency technologies. Join us to contribute to cutting‑edge financial technology innovations.
Note that working on‑site at the client's Lisbon office for 2-3 days per week is required.
Responsibilities
- Design and enforce monitoring, alerting, and incident management strategies
- Automate repetitive tasks and workflows to increase operational efficiency
- Work alongside software engineering teams to build and launch scalable, dependable systems
- Execute production deployments carefully to preserve platform stability
- Handle incident management with thorough analysis and reporting to maintain service quality
- Engage in on‑call duties to support essential systems and services
- Communicate clearly with colleagues to swiftly resolve technical problems
- Maintain up‑to‑date documentation for operational workflows and system settings
- Drive continuous improvements in system reliability and efficiency through proactive initiatives
Requirements
- Deep understanding of Unix/Linux operating systems and networking, with over 5 years of experience
- Proficiency in Unix/Linux shell scripting and programming languages including Python, Perl, C, C++, or Java
- Experience with monitoring and observability solutions such as ITRS Geneos, Dynatrace, Prometheus, and Grafana
- Strong troubleshooting skills for complex system issues
- Experience in environments with high availability and heavy traffic
- Bachelor’s or Master’s degree in IT engineering or a related discipline
- Ability to collaborate effectively within a team and adapt to evolving environments
- Self‑driven with excellent problem‑solving capabilities and thorough issue tracking
- Excellent written and verbal communication abilities with English proficiency at B2+ level
Nice to have
- Familiarity with log analysis tools like Splunk, ELK, Graylog, or Loki
- Knowledge of network monitoring solutions such as Corvil
- Experience with relational databases including Oracle, PostgreSQL, MySQL/MariaDB, or KDB/q
- Understanding of messaging platforms like IBM MQ, Tibco, Solace, LBM, or Kafka
- Experience with Infrastructure as Code tools such as Ansible or Terraform
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award‑winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
- Mid‑Senior level
Employment type
- Full‑time
Job function
- Engineering, Information Technology, and Business Development
Industries
- Software Development, IT Services and IT Consulting, and Banking
Job details
- Company: EPAM Systems
- Location: Lisbon, Lisbon, Portugal
- Published: 31 October 2025