Site Reliability Engineer - Flutter Functions, Hybrid
About Betfair Romania Development
Betfair Romania Development is the largest technology hub of Flutter Entertainment, with over 2,000 people powering the world’s leading sports betting and iGaming brands. Exciting, immersive and safe experiences are delivered to over 18 million customers worldwide, from our office in Cluj‑Napoca. Driven by relentless innovation and commitment to excellence, we operate our own unbeatable portfolio of diverse proprietary brands such as FanDuel, PokerStars, Sportsbet, Betfair, Paddy Power, or Sky Betting & Gaming.
Our Values
We are looking for passionate individuals who align with our values and are committed to making a difference.
Win together | Raise the bar | Got your back | Own it | Positive impact
About Flutter Functions
The Flutter Functions division is a key component of Flutter Entertainment, responsible for providing essential support and services across the organization. The division encompasses various corporate functions, including finance, legal, human resources, technology, and more, ensuring seamless operations and strategic alignment throughout the company.
Role Overview
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, and performance of Flutter Entertainment's critical gaming and betting platforms across our global operations. This role combines software engineering expertise with operational excellence to maintain 24/7/365 service availability for millions of customers worldwide. As part of the Service Management Function within Flutter Functions, you will collaborate closely with development teams, infrastructure specialists, and business stakeholders to maintain the high-performance, scalable systems that power our iGaming & Sport platforms across multiple markets. Your role will involve implementing automation, monitoring, and incident response procedures to support Flutter's mission of delivering world-class entertainment experiences.
You understand and embrace the philosophy of continuous improvement and have experience of leading teams operating within a CI culture. You don't complain about recurring incidents; you drive process improvements and implement preventative measures to eliminate root causes. You work with internal and external teams to drive best-in-class practices and develop real-world solutions and positive user experiences for every interaction.
This role requires exceptional communication skills, as interaction and engagement with senior management during incident escalations and post-incident reviews will be a regular aspect of the role.
Key Accountabilities & Responsibilities
- Maintain 99.9%+ uptime for critical gaming and betting platforms serving millions of concurrent users
- Design and implement monitoring, alerting, and observability solutions using tools such as Grafana, Splunk & CloudWatch
- Conduct capacity planning and performance optimization to ensure systems can handle peak loads during major sporting events
- Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services with support from Service Management
- Support ProdOps and Service Management teams during P1/P2 incident response, providing technical expertise and facilitating cross-functional coordination to minimize customer impact
- Collaborate with Service Management on post-incident reviews, contributing technical insights and supporting the implementation of preventative measures to reduce repeat occurrences
- Assist in developing and maintaining comprehensive runbooks and incident response procedures in partnership with Service Management teams
- Grafana Stack Management: Design, deploy, and maintain comprehensive Grafana dashboards for real-time system visibility across all Flutter platforms
- Advanced Visualization: Create custom Grafana panels and dashboards for business metrics, technical KPIs, and operational insights tailored to different stakeholder needs
- Multi-Source Data Integration: Configure and optimize Grafana data sources including Prometheus, InfluxDB, Elasticsearch, CloudWatch, and custom APIs
- Alerting Strategy: Implement intelligent alerting rules using Grafana Alerting, reducing alert fatigue while ensuring critical issues are promptly escalated
- Performance Monitoring: Establish application performance monitoring (APM) using Grafana Agent and integrate with existing observability stack
- Custom Metrics Development: Work with development teams to implement custom business and technical metrics that provide actionable insights
- Partner with development teams to improve application reliability and deployment practices
- Mentor junior team members and contribute to the development of SRE practices across Flutter
- Participate in architecture reviews and provide reliability expertise for new system designs
- Document procedures, troubleshooting guides, and system architecture for knowledge sharing
- Look for ways to use AI to triage and investigate alerts, allowing for more rapid resolution
- Use AI to find root causes by connecting the dots between code changes, alerts, and past incidents
- Investigate the use of AI to improve collaboration and identify possible resolutions to incidents
Skills, Capabilities & Experience Required
- Cloud Platforms: Advanced experience with AWS, Azure, or Google Cloud Platform services and architecture
- Containerization: Proficiency with Docker and Kubernetes for container orchestration and management
- Programming: Strong scripting abilities in Python, Go, Bash, or PowerShell; familiarity with Java or .NET advantageous
- Monitoring & Observability: Hands-on experience with Prometheus, Grafana, ELK stack, or similar monitoring solutions
- CI/CD: Proficiency with Jenkins, GitLab CI, Azure DevOps, or similar continuous integration tools
- Database Technologies: Working knowledge of SQL databases (PostgreSQL, MySQL) and NoSQL solutions
- Networking: Understanding of load balancers, CDNs, DNS, and network security principles
Benefits
- Hybrid & remote working options
- €1,000 per year for self-development
- Company share scheme
- 25 days of annual leave per year
- 20 days per year to work abroad
- 5 personal days/year
- Flexible benefits: travel, sports, hobbies
- Extended health, dental and travel insurances
- Customized well-being programmes
- Career growth sessions
- Thousands of online courses through Udemy
- A variety of engaging office events
Disclaimer
We are an inclusive employer. By embracing diverse experiences and perspectives, we create a lasting, positive impact for our employees, customers, and the communities we’re part of. You don't have to meet all the requirements listed to apply for this role. If you need any adjustments to make this role work for you, let us know, and we’ll see how we can accommodate them.
We thank all applicants for their interest; however, only the candidates who best meet the job requirements will be contacted for an interview.
By submitting your application online, you agree that your details will be used to progress your application for employment. If your application is successful, your details will be used to administer your personnel record. If your application is unsuccessful, we will retain your details for a period no longer than three years, to consider you for prospective roles within the company.
Seniority level
- Mid-Senior level
Employment type
- Full-time
Job function
- Engineering and Information Technology
Industries
- Software Development
Detailed information about the job offer
Company: Betfair Romania Development
Location: Porto, Porto District, Portugal
Published: 31.10.2025