DevOps Engineer
DevOps Support Engineer
100% Remote – Western Europe/Portugal/UK
12+ Month Long-Term Contract
DevOps Team Service Reliability Engineer
I. Description of Services and Milestones
The Service Reliability Engineer (SRE) consultant will provide day-to-day support, monitoring, troubleshooting, and fixing of issues to ensure the reliability and performance of the MD3 infrastructure. They will work alongside and under the technical direction of Lilly staff and will be located in the Eastern time zone.
Scope of responsibilities:
Monitoring and Support:
- Continuously monitor the health and performance of the MD3 infrastructure (data observations, HPC, LiveDesign tasks).
- Utilize monitoring tools (ServiceNow, Splunk, Grafana) to detect and respond to incidents in real time.
- Perform regular job queue checks and maintenance activities to ensure optimal performance.
- Monitor the MD3 dashboard and community chats/channels for any issues or alerts.
Troubleshooting and Fixing:
- Diagnose, troubleshoot, and potentially resolve technical issues related to the MD3 infrastructure.
- Collaborate with DevOps engineers and other technical teams to address and fix incidents.
- Document and communicate the root cause of incidents and the steps taken to resolve them.
Automation and Improvement:
- Develop and implement automation scripts to streamline monitoring and troubleshooting processes.
- Identify areas for improvement in the infrastructure and propose solutions to enhance reliability and performance.
- Participate in post-incident reviews to identify and address any gaps in the monitoring and support processes.
Collaboration and Communication:
- Work closely with the DevOps team to ensure alignment with business goals and research needs.
- Communicate effectively with stakeholders to provide updates on incidents and resolutions.
- Participate in regular standups and scrums to discuss ongoing issues and progress.
- Build and share weekly reports on the status and performance of the MD3 infrastructure.
Knowledge Management:
- Develop and maintain knowledge articles for the help desk (ServiceNow) and FAQ for users.
- Ensure that all documentation is up-to-date and easily accessible for the support team and end-users.
Service Level Agreements (SLAs):
- Identify and establish SLAs for incidents and problems, based on current ITSM practices.
- Ensure that all incidents and problems are resolved within the defined SLAs.
Performance and Infrastructure Capacity Planning:
- Performance Optimization: Fine-tuning applications and infrastructure to ensure systems meet performance benchmarks.
- Capacity Planning: Anticipating growth needs to scale infrastructure and prevent overutilization or underutilization of resources.
Documentation:
- Create runbooks for critical alerts.
Job details
Company: Aptonet
Location: Bragança, Bragança District, Portugal
Posted: 22 August 2025