Site Reliability Engineer

Starling Bank
06 Sep 2017
19 Sep 2017
Our SRE team proactively ensures the stability, resilience and scale of our services through automation, testing and engineering. We build on expertise from systems / operations (OS & DB), cloud infrastructure (AWS), pipeline / release engineering (TeamCity), software development and stress / load testing to make sure our services are available 24 hours a day, seven days a week.

We're looking for engineers with a passion for infrastructure and delivery to join the team, who are equally happy:

- working with developers to ensure a principled approach to delivering change in a safe and secure way
- working with third parties to ensure our comms are reliable
- working with other SREs to hit our service level objectives and prove our systems and environments

The ideal candidate will strive for continual improvement by contributing and assessing new ideas and innovations to meet short-term and longer-term goals, whilst at the same time accepting responsibility for the day-to-day health of our environments.

Responsibilities

You will work in our SRE team, or embedded in our engineering teams, to deliver our SRE mission:

- Change management and delivery pipeline into production
- Ensure safety, predictability, repeatability and auditability of all build and deploy processes
- Enabling ownership by platform and application engineers of tech-specific build plans
- Enabling maximum velocity without violating service level objectives
- Monitoring, alerting, SLO tracking
- To proactively manage delivery of service level objectives
- Detection / early warning / self-heal
- On-call management
- Facilitate emergency / incident response
- Create, maintain and test for recovery (backup & restore, infra automation etc.)
- Provisioning / automating deployment infrastructure
- Demand forecasting and capacity management
- Efficiency and cost management
- Performance and scalability of the services
- Ownership of some cross-cutting implementation like logs / metrics infrastructure
- Automation of security checks, break-glass procedures, etc.
- Provide a level of audit and control to security personnel

Skills

Networking, Java, Change Management, Linux, Monitoring, EC2, JVM, Continuous Delivery, S3, CloudFormation, ELB, Disk I/O, SLO tracking, ASG

Requirements

The ideal candidate will have some or all of:

- Software development experience: ideally Java / JVM but not essential; JavaScript, Python, Bash all beneficial
- AWS expertise; familiarity with core services (S3, EC2, ELB, ASG) and CloudFormation
- Good understanding of traditional ops areas of expertise: Linux, Disk I/O, Networking, VPNs
- Good familiarity with Docker and the container ecosystem
- Continuous delivery: principles and pragmatics of dealing with build pipelines, artefact repositories, zero-downtime deployment and so on
- Proving resilience via failure injection (Chaos Monkey), scalability via load and stress testing
- Experience with any of the following: CoreOS, ELK, Prometheus, Elasticsearch, PostgreSQL, PagerDuty, Gatling, JMeter, Kubernetes
- Some understanding of iOS or Android also beneficial
- Sensitivity to (but also boldness to influence) culture and behaviour across an organisation

Our Benefits

- Ownership of your work
- Amazing teammates
- Offices near Liverpool St (complete with table football and Mario Kart)
- Breakfast and Lunch club (office lunches on Wednesdays, breakfasts on Fridays. There's always something for everyone: veggies, vegans and the allergy-prone, we've got you.)
- Friday beers (no start-up worth its salt would ever dream of ending the week without a cold one.)
- Access to a team Amazon account
- MacBooks all round
- 25 days of holiday
- Potential for equity incentives