Service Reliability Engineer (SRE)

Summary

Our team scales, automates and ensures the reliability of diverse services running on over 500 nodes. Hosted Graphite processes billions of events each day, and continues to scale up fast. We’re looking for engineers with a strong bias towards making computers do as much of the hard work as possible. More automation = more sleep.

You

Automation is your thing. You don’t want to just maintain systems, but write the software that maintains the systems. You’re into Linux and are not afraid to dig deep when troubleshooting application, system and network problems. You care about distributed systems and like making them scale. You usually have ideas about how to improve things and want to work in an environment where you can make that happen.

We don’t really care about your level of formal education, mathematical skill and so on. We want to see that you have relevant experience, that you like automating away repetitive work, that you have good attention to detail and an aptitude for learning new skills, and that you have empathy for your team-mates and our customers.

Examples of typical things our team has worked on:

  • Overhauling our internal monitoring and alerting with a focus on minimising false positives.
  • Automating our deployment process so that deployments are fast, consistent and do not require SRE intervention.
  • Building automation around our load balancing layer to dynamically shape the traffic flow based on load and other factors.
  • Securing our internal traffic by building a full-mesh IPsec network across our server fleet (and building the tooling around it!).

Us

We help companies monitor their servers and applications, allowing them to focus on what they do best and rely on us to keep their data safe and notify them when things start going wrong. The SRE team is responsible for ensuring all our systems (from our data ingestion pipeline to our frontend) scale to meet demand, and for automating our way around everything that stands in our way.

Most of our application code and automation is written in Python, and we rely on open source projects like Riak, HAProxy, Redis and ZooKeeper to help us achieve our goals, but we are not afraid to write our own tools when it makes sense to do so.

Being in the on-call rotation is part of the job. That usually means not being more than a few minutes from an internet connection. Sometimes it means getting woken up by a phone at 4am. We have weeks go by with zero incidents, and other weeks with several. On-call always sucks in any job so we’re very interested in making it suck as little as possible. Everyone in the company is very supportive of this goal, all the way up to the founders who’ve been there and done it.

One significant way we make on-call suck less is that every on-call shift ends on a Thursday evening with the following Friday off, giving you a three-day weekend that’s not counted against your holiday allowance. We want relaxed, well-rested ops people.

Location and hours

While we’re a partially remote team, our office is in Dublin, Ireland, and we’d like you to be based there. We have a bright, spacious office on South William St with views over the city centre and many good lunch and transport options nearby.

  • Our working hours are typically 10:00-18:00, but they vary by person.
  • Once you’ve settled in you’ll have the opportunity to work from home regularly.

Compensation

A fair salary. What’s a fair salary? Well, that’s hard to say, and highly dependent on your skills and how your skills fit in with what the team is already good at, or not good at. There’s a wide range of salaries we’d consider for this position, but publishing actual figures wouldn’t help much because no two SREs are alike or, strictly speaking, directly comparable. Maybe more time off means more to you. Maybe our human- and health-oriented work culture here is meaningful for you, or maybe a high base salary is more important for you. Let’s talk about this. (And, don’t forget, we said it would be a fair salary, so here’s a good example – we teach our staff how to ask for the salary increases they deserve – so hopefully you’ll believe us.)

  • 25 days of paid holiday, one day off after every on-call shift plus the usual 9 public holidays.
  • Health insurance for you and your family.
  • Since you’ll be on-call, we’ll pay your phone bill. We also provide a company laptop, typically a MacBook Air, but the brand/model is up for discussion.
  • Free coffee and snacks in the office!
  • Occasional free lunch, nights out and evening meals with the team.

How to apply

To apply, please email jobs@hostedgraphite.com with your CV attached (PDF and TXT files strongly preferred, please) and tell us about yourself. Please tell us why the job is interesting for you, and why your skills and experience make you a good fit. There’s no recruitment company or keyword-filtering tech in between us mangling what you write. We read every application carefully and we’ll try to get back to you quickly.