Service Reliability Engineer (SRE)

Summary

Our team scales, automates and ensures the reliability of diverse services running on over 500 nodes. We process billions of events each day, and continue to scale up fast. We’re looking for engineers with a strong bias towards making computers do as much of the hard work as possible. More automation = more sleep.

You

Automation is your thing. You don’t just want to maintain systems; you want to write the software that maintains them. You’re into Linux and are not afraid to dig deep when troubleshooting application, system and network problems. You care about distributed systems and like making them scale. You usually have ideas about how to improve things and want to work in an environment where you can make that happen.

We don’t really care about your level of formal education, mathematical skill and so on. We want to see that you have relevant experience, that you like automating away repetitive work, that you have good attention to detail, an aptitude for learning new skills and that you have empathy for your team-mates and our customers.

Examples of things our team has been working on lately:

  • Overhauling our internal monitoring and alerting with a focus on minimising false positives.
  • Automating our deployment process so that deployments are fast, consistent and do not require SRE intervention.
  • Building automation around our load balancing layer to dynamically shape the traffic flow based on load and other factors.
  • Securing our internal traffic by building a full-mesh IPsec network across our server fleet (and building the tooling around it!).

Us

We help companies monitor their servers and applications, allowing them to focus on what they do best and rely on us to keep their data safe and notify them when things start going wrong. The SRE team is responsible for ensuring all our systems (from our data ingestion pipeline to our frontend) scale to meet demand, and for automating our way around everything that stands in our way.

Most of our application code and automation is written in Python, and we rely on open source projects like Riak, HAProxy, Redis and ZooKeeper to help us achieve our goals, but we are not afraid to write our own tools when it makes sense to do so.

Being in the on-call rotation is part of the job. That usually means not being more than a few minutes from an internet connection. Sometimes it means getting woken up by a phone at 4am. We have weeks go by with zero incidents, and other weeks with several.

On-call always sucks in any job so we’re very interested in making it suck as little as possible. Everyone in the company is very supportive of this goal, all the way up to the founders who’ve been there and done it.

One significant way we make on-call suck less is that every on-call shift ends on a Friday morning with the rest of the day off, giving you a three-day weekend that’s not counted against your holiday allowance.

We want relaxed, well-rested ops people.

Location and hours

While we’re a partially remote team, our office is in Dublin, Ireland and we’d like you to be based there. We have a bright, spacious office on South William St with views over the city centre and many good lunch and transport options nearby.

Our working hours are typically 10:00-18:00, but this varies by person.

Once you’ve settled in you’ll have the opportunity to work from home regularly.

Compensation

  • A competitive salary.
  • 25 days of paid holiday, one day off after every on-call shift, plus the usual 9 public holidays.
  • Health insurance for you and your family.
  • Since you’ll be on-call, we’ll pay your phone bill. We also provide a company laptop, typically a MacBook Air, but the brand/model is up for discussion.
  • Free coffee and snacks in the office!
  • Occasional free lunch, nights out and evening meals with the team.

Tell jobs@hostedgraphite.com why your skills, experience and personality make you a good fit. If you want to submit a CV, make sure it’s TXT or PDF.