At RevenueCat, we make selling subscriptions in your mobile app easy. We launched as part of Y Combinator's summer 2018 batch and today handle more than 10 million mobile subscriptions across thousands of apps. We are a mission-driven, remote-first company that is building the foundation of mobile subscription infrastructure. Top companies like VSCO, Notion, WidgetSmith, Buffer, and Fishbrain count on RevenueCat to power their subscriptions at scale.
Our 30 team members (and growing!) are located all over the world, from San Francisco to Madrid to Taipei, and we're proud to be a remote-first company. We're a close-knit, product-driven team, and we love our core values: Always be Shipping, Own it, Be Customer-Obsessed, and Be Balanced.
This person will be the first member of RevenueCat's Reliability Engineering team. Until now, reliability work has been shared among all the product engineers. Because reliability is paramount to RevenueCat, it's time to start a team focused solely on this area.
We want to bring on board someone who is passionate about reliability, scalability, and understanding the limits of computers and people. We need someone who will help the product engineers learn reliability best practices and processes. This person should be excited about the technical challenges we will face as we grow our API throughput from 400K requests per minute to millions of requests per minute.
Requirements:
- You have 5+ years of experience as a Software or Platform Engineer and are comfortable writing and analyzing code.
- You understand data structures, can investigate incidents, and can differentiate between memory, I/O, and CPU bottlenecks.
- You have experience designing, maintaining and rolling out large and growing distributed systems.
- You are extremely curious and excited about finding out how many more requests we can handle without any downtime.
- You hate manual processes and love to automate all the things and reduce toil.
Preferred but Not Required:
- Experience building and maintaining systems to monitor and improve availability and scalability
- Experience with a container orchestration system (Kubernetes, AWS ECS, Nomad, etc.)
- Great communication skills and eagerness to educate the team about reliability best practices
- Experience with AWS, Terraform and PostgreSQL
- Experience with highly available, high-throughput REST APIs
In the first month, you'll:
- Work with the CTO to learn about our current infrastructure and its evolution
- Work with our product engineers to learn about the new product efforts and their infrastructure needs
- Learn about our product, API, database and what is computationally cheap vs expensive
- Learn about our current practices, alarms, monitoring tools and on-call rotations
In the first three months, you'll:
- Detect our current bottlenecks, risks and single points of failure
- Own and tune our alarms to maintain a healthy signal-to-noise ratio
- Own blameless post-mortem analysis and the coordination of action items
- Manage the on-call rotation
- Help define SLOs
In the first six months, you'll:
- Own risk assessment, disaster planning and response strategies
- Be obsessed with our uptime
- Detect our blind spots and add observability
- Work closely with product engineers to design reliable rollouts of new features. You will contribute to writing and reviewing code as well as participating in architectural discussions.
Within a year, you'll:
- Be the most knowledgeable person in the company about our infrastructure, and the main advocate for building a culture of security and reliability
- Help recruit and build our SRE team
- Educate the whole team about best practices and onboard new engineers to the on-call rotation
- Be involved in the process of building new product features, from the design to rollout, maintenance and scaling
What We Offer:
- $150,000-$170,000 USD + competitive equity across all geographies
- Generous stipend for home workspace
- Comprehensive medical, dental, and vision coverage for US team members
- Matched 401(k) plans for US team members
- Open vacation policy