Before DevOps was born, Google had a problem and didn’t know how to fix it. The company was running large sites but needed to improve them and scale them even further. Its solution? Google tapped a team of software engineers to figure it out, and from their efforts came the foundation of Site Reliability Engineering (SRE). Today the software giant defines SRE as “what you get when you treat operations as if it’s a software problem.”
SRE practices proved so beneficial that other large companies adopted them and, over time, enhanced and added to them. The result is a career field that shares many traits with today’s DevOps but with a few important distinctions. While both sit between development and operations, SRE focuses more on automation. Indeed, Google once described the engineer’s purpose as being to “automate their way out of a job.”
Different organizations do SRE differently and may call it Production Engineering or Infrastructure Engineering instead. Whatever it’s labeled, at the end of the day it’s the engineer’s job to be a team player who works continuously to improve website reliability, track incident management KPIs (Key Performance Indicators), write code, build services, and automate manual processes. Since sites stay up 24 hours a day, SREs often work on-call to respond whenever they’re needed.
- Working with a big picture overview of projects
- Serving as a vital bridge between teams
- Improving processes and helping boost organizational profits
- Generous financial compensation
SRE is a well-compensated career field, so expect to earn those salaries by putting in full-time hours! As ParkMyCloud explains it, site reliability essentially equates to business availability. In other words, it’s up to Site Reliability Engineers to minimize costly downtime. That can translate into working after hours or being on-call to respond rapidly to issues.
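To make the link between availability and downtime concrete, here is a minimal Python sketch that converts an availability target into the downtime it permits. The function name `allowed_downtime_minutes`, the 30-day window, and the sample targets are illustrative assumptions, not figures from ParkMyCloud or any particular employer.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over `days` at a given availability target."""
    total_minutes = days * 24 * 60  # minutes in the measurement window
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical targets, chosen only to show how quickly the budget shrinks.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is one reason on-call response times tend to be short.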
- Creating or improving software related to operations and support
- Optimizing and automating processes
- Ensuring consistent release engineering practices
- Addressing and minimizing support escalation
- Capturing and documenting newly learned information for later reference, such as by creating runbooks, and preventing “siloing” or hoarding of sharable knowledge
- Troubleshooting issues
- Conducting incident reviews (also known as postmortems, retrospectives, or root cause analysis) to determine why a problem occurred without placing blame
- Working on-call for troubleshooting and other incident response issues
- Ensuring compliance with organizational protocols
- Creating action item lists to address problems and mitigate future similar issues within the Software Development Life Cycle
- Ability to facilitate collaboration between teams
- Analytical problem-solving
- Attention to detail
- Customer service
- Highly organized; good time management skills
- Investigative and inquisitive
- Leadership and management skills
- Quality assurance mindset
- Strong communication skills
SREs are required to have several skill sets related to the following:
- Build automation tools
- Build configuration languages
- Distributed systems design
- Domain knowledge related to system administration, development, configuration management, and integration testing
- General source code management
- Operating systems
- Package managers
- Software engineering
- Computer systems design agencies
- Governmental/Military agencies
- Higher education institutions
- Media and entertainment
If an organization has a site or sites so important that they need a Site Reliability Engineer, then expectations are going to run high. According to Netguru, the four main reasons to hire an SRE are to minimize downtime, anticipate and mitigate risks, achieve faster development, and save money through those and other implemented processes. Clearly, Site Reliability Engineers have their hands full, and while they’re juggling the workload they must also keep ahead of changes in the IT world.
Hours can get long when problems occur, not to mention on-call rotations, which mean that even when you’re off, you’re still technically on. Incident response times can be short, and every employer is different when it comes to compensating for work done after hours. Some may grant Paid Time Off, some might give extra pay, and some might offer a hearty “thanks very much” and nothing else.
SRE is still a relatively new concept for many growing organizations. As a result, one trend is that businesses are still trying to figure out how best to manage it. A major factor driving the push for Site Reliability Engineering is incident resolution, suggesting that companies are simply getting tired of putting out fires and want to get a better handle on them.
Of course, this shifts stress from management onto the SREs. This, in turn, can require employers to find ways to keep those stressed-out workers healthy and well, so the workforce can operate at peak efficiency. Some companies do this better than others, but the trend is to recognize the value of taking care of busy workers who are taking care of business!
The name “Site Reliability Engineer” gives us a few clues about the type of people who work in this field. They enjoy working on websites, an interest most SREs developed in their youth. They’re responsible for ensuring sites are “reliable,” meaning everything works how it should when it should. Thus, workers themselves should be reliable, which is another characteristic often honed in one’s early years.
Such persons like to be punctual and prepared and likely did well academically. Indeed, to be an engineer of any type usually requires strong academic aptitudes, particularly in math and science, of course. One of the interesting things about this field, though, is how many soft skills come into play.
An SRE needs to be a people person, someone comfortable working with teams and able to foster collaboration between those teams. As a result, they may have held leadership positions in school, or perhaps simply had a lot of siblings to contend with! SREs are efficiency experts, trained to find ways to make things better by studying problems and determining solutions based on their research. This requires a creative yet analytical mindset, as if both hemispheres of the brain are working in tandem. It’s possible many SREs are ambidextrous or adept at playing musical instruments.
- Site Reliability Engineers need a bachelor's degree, preferably in Computer Science or a related area
- There isn’t a set path to becoming an SRE. Some workers enter through an internship; others might do a bootcamp, then build experience in other IT jobs while practicing additional skills on their own
- Ample work experience is a key requirement of most employers (many SRE employees first work in DevOps, sysadmin, or as developers or software engineers)
- Classes to become familiar with Java, Python, Ruby, or C++, as well as Linux, Kubernetes, and MySQL
- Courses to build soft skills in English, writing, speaking, teamwork, and leadership
- Optional certifications include:
  - American Society for Quality’s Reliability Engineer Certification
  - DevOps Institute’s SRE Foundation Certification
  - CompTIA’s Linux+ Certification
- Learn on your own by taking courses on:
  - edX - Introduction to DevOps and Site Reliability Engineering
  - Lynda (from LinkedIn) - DevOps Foundations: Site Reliability Engineering
  - Udemy - An Introduction to Reliability Engineering
  - Coursera - Site Reliability Engineering: Measuring and Managing Reliability
    - Note: the same course is also offered at Pluralsight
- Much of what you’ll need to know to be a successful Site Reliability Engineer will be learned outside of your college program!
- Ideally, look for programs offering courses in the areas listed above
- Read faculty bios to see what their areas of expertise and backgrounds are
- What types of student clubs and organizations are available? Many soft and technical skills are most effectively learned through ample peer interactions
- Ensure the school is accredited
- Look for programs that publish post-graduation job stats and have a solid track record
- Weigh the pros and cons of enrolling in an online program. On-campus engagement is very helpful for building soft skills, so a hybrid program can sometimes be a good compromise
U.S. News & World Report’s Best Computer Science Programs can help you get started, but don’t rely only on one ranking. You don’t want to miss out on good opportunities, so we recommend considering lists such as Great Value College’s 50 Great Affordable Colleges for Computer Science and Engineering for 2020 or Best Value School’s Top 25 Computer Science Programs With the Best Return on Investment.
College can get outrageously expensive, but keep in mind that many employers are very practical. They may be more interested in your hard technical skills than which school you graduated from. In other words, simply having a degree from a costly private college isn’t going to guarantee a job in this line of work. Focus on taking specific classes needed to build skills, and get as much hands-on experience as possible.
- As mentioned, there’s no single path to becoming an SRE, so map out a few options
- Look at job postings from companies you’d like to work for. Pay attention to required work and academic experiences, then reverse-engineer a career path to get there
- In high school, build a solid foundation by taking as many IT electives as possible
- Get as much hands-on skills practice as you can! Take courses related to the items in our Education and Training section above
- Don’t forget to work on your writing. Technical writing is important, but you’ll also need to translate complex topics into layman’s terms
- SREs need good teamwork and leadership skills. These are often neglected traits you’ll be expected to have later, so look for ways to develop them early on
- Nothing beats having an experienced mentor, so reach out to alumni or faculty for advice
- Teach others. Teaching facilitates new learning experiences for both parties
- Read and join discussions on Quora, Reddit, Dev.to, and other sites
- When your skills are good enough, get some paid experience on Upwork
- Find internships on Indeed, or ask your college program if they have opportunities
- Be a leader in IT-related clubs, and build a vast network of peers and associates!
- Put the word out! The majority of jobs are now found through networking
- Take the TripleByte DevOps screening test. If you pass, you will get an interview with employers in their network.
- Look for openings on Indeed, Monster, USAJobs, ZipRecruiter, LinkedIn, and Glassdoor
- Find out what employers look for! USENIX has a downloadable PDF listing insider tips on hiring SREs
- Some employers train their SREs internally, so you may want to start out in one job but with a plan to work your way up within the company
- Get an internship. They don’t always pay well but you’ll get your foot in the door and they can lead to full-time jobs
- The jury is out on how useful job fairs are, but industry-specific fairs can certainly give you some exposure to what opportunities exist and offer a chance to chat with workers
- Have your resume in order. Job Hero has some great Site Reliability Engineer resume templates to steal ideas from
- Bring in a professional resume writer (or editor) to punch up your doc and make it the best it can be. But remember, tailor each resume to the specific job you’re applying for
- Study GitHub’s massive database of resources and interview questions!
- A lot depends on the size of the organization. Some companies promote from within; others might want external candidates. Discuss promotion opportunities with your supervisor early on
- Be proactive. Train yourself, take courses, keep learning. When there’s a new trend in technology, find out everything you can about it and be a subject matter expert
- Display loyalty to your company and become a trusted, invaluable asset worthy of increased responsibility. Behave in a manner that indicates you’re ready to advance
- Always remember the soft skills. Even the most technically-skilled employee will have a hard time moving up if they don’t get along well with others
- Be a boss. Show your competency and leadership potential. An SRE must be able to direct others in a collaborative but decisive (and when needed, firm) fashion
- Prove you are reliable. Be punctual, and if you’re on-call respond to the incident quickly, perform the work diligently, and find ways to mitigate future similar problems
- Advanced Bash-Scripting
- Awesome Python
- Beej’s Guide to Network Programming
- Command Challenge
- Cyber Aces
- DevOps BootCamp
- Eli the Computer Guy
- Git Immersion
- Intro to SQL: Querying and managing data
- MIT’s Operating System Engineering
- MongoDB University
- Ops School
- Over the Wire
- Puppet Learning
- SRE Weekly
- Sysadmin Casts
- The Big Blog Post of Information Security Training Materials
- The Geek Stuff
- The Google SRE Book
- The Open Guide to Amazon Web Services
- The System Design Primer
- The Unix Workbench
- Unix Toolbox
- Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems, by Heather Adkins, Betsy Beyer, et al.
- Operating Systems: Three Easy Pieces, by Remzi Arpaci-Dusseau and Andrea Arpaci-Dusseau
- Practical Site Reliability Engineering, by Pethuru Raj Chelliah, Shreyash Naithani, et al.
- Site Reliability Engineering: How Google Runs Production Systems, by Niall Richard Murphy, et al.
- The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win, by Gene Kim, Kevin Behr, et al.
Site Reliability Engineering can be a thrilling career field with a ton of responsibility. However, the path to breaking in is not always cut-and-dried. Many people start off in other areas, and sometimes they end up staying in those areas. A few “Plan B” job options include:
- Back-End Developer
- Computer and Information Systems Manager
- Computer Programmer
- Computer Support Specialist
- Computer Systems Analyst
- Database Administrator
- Front-End Developer
- Full-Stack Developer
- Information Security Analyst