Manager Notes:
- Experience automating remediation tasks is a must
- Linux Administration
- AWS
- Demonstrated ability to write programs using Java-based technologies or Scala
- Experience scripting in Python or shell
- Must have experience managing a distributed production application stack in AWS, Azure, or Google Cloud
- Docker and Container Orchestration experience
- Hands-on experience with configuration management tools such as Ansible, Chef, or Puppet is a big plus
Site Reliability Engineering is a pivotal role in the success of this project. Our SREs ensure that the platform software and platform automation are robust, reliable, and scalable.
As a Site Reliability Engineer, you will:
- triage and remediate production incidents
- provide engineering-level support for issues reported by users
- work closely with development teams to improve the observability of the system
- aggressively automate remediations for common problems
- build tools to facilitate rapid triage and troubleshooting
- build tools to enable continuous monitoring of production systems
- define, measure, and improve service level objectives