Responsibilities
- Track record of succinctly documenting processes, procedures, and best practices
- Integration expert: able to wire systems together via their APIs
- Fluent coder in Python, Java, Bash, C#, PowerShell or similar
- Comfortable with Linux and Windows
- Experience working on complex 24x7 available distributed systems
- Familiarity with build tools, particularly Jenkins
- Understanding of and experience with build repositories, ideally Artifactory
- Deep knowledge of developer workflows with Git (Bitbucket)
- Experience setting up monitoring using ScienceLogic, New Relic or SolarWinds
- Experience transforming Big Data into Operations Insights, ideally with Splunk
- Comfortable leveraging AI and Machine Learning for Predictive Analysis of Failures and Correlation, ideally with Splunk
- Experience migrating systems between technologies
- Participation in on-call rotation
- Runtime Infrastructure: Docker, Kubernetes, Lambdas
- Storage Systems: DynamoDB, MongoDB, Redis, Hadoop, Memcached
- Messaging: Kafka, Rsyslog, Logstash, Splunk
- Programming/Scripting: Java / Jetty, Python, Scala, PowerShell, C#, Bash
- Build Tools/Repositories: Jenkins, Artifactory
- Web Services Framework: Django, Flask
- API Framework: Gunicorn
Qualifications
- 5+ years of relevant experience focused on site reliability, TechOps, DevOps, systems administration, application development, build, release and deployment
- Experience deploying systems with Kubernetes, Docker, or Azure Containers
- Automation experience with ScienceLogic or Ansible
- Some experience with monitoring and anomaly detection systems
- Hands-on experience with monitoring tools such as New Relic, Splunk, SignalFx, SolarWinds, etc.
- Well versed in ITIL concepts (Event, Incident, Knowledge, Change, Problem Management)
- Familiarity with chatbots such as Slackbot
- End-user knowledge of ITSM tools such as ServiceNow