Site Reliability Engineer

Date: Apr 25, 2024

Location: IN

Company: Responsive

About Responsive

Responsive (formerly RFPIO) is the global leader in strategic response management software, transforming how organizations share and exchange critical information. The AI-powered Responsive Platform is purpose-built to manage responses at scale, empowering companies across the world to accelerate growth, mitigate risk and improve employee experiences. Nearly 2,000 customers have standardized on Responsive to respond to RFPs, RFIs, DDQs, ESGs, security questionnaires, ad hoc information requests and more. Responsive is headquartered in Portland, OR, with additional offices in Kansas City, MO and Coimbatore, India. Learn more at responsive.io.

Essential Responsibilities

  • Partner with product owners and business SMEs to analyze business needs and improve the supportability, scalability and recovery of the engineered solution.
  • Ensure that the overall technical solution is aligned with the business needs and the operational teams' methodologies.
  • Use automation to improve service availability and reduce mean time to recovery.
  • Develop methods for autonomous recovery and self-repairing systems.
  • Ensure the solution is consistent with RFPIO architecture, design and development standards.
  • Coordinate and plan system releases and hotfixes.
  • Develop methods that simplify triage through checklists, runbooks and standard operating procedures.
  • Adopt new methodologies that give the business increased flexibility and agility.
  • Support software development by providing operational improvements to non-functional requirements.
  • Develop enhancements that improve service levels, guided by key performance indicators drawn from monitoring, non-functional testing and availability reports.
  • Provide a service-focused approach leveraging continuous process improvement.
  • Participate in chaos testing to improve system resiliency.
  • Mentor other engineers.
  • Provide overall technical leadership to smaller working teams as needed.
  • Stay current with the latest development tools, technology ideas, patterns and methodologies; share knowledge by clearly articulating results and ideas to key stakeholders.

Education

  • BS or MS in Computer Science or equivalent industry experience

Experience

  • 3 to 5 years in a Site Reliability Engineering, DevOps, or infrastructure-focused role
  • Experience supporting internet-facing production services and distributed systems
  • Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana or Prometheus
  • Coding experience in a high-level programming language such as Java or Python
  • Automation advocate - you truly believe in removing operational load via software
  • A strong sense of ownership
  • Experience managing, scaling, and troubleshooting Java applications
  • Familiarity with cloud infrastructure concepts (zones, regions, VPCs, etc.)
  • Understanding of a variety of packaging formats, deployment strategies, and tooling for software services
  • Working understanding of common authentication schemes, certificates, and secure secrets management
  • Capable of designing and implementing automated configuration management processes for repeatable and consistent service deployment

Knowledge, Skills & Ability

  • Prior experience as an SRE, software engineer, DevOps engineer, or system administrator
  • Experience with system automation technology, such as Ansible
  • Experience with container technologies
  • Experience using cloud services