Site Reliability Engineer

Date: Apr 25, 2024

Location: IN

Company: Responsive

About Responsive

Responsive (formerly RFPIO) is the global leader in strategic response management software, transforming how organizations share and exchange critical information. The AI-powered Responsive Platform is purpose-built to manage responses at scale, empowering companies across the world to accelerate growth, mitigate risk and improve employee experiences. Nearly 2,000 customers have standardized on Responsive to respond to RFPs, RFIs, DDQs, ESGs, security questionnaires, ad hoc information requests and more. Responsive is headquartered in Portland, OR, with additional offices in Kansas City, MO and Coimbatore, India. Learn more at responsive.io.

Essential Responsibilities

Partner with product owners and business SMEs to analyze business needs and improve the supportability, scalability and recoverability of the engineered solution.
Ensure that the overall technical solution is aligned with the business needs and the operational teams' methodologies.
Improve service availability and reduce mean time to recovery (MTTR) through automation.
Develop methods for autonomous recovery and self-repairing systems.
Ensure the solution is consistent with Responsive (formerly RFPIO) architecture, design and development standards.
Coordinate and plan system releases and hotfixes.
Develop methods that simplify triage through checklists, runbooks and standard operating procedures.
Adopt new methodologies that give the business greater flexibility and agility.
Support software development by providing operational improvements to non-functional requirements.
Develop enhancements that improve service levels, guided by key performance indicators drawn from monitoring, non-functional testing and availability reports.
Provide a service-focused approach leveraging continuous process improvement.
Participate in chaos testing to improve system resiliency.
Mentor other engineers.
Provide overall technical leadership to smaller working teams as needed.
Stay current with the latest development tools, technologies, patterns and methodologies; share knowledge by clearly articulating results and ideas to key stakeholders.

Education

BS or MS in Computer Science or equivalent industry experience

Experience

3 to 5 years in a Site Reliability Engineering, DevOps, or infrastructure-focused role
Experience supporting internet-facing production services and distributed systems
Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana or Prometheus
Coding experience in a high-level programming language such as Java or Python
Automation advocate - you truly believe in removing operational load via software
A strong sense of ownership.
Experience managing, scaling, and troubleshooting Java applications
Familiarity with cloud infrastructure concepts (zones, regions, VPCs, etc.)
Understanding of a variety of packaging formats, deployment strategies, and tooling for software services
Working understanding of common authentication schemes, certificates, and securely managing secrets
Capable of designing and implementing automated configuration management processes for repeatable and consistent service deployment

Knowledge, Skills & Abilities

Prior experience as an SRE, software engineer, DevOps Engineer, or system administrator
Experience in system automation technology, such as Ansible
Experience in container technologies
Experience using cloud services.
