Senior Site Reliability Engineer

🇺🇸 Block

Bay Area, CA, United States of America

Full Time, Senior

Job Description

Block builds simple, powerful tools that make progress towards an economy that’s truly open to all. Each of our brands unlocks different aspects of the economy for more people. Square makes commerce and financial services accessible to sellers. Cash App is the easy way to spend, send, and store money. Afterpay is transforming the way customers manage their spending over time. TIDAL is a music platform that empowers artists to thrive as entrepreneurs. Bitkey is a simple self-custody wallet built for bitcoin. Proto is a suite of bitcoin mining products and services. Together, we’re helping build a financial system that is open to everyone. Join us.

The Role

As a member of the SRE team, you will proactively and reactively improve the reliability of Block's platform and critical infrastructure. You are metrics-driven, systems-oriented, and focused on building distributed platforms that enable safe, scalable product development.

You will leverage and continuously improve AI-driven tooling and automation to enhance observability, accelerate incident detection and response, and reduce operational toil. This includes applying AI to incident analysis, alert tuning, and operational workflows.

You will participate in primary platform oncall (12 hours per day, one week every few weeks, depending on team size), supporting Block's most critical (Tier 0) services. In this role, you will lead incident command, coordinate mitigation, and drive effective escalation during high-severity events.

You Will

- Build and extend platforms to improve system reliability
- Work on team goals that encompass reliability for the entire company
- Standardize reliability tools across multiple platforms and organizations
- Triage, coordinate, and lead stabilization of sev 0–1 incidents
- Serve as primary oncall, maintaining structured escalation paths and exercising leadership escalation
- Drive platform-wide reliability improvements, shared operational tooling, and deploy-safety patterns
- Use AI-driven systems to improve signal detection, reduce noise, and accelerate root cause analysis
- Design and implement safe deployment patterns (progressive delivery, automated rollback, guardrails)

You Have

- Drive to root cause systems with many moving parts and take the necessary steps to fix them
- Demonstrated technical initiative and leadership on previous projects, especially those with a backend/platform focus
- Familiarity with AI-driven tooling for observability, incident analysis, or automation
- A mindset that naturally reaches for AI to accelerate problem-solving and reduce toil
- Experience running production oncall for high-availability systems
- Strong incident management skills: structured triage, mitigation under pressure, blameless postmortems
- Fluenc

Required Skills

Go, Scala, R, React, Rails, SRE, Observability
