Our client is a division of a global business and financial news and information company. It is a leading market index provider and the owner and distributor of multiple financial services: a dynamic information network with data, news, and analytics covering cash, derivatives markets, money markets, government and municipal bonds, currencies, commodities, mortgages, indices, insurance, and legal information.
Join a great company, not merely an individual project.
Position overview
We’re seeking a Senior Platform Reliability Engineer to keep our Kubernetes-centric provisioning and Linux estate running smoothly. You’ll coordinate fixes when OS builds or upgrades hit exceptions, working across teams to find root causes from logs and metrics and to recommend changes.
You’ll automate repeat work (Bash/Python), strengthen runbooks and observability, and document configurations and procedures. You’ll be partnering with hands-on engineers and architects in a highly technical, delivery-focused environment.
Responsibilities
Operate and improve a Kubernetes-centric, open-source platform across provisioning and maintenance workflows.
Coordinate resolution of exceptions in a multi-stage (≈10) provisioning pipeline; engage the right owners with clear, actionable context.
Build and maintain automation and runbooks (Bash/Python) to reduce toil and increase reliability.
Lead triage, log analysis, and root-cause investigation to minimize downtime.
Enhance observability (metrics/logs/traces) and promote SLO-oriented practices.
Operate and tune distributed data stores (e.g., Cassandra) and platform services.
Evolve OS/network provisioning (PXE boot, Subiquity, Foreman, imaging) and server management (BMCs, multi-NIC).
Partner with platform teams to improve automation, performance, security, and cost efficiency.
Document system configurations, procedures, and changes for repeatability.
Requirements
Strong Linux administration and troubleshooting (Ubuntu/Debian preferred).
Production experience with Kubernetes (or similar orchestrator).
Hands-on network/OS provisioning (PXE, Foreman, Subiquity, imaging) and server hardware management (BMCs, multiple NICs).
Proficiency in scripting (Bash, Python) for automation and diagnostics.
Ability to debug across the stack (infrastructure, workloads, automation, networks) and deliver RCA.
Experience with distributed databases (Cassandra or similar).
Familiarity with runbooks, incident management, and SRE/reliability practices.
Clear communicator and process facilitator: knows whom to engage, what signals to collect, and how to drive issues to closure.
CI/CD and IaC mindset (Git and pipelines; Terraform/Ansible a plus).