Design, implement, and maintain scalable and reliable infrastructure solutions to support our applications and services.
Monitor the health and performance of systems and services using Datadog, Sumo Logic, Grafana, and other monitoring tools.
Demonstrate proficiency in Terraform.
Address issues and implement preventive measures to minimize downtime and service disruptions.
Automate repetitive tasks, streamline operational workflows, and improve efficiency through infrastructure as code (IaC) and automation tools.
Collaborate with development teams to ensure that applications are designed and implemented with reliability and scalability in mind.
Participate in the on-call rotation, responding promptly to incidents and emergencies and triaging and resolving issues to minimize customer impact.
Conduct post-incident reviews, analyse root causes, and implement corrective actions to prevent recurrence.
Continuously evaluate and improve monitoring, alerting, and logging solutions to enhance visibility into system behaviour and performance.
Stay informed of industry trends and best practices in site reliability engineering and contribute to the adoption of new technologies and methodologies.
Altair is a global technology company that provides software and cloud solutions in the areas of product development, high-performance computing (HPC), and data analytics. Altair enables organizations across various industries to compete more effectively in a connected world while creating a more sustainable future. For more information, visit https://altair.com/.