Be wary of WhatsApp messages impersonating Jobline Resources staff and offering job opportunities. If you receive a suspicious message, contact Jobline at +65 6339 7198.

Responsibilities

  • Design, implement, and maintain Datadog-based observability solutions across infrastructure, platforms, and applications.
  • Develop and optimize dashboards, monitors, and alerts to support proactive detection and triage of performance and reliability issues.
  • Integrate custom telemetry pipelines (metrics, logs, traces, events) aligned with OpenTelemetry and platform architecture standards.
  • Manage instrumentation strategies to ensure accurate and consistent coverage across services.
  • Apply SRE principles to improve service reliability, availability, and performance.
  • Define and track SLIs, SLOs, and SLAs for critical systems, and build feedback loops to continuously enhance service health.
  • Automate manual operational processes using Python, Terraform, or CI/CD tooling.
  • Collaborate with development and platform teams to identify resilience patterns and embed observability by design.
  • Serve as the subject matter expert (SME) for Datadog — advising on advanced configurations, integrations, and performance optimization.
  • Enable distributed tracing, APM, RUM, and synthetics capabilities to support end-to-end visibility.
  • Implement and maintain Datadog Terraform configurations, templates, and governance models for enterprise consistency.
  • Conduct performance tuning and cost optimization for Datadog usage across global environments.
  • Partner with the Operations and Platform teams to analyze incident patterns and provide root cause insights through observability data.
  • Lead post-incident reviews and recommend observability-driven improvements to prevent recurrence.
  • Build automation and correlation mechanisms for real-time alert enrichment and contextual diagnostics.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or a related field.
  • 5+ years of experience in observability engineering or SRE roles within large-scale distributed systems.
  • Deep, hands-on expertise with Datadog, including APM, Logs, Metrics, RUM, and Synthetics.
  • Strong proficiency in:
    • Infrastructure as Code (IaC): Terraform
    • Automation: Python, Bash, or similar scripting languages
    • CI/CD pipelines: Jenkins, GitLab, or GitHub Actions
  • Strong understanding of monitoring patterns, tracing, and event correlation for complex systems.
  • Familiarity with OpenTelemetry and modern observability frameworks.
  • Experience supporting multi-cloud environments (AWS, GCP, Azure).
  • Familiarity with container orchestration (Kubernetes, ECS) and service mesh observability.
  • Understanding of data visualization and analytics for operational reporting.
  • Exposure to AI-driven observability enhancements or integration with LLM-based insights (a plus).
  • Certification in Datadog, AWS, or GCP is advantageous.

Shortlisted candidates will be offered a 1-year agency contract.