Victor E. Martínez Rubio

Supercharged GitHub Actions monitoring with OpenTelemetry at Elastic by Victor E. Martínez Rubio

Ready to observe your GitHub Actions from a central repository? At Elastic, we implemented a custom OpenTelemetry Collector receiver to collect GitHub Actions logs and combined it with the existing traces receiver to observe all workflows in our GitHub organization. Learn about the challenges we encountered, how we solved them, and see how centralized logs, traces, and metrics empower the analysis and visualization of GitHub workflows.

At Elastic, we use GitHub Actions in multiple repositories for our CI/CD pipelines. However, we faced challenges with decentralized logs, which made troubleshooting issues that spanned multiple workflow runs or repositories difficult.

In this session, we explain how we centralized GitHub Actions telemetry using OpenTelemetry Collector and how it helped us improve our analysis and visualization of GitHub workflows.

Initially, we focused on scanning logs to detect security vulnerabilities and creating a unified platform for searching, analyzing, and visualizing logs, complete with custom alerts and notifications.

As our project progressed, we realized the broader advantages of centralized logs combined with traces and metrics, which we are going to explore with real-world examples.

We will examine how we handled spikes in log volume, navigated GitHub Actions API rate limits, and ensured data integrity while implementing the custom OpenTelemetry Collector receiver for GitHub Actions log collection.
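To make the rate-limit concern concrete, here is a minimal sketch of one way a log collector could decide how long to pause before its next GitHub API call. It relies on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that the GitHub REST API returns; the function name `seconds_until_allowed` and the `reserve` threshold are hypothetical, and this is not the logic of the receiver described in the talk.

```python
import time
from typing import Mapping, Optional


def seconds_until_allowed(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    reserve: int = 50,  # hypothetical safety margin of requests to keep unused
) -> float:
    """Return how many seconds to wait before the next API request.

    Waits until the rate-limit window resets once the remaining budget
    drops to `reserve` or below; otherwise returns 0.0.
    """
    # Header names are case-insensitive in HTTP, so normalize the keys.
    lowered = {key.lower(): value for key, value in headers.items()}
    if "x-ratelimit-remaining" not in lowered or "x-ratelimit-reset" not in lowered:
        return 0.0  # no rate-limit information: do not block

    remaining = int(lowered["x-ratelimit-remaining"])
    reset_at = int(lowered["x-ratelimit-reset"])  # UTC epoch seconds

    if remaining > reserve:
        return 0.0

    current = time.time() if now is None else now
    return max(0.0, float(reset_at - current))
```

A caller would pass the headers of the last response and sleep for the returned duration before fetching the next batch of logs.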

We planned to use OpenTelemetry Collector as the primary log receiver and exporter. To ensure reliability, we intended to queue webhook events with a proxy service, which sends them to the collector at a controlled pace and retries failed requests.
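The queue-and-retry idea behind such a proxy can be sketched as follows. This is only an illustration of the pattern (buffer events, forward them at a bounded pace, retry failures with backoff), not the design we had in mind; the class `PacedForwarder`, its parameters, and the `send` callback are all hypothetical names.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


class PacedForwarder:
    """Buffers webhook events and forwards them one at a time with retries."""

    def __init__(
        self,
        send: Callable[[Dict], bool],  # returns True when the collector accepted the event
        min_interval_s: float = 0.5,   # pause between events (the "controlled pace")
        max_attempts: int = 3,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.send = send
        self.min_interval_s = min_interval_s
        self.max_attempts = max_attempts
        self.sleep = sleep
        self.queue: Deque[Dict] = deque()
        self.dead_letter: List[Dict] = []  # events that exhausted their retries

    def enqueue(self, event: Dict) -> None:
        self.queue.append(event)

    def drain(self) -> int:
        """Forward all queued events; return how many were delivered."""
        delivered = 0
        while self.queue:
            event = self.queue.popleft()
            for attempt in range(1, self.max_attempts + 1):
                if self.send(event):
                    delivered += 1
                    break
                if attempt < self.max_attempts:
                    # Exponential backoff before the next attempt.
                    self.sleep(self.min_interval_s * 2 ** attempt)
            else:
                # The loop finished without a successful send.
                self.dead_letter.append(event)
            self.sleep(self.min_interval_s)
        return delivered
```

In a real deployment, `send` would wrap an HTTP POST to the collector's webhook endpoint, and the in-memory queue would likely be replaced by a durable one so that events survive a restart.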

We will discuss how to fine-tune the receiver for log volume efficiency and optimize the collector's reliability. Visualizations will showcase the impacts of various configuration changes on performance, and we will explain why we did not implement the proxy service.

Finally, we will share real-world examples of how centralized logs, traces, and metrics have improved our analysis and visualization capabilities:

  • How we used detection rules to find leaked secrets and sensitive information in logs, making it easier to identify and remediate security vulnerabilities.
  • How we used traces to identify bottlenecks and the most frequently failing runs to optimize our workflows.
  • How centralized logs helped us measure the frequency of flaky commands and prioritize optimization and troubleshooting efforts.
  • How we built informational dashboards from the collected traces and metrics to find optimization opportunities.
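To give a feel for what secret detection over centralized logs involves, here is a small Python stand-in. It is not Elastic detection rule syntax and not the rules used at Elastic; it only shows the underlying idea of matching log lines against known credential formats. The pattern set and the function name `scan_log_lines` are assumptions chosen for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Illustrative patterns only; a production rule set would be much broader.
PATTERNS = {
    # Classic GitHub personal access tokens: "ghp_" plus 36 alphanumerics.
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # AWS access key IDs: "AKIA" or "ASIA" plus 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),
    # PEM private key headers.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_log_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious log line."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

Running this kind of check centrally, rather than per repository, is what lets one rule cover every workflow in the organization.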

Talk Questions

      
  • Question 766
    Do developers prefer to look at logs in GitHub Actions or in Elastic?
  • Question 769
    How do you enforce the usage of OpenTelemetry in every GitHub Actions workflow at the organization level?
  • Question 770
    What is the log retention period in Elastic? Considering the high cost of Elastic, why not rely on GitHub’s UI for viewing logs instead?
  • Question 765
    What open source tool would you recommend to monitor OpenTelemetry logs/traces? Grafana? Kibana?
  • Question 760
    We are creating metrics in Datadog manually. With OpenTelemetry, can we create metrics for Datadog automatically? Does OpenTelemetry have an extension for that?
  • Question 761
    The dropped logs? How do you know what is happening?
  • Question 768
    Can we get these metrics from GitHub SaaS, or do we need our own workers?
  • Question 759
    You said traces measure the time of an action and metrics are numeric values, for example CPU consumption. I use metrics to collect process duration; is that a misuse?
  • Question 767
    How much effort do DevOps/SRE teams need to invest to set up an OTel Collector, compared to running a managed service like Datadog?
  • Question 762
    Can you add alarm triggers based on the metrics in your demo?
  • Question 763
    Can we use Bitbucket instead of GitHub Actions?
  • Question 764
    Can CloudWatch send info to OpenTelemetry?
  • Question 771
    Looks like you're working around a lot of inefficiencies in GitHub that are easily solvable with GitLab, where something like secret detection isn't an afterthought but is avoided in the first place. You should give it a try!
  • Question 776
    What would a first approach to building a platform look like, for example removing Ingress and adding the Gateway API, as we saw yesterday? How much complexity would that add to this approach? Thanks