Distributed Tracing

What is Distributed Tracing?

Distributed tracing is a method used in distributed systems to trace the path of an application request as it travels through different services and components – from frontend applications to middleware to backend services and database servers.

In a distributed system, a single request may be processed by multiple microservices. With distributed tracing, each service records information about the request as it passes through, including timestamps, the name of the service, and any relevant metadata, often without any changes to your application code. This creates a trace, which is a step-by-step log of the request's journey through the system.
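To make the idea of a trace more concrete, here is a minimal sketch in Python of what each service might record. The Span class, its field names, and the example services (frontend, orders-api, orders-db) are hypothetical and chosen only for illustration; real tracing tools define their own data models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid


@dataclass
class Span:
    """One step of a request, as recorded by a single service (illustrative model)."""
    trace_id: str                 # shared by every span of the same request
    span_id: str                  # unique to this step
    parent_id: Optional[str]      # the span that caused this one (None for the first step)
    service: str                  # name of the service that did the work
    operation: str                # what the service did
    start_ms: float               # start timestamp, in milliseconds
    end_ms: float                 # end timestamp, in milliseconds
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def new_id() -> str:
    """Random 16-character hexadecimal identifier."""
    return uuid.uuid4().hex[:16]


def build_example_trace() -> List[Span]:
    """Three hypothetical services handling one request under one trace ID."""
    trace_id = new_id()
    root = Span(trace_id, new_id(), None, "frontend", "GET /checkout", 0.0, 120.0)
    api = Span(trace_id, new_id(), root.span_id, "orders-api", "create_order", 5.0, 110.0)
    db = Span(trace_id, new_id(), api.span_id, "orders-db", "INSERT orders", 20.0, 95.0,
              {"db.rows": "1"})
    return [root, api, db]


def format_trace(spans: List[Span]) -> List[str]:
    """Render the trace as a step-by-step log, ordered by start time."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    return [f"{s.service}: {s.operation} ({s.duration_ms:.0f} ms)" for s in ordered]
```

Calling format_trace(build_example_trace()) yields one line per service in the order the request reached them, which is the "step-by-step log" described above in its simplest form.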

Some distributed tracing platforms require manual instrumentation or code modification to start tracing requests. Regardless of the method, a tag-and-follow approach is used for application transaction tracing. A unique identifier remains associated with the transaction throughout its interactions with microservices, containers, and underlying infrastructure.

[Figure: Distributed tracing can identify bottlenecks and the root cause of slow performance]

As a result, you get insight into the user experience, through the highest levels of the stack down to the application layer and the foundational infrastructure beneath the applications.

Distributed tracing provides visibility into how a request flows through a distributed system and helps identify performance issues, bottlenecks, and errors. With distributed transaction tracing, you can see slow methods, slow HTTP calls, slow database queries, and exceptions in your code, often without making any changes to your application code. By analyzing the trace, developers can pinpoint where a request is spending the most time, identify areas for optimization, and troubleshoot issues quickly.

Distributed tracing is often used in conjunction with other observability tools, such as logging, error analysis, event monitoring, user experience monitoring and metrics, to provide a more comprehensive view of a distributed system's behavior.

To understand the relationship between traces and other signals such as metrics and logs, please see: The Three Pillars of Observability: Metrics, Logs and Traces (eginnovations.com).

What is the tag-and-follow methodology in Distributed Tracing? How does Distributed Tracing work?

In distributed tracing, tags and follows are mechanisms used to correlate and trace the flow of requests across various components and services in a distributed system. Distributed tracing follows an interaction and tags it with a unique identifier. This identifier stays with the transaction as it interacts with microservices, containers, operating systems and infrastructure. The processing times of these uniquely identified requests are noted at each tier of the delivery chain and this information is then used to present simple views to developers, clearly highlighting where the application bottlenecks are. Let's take a closer look at what tags and follows mean in this context:

Tags: Tags are key-value pairs attached to a specific span (a unit of work or operation / a step) within the distributed trace. They provide additional context and metadata about the span, allowing for easier identification, filtering, and analysis. Tags can include information such as request ID, timestamps, error codes, and other relevant attributes.
For example, in a microservices architecture, when a request enters a service, a unique request ID can be assigned as a tag to that span. As the request propagates through different services, each service can add its own tags to the span, enriching it with relevant information. Tags help in understanding the behavior, performance, and flow of requests as they traverse the system.
Follows: Follows, also known as "follows from" or "follows relationship," are used to establish the causal relationship between different spans within a distributed trace. A "follows" relationship indicates that one span is a continuation of another span. It enables tracing the journey of a request from its origin to subsequent service calls.
When a service makes a request to another service, it creates a new span and establishes a "follows" relationship with the previous span. This connection allows the tracing system to visualize and reconstruct the complete path of the request as it flows through multiple services. Follows are key to understanding dependencies, latency, and overall system performance.

By leveraging tags and follows in distributed tracing, developers and IT teams gain visibility into the behavior and performance of a distributed system. They can analyze trace data, identify bottlenecks, pinpoint latency issues, and troubleshoot problems. Tags provide valuable context and metadata, while follows establish the causal relationship between spans, enabling end-to-end traceability across the system.
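The tag-and-follow idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product implements it: the Tracer class, the three services, and the header names x-trace-id and x-span-id are all hypothetical. The identifier travels with every downstream call, each service adds its own tags, and the "follows" link records which span led to which.

```python
import itertools
from typing import Dict, List


class Tracer:
    """Stands in for a tracing backend: it simply keeps every span in memory."""

    def __init__(self) -> None:
        self.spans: List[dict] = []
        self._ids = itertools.count(1)

    def start_span(self, service: str, headers: Dict[str, str]) -> dict:
        span = {
            "trace_id": headers["x-trace-id"],     # the identifier that follows the request
            "span_id": f"span-{next(self._ids)}",
            "follows": headers.get("x-span-id"),   # caller's span, None at the entry point
            "service": service,
            "tags": {},                            # key-value context added by the service
        }
        self.spans.append(span)
        return span


def outgoing_headers(span: dict) -> Dict[str, str]:
    """Headers a service attaches to the calls it makes downstream (hypothetical names)."""
    return {"x-trace-id": span["trace_id"], "x-span-id": span["span_id"]}


def inventory_service(tracer: Tracer, headers: Dict[str, str]) -> str:
    span = tracer.start_span("inventory", headers)
    span["tags"]["db.table"] = "stock"
    return "in stock"


def order_service(tracer: Tracer, headers: Dict[str, str]) -> str:
    span = tracer.start_span("orders", headers)
    span["tags"]["order.items"] = "2"
    span["tags"]["inventory.result"] = inventory_service(tracer, outgoing_headers(span))
    return "order accepted"


def gateway(tracer: Tracer, request_id: str) -> str:
    # The entry point tags the request with its unique identifier.
    span = tracer.start_span("gateway", {"x-trace-id": request_id})
    span["tags"]["http.path"] = "/orders"
    return order_service(tracer, outgoing_headers(span))


def request_path(tracer: Tracer, trace_id: str) -> List[str]:
    """Rebuild the path by walking the follows links from the entry point.

    Assumes each span is followed by at most one other span (a simple chain).
    """
    spans = [s for s in tracer.spans if s["trace_id"] == trace_id]
    next_after = {s["follows"]: s for s in spans}
    path, current = [], next_after.get(None)
    while current is not None:
        path.append(current["service"])
        current = next_after.get(current["span_id"])
    return path
```

After running gateway(tracer, "req-42") on a fresh Tracer, request_path(tracer, "req-42") returns the services in the order the request visited them. A real system would also record timings on each span and would handle fan-out, where one span is followed by several others.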

Why does Distributed Tracing matter?

The use of distributed tracing has grown in parallel with the rising usage of microservice architectures and large-scale cloud (or even multi-cloud) systems and services. In modern software architectures, applications are often composed of multiple services and components that interact with each other to process user requests. These distributed systems bring numerous advantages, such as scalability, fault tolerance, and flexibility. However, they also make it harder to understand how distributed components work together and to identify performance bottlenecks or issues.

Modern IT systems are designed to scale up and scale back with demand and have built-in redundancy and failover. Load balancers, multi-component clusters of servers and other modern architectures mean that a request for a webpage and the page's delivery may take a vast number of different paths through a system.

Distributed tracing addresses these challenges by capturing and visualizing the flow of requests as they traverse through the specific services and components of a distributed system. By providing detailed visibility on individual request traces, tracing allows for in-depth analysis of latency, performance bottlenecks, and error propagation. It helps answer questions such as:

Where is time spent during the processing of a request?
Which services or components are responsible for most of the latency?
Are there any cascading failures or error propagation across services?
What are the dependencies and relationships between different components?
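As a rough illustration of how trace data can answer the first three questions, the Python sketch below works on a small hand-made trace. The Span tuple, the service names, and the timings are invented for the example, and the self-time calculation assumes that the calls made by a span run one after another rather than in parallel.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]   # None for the first span of the request
    service: str
    start_ms: float
    end_ms: float
    error: bool = False


def self_time_by_service(spans: List[Span]) -> Dict[str, float]:
    """Time each service spent on its own work, excluding downstream calls.

    Assumes the children of a span run sequentially (no overlap).
    """
    child_time: Dict[str, float] = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s.service] += (s.end_ms - s.start_ms) - child_time[s.span_id]
    return dict(totals)


def slowest_service(spans: List[Span]) -> str:
    """Service responsible for the largest share of the request's latency."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


def error_origins(spans: List[Span]) -> List[str]:
    """Failing services whose own downstream calls all succeeded.

    These are the likely starting points of an error that propagated upward.
    """
    parents_of_failures = {s.parent_id for s in spans if s.error}
    return [s.service for s in spans if s.error and s.span_id not in parents_of_failures]


# Hypothetical trace: the gateway calls orders, which calls inventory, then payments.
EXAMPLE = [
    Span("a", None, "gateway", 0.0, 200.0, error=True),
    Span("b", "a", "orders", 10.0, 190.0, error=True),
    Span("c", "b", "inventory", 20.0, 60.0),
    Span("d", "b", "payments", 70.0, 180.0, error=True),
]
```

For this example, self_time_by_service(EXAMPLE) attributes 20 ms to gateway, 30 ms to orders, 40 ms to inventory and 110 ms to payments, so payments is both the main source of latency and the origin of the error that surfaced at the gateway.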

When is Distributed Tracing used?

The common use cases for distributed tracing include:

Troubleshooting and Debugging: Distributed tracing helps identify bottlenecks, slow response times, and other performance issues across multiple services, components, and systems in real time. Developers can trace a request as it passes through multiple services and pinpoint where problems are occurring.
Performance Optimization: Distributed tracing can help identify and optimize slow or inefficient code, reducing latency and improving performance.
Root Cause Analysis: Distributed tracing helps determine the root cause of issues, such as system failures or performance problems, by providing end-to-end visibility of requests and transactions across different services and components.
Capacity Planning: Distributed tracing can help identify service dependencies and performance bottlenecks, enabling organizations to plan for and allocate resources accordingly.
Compliance and Security: Distributed tracing can be used to track and monitor data access and usage, ensuring compliance with regulations and security policies.

What are the limitations of Distributed Tracing? When are traces not enough?

While distributed tracing is a powerful technique for monitoring and analyzing the performance of distributed systems, there are some limitations to what it can do. Here are some scenarios where distributed tracing may not be enough:

Not all tracing tools provide automatic instrumentation: Distributed tracing is intended to save teams time and effort; however, some tools require developers to manually instrument or adjust their code before requests can be traced. This can be time-consuming and can introduce coding errors. Look for automated observability, AIOps and APM capabilities.
Tracing is not enough on its own: Distributed tracing provides visibility into the performance of individual components of a distributed system, but it may not provide enough context to understand the system. For example, distributed tracing may not capture the impact of network latency or infrastructure bottlenecks that affect the system's performance. You need converged observability and other data from metrics, logs and events. See: The Three Pillars of Observability: Metrics, Logs and Traces (eginnovations.com).
Limited to backend coverage: Many tools do not take an end-to-end approach to distributed tracing and only generate a trace ID for a request when it reaches the first back-end service, losing information pertaining to the user session on the frontend. Modern tools include frontend services, allowing issues to be routed easily to the frontend or backend teams as appropriate.
Head-based trace sampling: To overcome scalability challenges, some tools sample traces (often on a randomized basis) at the start of each request, meaning some traces may be missing or incomplete. Sampling may be inappropriate for high-priority traces, such as high-value transactions or requests from certain customers. Details on how eG Enterprise avoids sampling a subset of transactions can be found in How to Get Full-Stack Visibility for Your Java Applications | White Paper (eginnovations.com).
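The head-based sampling limitation can be illustrated with a short Python sketch. It is an assumption-laden example rather than any vendor's algorithm: instead of a random draw, it hashes the trace ID so that the decision is repeatable for a given request, and the high_priority flag is a hypothetical way of exempting important transactions from sampling.

```python
import hashlib


def head_sample(trace_id: str, rate: float, high_priority: bool = False) -> bool:
    """Decide at the start of a request whether its trace will be recorded.

    rate is the fraction of traces to keep, between 0.0 and 1.0.
    """
    if high_priority:
        return True  # never drop traces flagged as important (illustrative rule)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # repeatable value in [0, 1)
    return bucket < rate


def kept_fraction(rate: float, total: int = 10_000) -> float:
    """Share of synthetic trace IDs that survive sampling at the given rate."""
    kept = sum(head_sample(f"trace-{i}", rate) for i in range(total))
    return kept / total
```

With a rate of 0.1, roughly nine out of ten requests leave no trace at all, so a slow or failing request is quite likely to be one of those that was never recorded. Because the decision is made before the outcome of the request is known, the sampler cannot favor the traces that later turn out to be interesting unless they are flagged up front.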

In some cases, other monitoring techniques may be needed in addition to distributed tracing to fully understand the performance of a distributed system. These techniques may include log analysis, metrics monitoring, and synthetic testing.

Can Distributed Tracing identify code-level issues in applications?

Yes, in some products, such as eG Enterprise, code-level visibility is possible. Distributed transaction tracing with eG Enterprise helps application managers track and follow every user transaction from any device (web and mobile). Using byte-code instrumentation, eG Enterprise auto-discovers the URLs being accessed by users, tracks the time taken for the server to respond, and alerts when slowness is detected. A transaction flow map helps to visualize the different stages of transaction processing.

Using distributed transaction tracing, you can easily pinpoint the exact line of application code that is taking a long time to process and causing slowness. You can also detect whether slow database queries or slow third-party calls are contributing to the slowness. This capability is currently available for Java and .NET web applications. Transaction tracing for Node.js and PHP is also supported.
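The general idea behind code-level timing can be sketched in Python with a decorator that wraps a function and records how long each call takes. This is only an analogy: it is not how eG Enterprise or any byte-code instrumentation agent works internally, and the traced decorator, the TIMINGS list and the two example functions are hypothetical.

```python
import functools
import time
from typing import Callable, List, Tuple

TIMINGS: List[Tuple[str, float]] = []  # (function name, elapsed seconds) per call


def traced(func: Callable) -> Callable:
    """Wrap a function so that every call records its duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the function raises, so failed calls are timed too.
            TIMINGS.append((func.__qualname__, time.perf_counter() - start))
    return wrapper


@traced
def load_cart(user_id: int) -> List[str]:
    time.sleep(0.001)  # stands in for a fast lookup
    return ["book", "pen"]


@traced
def price_cart(items: List[str]) -> float:
    time.sleep(0.005)  # stands in for a slower computation
    return 12.5 * len(items)


def slowest_call(timings: List[Tuple[str, float]]) -> str:
    """Name of the call that took the longest."""
    return max(timings, key=lambda t: t[1])[0]
```

After calling price_cart(load_cart(7)), TIMINGS holds one entry per call, and slowest_call(TIMINGS) names the function to look at first. Agent-based tools aim to achieve a similar effect without the developer having to add a decorator or touch the source code.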

Distributed tracing can also be a powerful tool to proactively optimize problematic or inefficient code within applications and services.

Observability - APM (Application Performance Monitoring) vs Distributed Tracing

Application Performance Monitoring (APM) and distributed tracing are two distinct but related concepts that play useful roles in understanding and monitoring application performance.

Distributed tracing focuses on tracking the flow of individual requests as they traverse various services and components within a distributed system. It provides a detailed, end-to-end view of request traces, capturing latency, dependencies, and contextual information at a granular level. Distributed tracing helps analyze the performance characteristics of each request and identify bottlenecks or issues within specific spans or components. Traces are one telemetry type. APM uses traces along with other telemetry types, such as metrics and logs, to provide observability. See: The Three Pillars of Observability: Metrics, Logs and Traces (eginnovations.com).

APM provides a broad view of the overall performance and behavior of an application or system. It captures metrics such as response times, error rates, and resource utilization, offering insights into the health of the entire system or specific components.

Beyond application metrics, some APM tools can also provide host, infrastructure, and network metrics. This means that administrators can become aware of anomalous behavior and the warning signs of potential issues (e.g., increasing server CPU or handle queues) and proactively resolve them before users and transactions are impacted. Distributed tracing, by contrast, typically detects issues that have already arisen, by which point users are probably already affected.

What are the top tools for Distributed Tracing in IT monitoring?

There are many popular, widely used tools for implementing distributed tracing in distributed systems, including OpenTelemetry, Jaeger (originally developed by Uber), Zipkin, AWS X-Ray (a managed tracing service provided by Amazon Web Services), Dynatrace, Datadog and, of course, eG Enterprise.