Think Our Observability Understanding Is Safe?
- shyamrangaraju9
- Jun 4, 2023
- 8 min read
Updated: Aug 16, 2023
IT Observability Understanding - What is observability, why is it important, and why do we need it?
Before diving into observability, we need to contrast traditional systems with today's technologies. Conventional, centralized IT systems had fewer moving parts, so monitoring CPU, memory, databases, and the network was enough to understand, detect, and fix issues. In today's technological world of distributed systems with many interconnected parts, the number and variety of possible failures is far higher.
This is where observability comes into play.
“Observability is the ability to measure the internal states of a system by examining its outputs.”
The goal of observability is to understand complex IT environments (hardware, software, cloud-native infrastructure, containers, open-source tools, microservices, serverless technologies, and so on) and the activity within them by analyzing their data, so that issues can be resolved and the system kept efficient and reliable. For this purpose, observability relies on telemetry data such as metrics, logs, and traces, which provide a deep understanding of distributed systems.
The building blocks of Observability

Metrics:
A metric is a numeric value measured over a specific period and includes attributes such as timestamps, KPIs (Key Performance Indicators), and values.
In observability, we focus on reliability and performance metrics, such as memory usage, CPU utilization, latency, traffic, error rate, and saturation. Metrics are structured by default, which makes them easier to store and observe; we can retain them for long periods and use them to analyze the system's performance over time.
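As a minimal sketch (the `LatencyMetric` class and the metric name are hypothetical, not from any real metrics library), a metric is just a series of timestamped numeric samples that we can aggregate, for example into a nearest-rank p95 latency:

```python
import math

class LatencyMetric:
    """A toy in-memory metric: timestamped latency samples for one endpoint."""

    def __init__(self, name):
        self.name = name
        self.samples = []  # (timestamp, latency_ms) pairs

    def record(self, timestamp, latency_ms):
        self.samples.append((timestamp, latency_ms))

    def p95(self):
        """Nearest-rank 95th-percentile latency over the recorded window."""
        values = sorted(ms for _, ms in self.samples)
        rank = math.ceil(0.95 * len(values))  # 1-based nearest rank
        return values[rank - 1]

metric = LatencyMetric("checkout.request_latency_ms")
for ts, ms in enumerate([12.0, 15.0, 11.0, 240.0, 13.0]):
    metric.record(float(ts), ms)

print(metric.p95())  # the single slow outlier dominates the tail: 240.0
```

Percentiles like this are preferred over averages because a handful of slow requests is exactly what a mean would hide.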
Logs:
An event log is an immutable, human-readable record of a historical event. It includes a timestamp and a payload that provide context for the event. Most logs come in three formats: plain text, structured, and binary.
Of the three, plain text is the most common format, but structured logs, which carry additional information and metadata, are becoming more popular because they are easier to query.
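Structured logging is easy to sketch with Python's standard `logging` module. The `JsonFormatter` below is a minimal illustration (field names such as `user_id` are invented for the example) of why structured logs are easier to query than plain text: every field has a name.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so fields are queryable by name."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # fields attached via `extra=` become structured metadata
            "user_id": getattr(record, "user_id", None),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted", extra={"user_id": "u-123"})
```

A log backend can now filter on `level == "ERROR"` or `user_id == "u-123"` directly, instead of regex-matching free text.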
Traces:
A trace represents the end-to-end journey of a request through a distributed system.
As the request moves through the IT systems, microservices perform various operations on it. Each of these operations is called a "span," and each span is encoded with important information about the microservice that performed it. By analyzing these traces, we can track how a particular request moved through the system, the operations performed on it, and the microservices involved. With this detailed information, SREs can identify the root cause of any bottleneck or breakdown.
Do these blocks solve our purpose on their own? While we can use each of these blocks individually, that won't give us the desired results, because each pillar has its own limitations.
For instance, logs can be challenging to sort and aggregate into meaningful conclusions; metrics can be hard to tag and slice, which makes them tricky to use for troubleshooting on their own; and traces can produce a lot of unnecessary data.
To avoid these challenges, we should opt for an integrated approach by combining these three blocks.
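The integrated approach can be illustrated with a toy correlation: given spans that exceed a latency threshold, pull the log lines sharing the same `trace_id`, so metrics, traces, and logs tell one story. All data below is fabricated for the example.

```python
# Fabricated telemetry from two requests, keyed by a shared trace_id.
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "payment declined"},
    {"trace_id": "t2", "level": "INFO", "message": "cart updated"},
]
spans = [
    {"trace_id": "t1", "name": "checkout", "duration_ms": 950.0},
    {"trace_id": "t2", "name": "browse", "duration_ms": 40.0},
]

def logs_for_slow_traces(threshold_ms):
    """Find traces slower than the threshold, then join to their log lines."""
    slow_ids = {s["trace_id"] for s in spans if s["duration_ms"] > threshold_ms}
    return [entry for entry in logs if entry["trace_id"] in slow_ids]

print(logs_for_slow_traces(500.0))  # only the slow checkout's error surfaces
```

A latency metric alone says "something is slow"; the trace says which request; the correlated log says why. That join is the value of combining the pillars.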
Four Key Components of Observability

To make our system observable, we should create a suite of tools and applications that can collect the telemetry data in terms of logs, metrics, or traces.
To achieve this, the following components help us implement observability in any ecosystem:
1. Instrumentation: Observability tools collect telemetry data from a container, service, application, host, and any other system component to enable visibility across the entire infrastructure. Examples of telemetry data include metrics, events, logs, and traces, often referred to as MELT.
There are two ways to instrument our code:
Auto-instrumentation uses shims or bytecode instrumentation agents to intercept our code at runtime or at compile-time to add tracing and metrics instrumentation to the libraries and frameworks we depend on.
Manual instrumentation uses an SDK and APIs we call from our services to provide observability into the operations within a service. It requires us to manually add spans, context propagation, attributes, etc. to our application code. It is akin to commenting on code or writing tests.
Auto-instrumentation is a quick way to start seeing our telemetry data in any Observability platform. As a general rule of thumb, it is best to start with auto-instrumentation if it’s available. Once that’s in place, we'll be able to see where the blind spots are in our system and we can start adding manual instrumentation as needed.
For example, auto-instrumentation doesn’t know our business logic—it only knows about frameworks and languages—in which case we'll want to manually instrument our business logic so that we get that additional visibility into the inner workings of our services.
2. Data correlation and context: We must process and analyze the telemetry data collected from various entities to establish a correlation. It also creates a context so humans can understand any patterns and anomalies developing within the ecosystem.
3. Incident response: Incident management and automation systems get outage data and then share it with relevant people or teams based on on-call schedules and technical skills.
4. AIOps: Machine learning models automatically aggregate, correlate, and prioritize incident data. They also filter out unrelated data, create alert signals, and detect issues that can impact the system's performance and efficiency. Ultimately, they accelerate incident response.
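As a toy stand-in for what an AIOps pipeline might do, the sketch below flags a metric sample whose z-score against recent history is extreme. The threshold and data are illustrative only; real systems use far richer models.

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it deviates from `history` by more than
    `threshold` standard deviations (a simple z-score test)."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

error_rates = [0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.1, 1.0]  # percent, per minute
print(is_anomalous(error_rates, 1.05))  # within normal variation
print(is_anomalous(error_rates, 9.0))   # sudden error spike, would alert
```

Even this crude rule shows the idea: the model, not a human, decides which of thousands of metric streams deserves an alert.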
Now that we have covered the critical components of observability, let's look at the benefits it brings.
Key Benefits of Observability

Better Customer Experience: Observability focuses on detecting and fixing issues that bottleneck the system's performance and efficiency. Implementing observability therefore improves system availability and delivers a better end-user experience.
Operational Cost Reduction: Observability speeds up detecting and fixing infrastructure management issues, and it focuses on reducing irrelevant or redundant information and prioritizing critical events. As a result, we can accomplish tasks with a smaller operations team and save money.
Higher Visibility: SREs working in distributed environments are often challenged by visibility issues. They sometimes don't know which services are in production, how the application is performing, or who owns a particular deployment. With observability, they gain real-time visibility into production environments, which helps improve productivity.
Enhanced Workflow: Observability allows SREs to trace a particular request's journey from start to finish, with contextualized data about a specific issue. This helps developers streamline the investigation, debug the issue, and optimize the overall workflow of the application.
Increased Developer Velocity: Continuous observability by SREs gives developers early access to the relevant logs, metrics, and related data, which increases development speed and steadily improves production cycles.
Proactive, Actionable Alerts: With observability implemented in the ecosystem, SREs can detect and fix problems faster. Deeper visibility means issues are surfaced at the earliest point and routed to the relevant people or teams as alerts or notifications.
Finding Unknown Issues: With application performance monitoring tools, teams can find known issues. Observability also addresses unknown issues, the problems we don't yet know exist in the ecosystem, and helps identify them and their root causes to improve efficiency.
Having discussed the benefits, we must also acknowledge the challenges of observability.
Challenges in Observability

Understanding microservices and containers in real-time: The dynamic nature of these technologies makes it hard to have real-time visibility into their workloads, which is a crucial element of observability. Without proper tooling, it becomes impossible for IT teams to understand the internal state of all the relevant components. They either have to contact workflow architects who built the system or do guesswork, which is not the ideal approach with so many interdependencies.
The complexity of dynamic multi-cloud environments: Multi-cloud environments are evolving rapidly with a continuous influx of new technologies, which makes it very difficult for SREs to understand how everything works together. Understanding the interdependencies requires a new set of tools and approaches, because without visibility into the infrastructure, it's impossible to implement observability practices.
Different data formats: To successfully implement observability, we should try to collect and aggregate telemetry data from various sources. However, it can be challenging to filter out the correct information and provide context when it comes to interpretation, as the same type of data comes from multiple sources in multiple formats. All this demands a strategy for structuring information of different forms into a standardized format.
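Standardizing heterogeneous formats can be sketched as a small normalizer. Both input shapes below are invented for the example: one service emits JSON log lines, a legacy one emits plain-text "LEVEL timestamp message" lines, and both are mapped into one record shape before aggregation.

```python
import json

def normalize(raw):
    """Map one raw log line, in either known format, into a standard record."""
    if raw.startswith("{"):  # JSON log line from a modern service
        doc = json.loads(raw)
        return {"ts": doc["timestamp"],
                "level": doc["severity"].upper(),
                "message": doc["msg"]}
    # plain-text "LEVEL ts message" line from a legacy service
    level, ts, message = raw.split(" ", 2)
    return {"ts": ts, "level": level.upper(), "message": message}

a = normalize('{"timestamp": "2023-06-04T10:00:00Z", "severity": "error", "msg": "disk full"}')
b = normalize("WARN 2023-06-04T10:00:01Z queue backlog growing")
print(a["level"], b["level"])  # both now use the same field names and casing
```

Once every source lands in the same shape, correlation and querying across sources become straightforward.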
Volume, velocity, and variety of data: Observability requires visibility into every data point in a dynamic environment. Without that visibility, teams try to stitch together information from static dashboards, using timestamps to identify the events that may have caused failures. The sheer volume and velocity of data can therefore hamper an observability implementation.
Difficulty in analyzing business impact: Many organizations are focusing on accelerating development velocity and ensuring faster release instead of cleaning up technical debt. This is a major challenge in analyzing the business impact.
Having discussed the challenges, let's look at the best practices that help overcome them and implement observability well in our organizations.
Best Practices of Observability

Become proactive:
⮚ Trace information back to its source to verify systems are working well.
⮚ Verify if the telemetry data collected is clean and usable.
⮚ Test visualizations and automated responses.
⮚ Monitor database and storage repository sizes.
⮚ Increase or reduce data inputs based on previous results.
Filter data at the point of creation: To avoid insignificant data, we should design our observability system to filter out unimportant data at multiple levels, reducing wasted bandwidth and providing real-time, actionable insights.
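Filtering at the point of creation can be sketched with a standard-library `logging.Filter` that keeps all INFO-and-above records but samples only one in N DEBUG records, so the noise never leaves the service. The sampling policy here is illustrative.

```python
import logging

class SamplingFilter(logging.Filter):
    """Keep INFO and above; pass only every Nth DEBUG record."""
    def __init__(self, keep_every=100):
        super().__init__()
        self.keep_every = keep_every
        self.seen = 0

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # important records are never dropped
        self.seen += 1
        return self.seen % self.keep_every == 1  # sample 1-in-N DEBUG lines

logger = logging.getLogger("noisy-service")
logger.addFilter(SamplingFilter(keep_every=100))
```

Dropping at the source is cheaper than dropping in the pipeline: the record is never serialized, shipped, or stored.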
Build holistic insights: Try to collect metrics from each application component such as infrastructure, application, serverless, middleware, or database. Moreover, we should ensure that logging is present for every piece, including third-party solutions. Lastly, configure logs for metadata, such as date and time, username, IP address, service name, status codes, tags, caller name, etc. Additionally, when the observability software processes data, it will have active granular metrics and passive monitoring through automated processes. Both these will combine to give us a more holistic view of our ecosystem and a more accurate insight.
Turn on data logging: We should standardize the mediums through which data is logged, so that the observability system can collect and aggregate the data and understand the context behind data patterns.
Integrate AI/ML: Artificial intelligence and machine learning algorithms are beneficial for incorporating automation into observability. They help us store and process vast amounts of data and surface patterns or insights that can improve application efficiency. They also reduce human error and enable scaling with ease.
Integrate observability tools with automated remediation systems: Many observability tools will discover issues or problems related to the kernel or operating system level. And most IT system administrators already have automated routines to fix those issues. In this scenario, integrating the observability software with the existing ecosystem will help us to maintain an optimized environment. Also, for areas where automation is not feasible, this integration will help us to filter out and fix issues and allow us to focus on business problems affecting user experience.
Create contextual reports: We shouldn't view observability merely as a tool for IT system administrators or SRE teams; it has a deeper purpose. It aims to bridge the gap between IT and business people by providing contextual reports on what we should do to improve system performance. Such a report offers different persona views: root-cause and trend analysis for SREs, and the corresponding business impact framed so that management can easily understand it.
Identify the right vendors: Organizations need to figure out the tools that give them the best possible visibility into architecture and infrastructure, which are most interoperable, and reduce operational time and costs.
Now that we understand observability's importance, components, benefits, challenges, and best practices, the focus shifts to choosing the observability tool best suited to our problems. How do we choose, and what criteria must be met?
Criteria for a Good Observability Tool

Different tooling for observability building blocks
The market offers multiple tools that ingest logs, metrics, and traces to observe a system's availability. Let's look at some widely used tools for each building block.
Prometheus for Metrics: Prometheus is a monitoring solution for recording and processing any purely numeric time series. It gathers, organizes, and stores metrics along with unique identifiers and timestamps.
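As a rough sketch of what Prometheus actually scrapes, the function below renders samples in the Prometheus text exposition format. In practice the official client libraries (such as prometheus_client) generate this for you; this hand-rolled version is only to show the shape of the data.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in Prometheus text exposition format.
    `samples` is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render_metric(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    [({"method": "get", "code": "200"}, 1027),
     ({"method": "post", "code": "500"}, 3)],
))
```

Prometheus periodically scrapes such a `/metrics` endpoint and stores each labeled sample as its own time series.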
ELK for Logs: Logstash collects logs and event data, parses and transforms them, and sends the results to Elasticsearch. Elasticsearch stores and indexes the transformed data. Kibana is a visualization tool that runs alongside Elasticsearch, allowing users to analyze the data and build powerful reports.
Lightstep / Dynatrace / Datadog / SigNoz for Traces: These products offer different sampling mechanisms for the data captured during trace ingestion.
The most powerful and commonly used tool that serves the purpose of capturing logs, metrics, and traces is OpenTelemetry. I'll talk about OpenTelemetry in the next post.