If you've ever worked in an on-premises environment, you know that if your application becomes unresponsive, somebody can walk in, physically check why, and resolve the issue. Just saying ;) The point is that you know how it is deployed, which physical server it runs on, how it is connected to the network, and so on.
But in the cloud, how do you check how your deployment is behaving, or whether something is broken in production and your users might be impacted?
How do you know what's happening with your server, database, or application?
The answer is that there are too many tools to look through to find and resolve the issue. In the cloud, you need specific tools for capturing telemetry, logging, and trace data.
Now, the question is "HOW?"
That's where Cloud Operations comes in.
Google Cloud Operations is a suite of products to monitor, troubleshoot, and operate your services at scale, enabling your DevOps, SRE, or IT operations teams to apply Google's SRE best practices. It also adds advanced observability features, including a debugger and a profiler.
The service provides monitoring, logging, and diagnostics services to ensure good performance and availability. It gathers performance metrics and metadata from multiple cloud projects and lets IT teams view that data through custom monitoring dashboards, charts, and reports. Cloud Operations also enables organizations to troubleshoot incidents as they arise.
Google Cloud Operations is natively integrated with GCP services and hosted on Google infrastructure. The monitoring capabilities can also be used for applications and VMs that run on Amazon Elastic Compute Cloud (EC2). In addition, it can pull performance data from open source systems such as Cassandra, Nginx, Prometheus, and Elasticsearch.
A brief history: Google Stackdriver was a monitoring service that provided IT teams with performance data about applications and VMs running on GCP and AWS. In 2020, Stackdriver was upgraded with new features and rebranded as part of the Google Cloud Operations suite of tools.
How the components play together
Telemetry, logging, and trace data are captured from both the hardware layer and the application layer of Google Cloud products.
From these products, signal data flows into the Cloud Operations tools, where it can be visualized in dashboards and through the Metrics Explorer. Automated and custom logs can be ingested and analyzed in the log viewer. Services can be monitored for compliance with their service-level objectives (SLOs), and error budgets can be tracked. Health checks can be used to check uptime and latency for external-facing sites and services, and running applications can be debugged and profiled.
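To make the idea of tracking an error budget concrete, here is a minimal sketch of the arithmetic behind a request-based SLO. It is not the Cloud Operations API; the function name `error_budget`, its parameters, and the sample numbers are all assumptions made up for illustration.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a simple request-based error budget (illustrative only).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    All names here are hypothetical, not part of any Google Cloud library.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or failed_requests < 0:
        raise ValueError("request counts must be positive / non-negative")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget already used up by observed failures.
    consumed_fraction = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": consumed_fraction,
        "exhausted": failed_requests > allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    budget = error_budget(0.999, 1_000_000, 250)
    print(f"Budget used: {budget['consumed_fraction']:.1%}")
    print(f"Budget exhausted: {budget['exhausted']}")
```

With a 99.9% target, roughly one request in a thousand is allowed to fail, so the budget shrinks as failures accumulate, and an exhausted budget is the signal to slow down releases and focus on reliability.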
When incidents occur, Cloud Operations can generate automated alerts that notify personnel through various notification channels. Error reporting can help operations and developer teams spot, count, and analyze crashes in their cloud-based services.
Finally, all the visualization and analysis tools can help you troubleshoot what is happening in your Google Cloud environment.
Cloud Operations use cases
I bet that keeping your applications up and running and making sure your customers are happy is your #1 priority, and Cloud Operations helps you do just that. Whether you use SLOs that work across all application types and cloud environments or error reporting to identify bugs in your applications, Cloud Operations gives your ops teams out-of-the-box observability to monitor your infrastructure and your applications.