Cloud Operations offers many ways to improve the observability and reliability of your services running on Google Cloud. But how should you approach setting up and using the suite as your cloud footprint grows? Before we start, let's take a step back and answer two simple questions.
- Why do we do all of these things?
- What are we trying to achieve, overall, when using a suite like Cloud Operations?
Ultimately, we're after three main objectives:
- Security and Compliance
- Monitoring and Observability
- Cost Optimization
The first objective is to make sure that our services are secure and that we comply with the policies and regulations relevant to our context. That means we need to manage things like data access and maintain an audit trail.
The second objective is keeping our users happy. That means we need to know when the user experience degrades, ideally through SLOs, and what to do about it.
The final goal is managing the capacity required to run our services while ensuring that we optimize for cost.
Enterprise best practices
1) Centralize monitoring, auditing and observability data in a single place.
For services running on Google Cloud, we recommend using Cloud Operations to do just that. Having this information in a single repository with easy cross-team access greatly improves practices like architecture design, incident response, and root cause analysis.
For example, if I know the track record of historical reliability of my dependencies, I can better architect my own service to meet my reliability objectives. At the same time, I can do a better job of incident response if I can see whether the services I depend on are having an incident. It's really difficult to do things like this without having all this information in a single place.
2) Ensure that a clear and easily accessible audit trail exists for your services.
Use Cloud Audit Logs to help answer questions like who did what, where, and when in your Google Cloud projects, and manage IAM controls to limit who has access to view audit logs.
Cloud Audit Logs provides the following audit logs for each Cloud project, folder, and organization:
- Admin Activity audit logs
- Data Access audit logs
- System Event audit logs
- Policy Denied audit logs
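As a sketch of how to inspect these logs in practice, the Admin Activity log for a project can be queried with the gcloud CLI (the project ID below is a placeholder; the command assumes an authenticated gcloud environment):

```shell
# List the 10 most recent Admin Activity audit log entries.
# Replace my-project with your own project ID.
gcloud logging read \
  'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"' \
  --project=my-project \
  --limit=10 \
  --format=json
```

The same pattern works for the other log types by swapping the log name suffix (for example, `%2Fdata_access` for Data Access audit logs).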
- As you make audit log configuration changes, use a test Google Cloud project to validate them before propagating to developer and production projects.
- When considering how to grant access to audit logs across the organization, we recommend a least-privilege approach to granting permissions.
- Determine whether you need to export logs for longer-term retention
- Configure log sinks before you start receiving logs
- Set appropriate IAM controls against the export sink destination
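To make the sink recommendation concrete, here is a hedged sketch of creating a project-level sink that routes audit logs to BigQuery before they start arriving (the project ID, dataset, and sink name are all placeholders):

```shell
# Route audit log entries to a BigQuery dataset for longer-term retention.
gcloud logging sinks create audit-logs-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
  --project=my-project \
  --log-filter='logName:"cloudaudit.googleapis.com"'
```

Creating the sink prints a generated writer service account; remember to grant that identity write access to the destination (for example, the BigQuery Data Editor role on the dataset), which is part of setting appropriate IAM controls on the export destination.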
Note that data access audit logs are off by default, except for BigQuery.
When you enable new Google Cloud services, evaluate whether to enable data access audit logs for each new service.
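Data Access audit logs are enabled per service through the `auditConfigs` section of a project's IAM policy. A minimal sketch, assuming the project has no existing `auditConfigs` stanza and using Cloud Storage as the example service:

```shell
# Enable Data Access audit logs for Cloud Storage by adding an
# auditConfigs stanza to the project's IAM policy.
gcloud projects get-iam-policy my-project --format=yaml > policy.yaml

cat >> policy.yaml <<'EOF'
auditConfigs:
- service: storage.googleapis.com
  auditLogConfigs:
  - logType: DATA_READ
  - logType: DATA_WRITE
EOF

gcloud projects set-iam-policy my-project policy.yaml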
3) Thoughtfully manage log retention and analysis.
Consider the various retention needs for compliance requirements, security and access analytics and event analytics.
The first way to do this is by using log views. You can use log views to control who has access to the logs within your log buckets, based on your needs and requirements.
Custom log views
Custom log views provide a granular way to control access to the logs in your log buckets.
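As an illustrative sketch, a view that exposes only Compute Engine logs from the `_Default` bucket could be created like this (the view ID, filter, and project ID are placeholders):

```shell
# Create a log view scoped to Compute Engine instance logs.
gcloud logging views create compute-logs-view \
  --bucket=_Default \
  --location=global \
  --project=my-project \
  --log-filter='resource.type="gce_instance"'
```

You can then grant individual users or groups access to that view rather than to the whole bucket, which supports the least-privilege approach described earlier.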
You can also carefully control log retention by using project-level or aggregated sinks to export logs to BigQuery, Cloud Storage, and Pub/Sub for analytics, archival, or consumption by other systems.
Each of these destinations has its own mechanisms for controlling retention and access, and you should consider those carefully as you plan your retention strategy.
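Retention can also be tuned directly on the log buckets themselves; as a sketch, the following sets a 90-day retention period on the `_Default` bucket (the value is illustrative, so align it with your own compliance requirements):

```shell
# Extend retention on the _Default log bucket to 90 days.
gcloud logging buckets update _Default \
  --location=global \
  --project=my-project \
  --retention-days=90
```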
I hope these recommendations help you keep your services secure and your users happy.