CloudWatch Insights Cookbook

Overview

This document collects particularly helpful CloudWatch Logs Insights queries, along with notes on what makes them useful.

Queries

Grafana Query Explorer Link: https://tiny.amazon.com/c1s8p1c1/ga10grafusweamazexpl

CloudWatch Query Explorer Link: https://tiny.amazon.com/6sp3n0i7/condsecua2zapicons

Requests by a customer ID

Log Group: /aws/lambda/MusicFirefly-prod-graphqlCore

fields
  @timestamp,
  payload.requestContext.authorizer.customerId as customerId,
  @message |
filter @message like "GraphqlCore event" |
sort @timestamp desc |
limit 10

Request client details

Log Group: /aws/lambda/MusicFirefly-prod-graphqlCore

fields
  @requestId as requestId,
  `payload.headers.x-api-key` as clientApiKey,
  payload.requestContext.authorizer.principalId as principalId,
  payload.requestContext.authorizer.deviceType as deviceType,
  payload.requestContext.authorizer.deviceFamily as deviceFamily,
  payload.requestContext.authorizer.deviceId as deviceId,
  @message |
filter @message like "GraphqlCore event" |
sort @timestamp desc |
limit 10

Notes:

  • I alias @requestId as requestId because, in the collapsed preview, Grafana shows the first key that doesn’t start with @. Without this alias, the preview would show the principalId.

  • I enclose payload.headers.x-api-key in backticks. When a key contains a hyphen, CloudWatch Logs Insights does not match it unless it is wrapped in backticks, and the field comes back null in your results.
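If you build queries from a script, the backtick note above can be applied mechanically. This is a minimal sketch in Python, assuming the rule is that any field name containing characters other than letters, digits, @ and . needs backticks. quote_field is a hypothetical helper, not part of any AWS SDK.

```python
import re

# Assumption: names made only of letters, digits, "@" and "." are safe unquoted.
# Anything else (such as the hyphens in "x-api-key") gets wrapped in backticks.
_SAFE_FIELD = re.compile(r"[A-Za-z0-9@.]+")


def quote_field(name: str) -> str:
    """Return the field name, wrapped in backticks if it needs quoting."""
    if _SAFE_FIELD.fullmatch(name):
        return name
    return f"`{name}`"


print(quote_field("payload.headers.x-api-key"))
print(quote_field("payload.requestContext.authorizer.deviceId"))
```

The first call returns the name wrapped in backticks and the second returns it unchanged. The helper quotes more than the hyphen case on purpose, on the assumption that an unneeded pair of backticks is harmless.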

Logs for a single request ID

Log Group: /aws/lambda/MusicFirefly-prod-graphql

fields
  @timestamp,
  @message |
filter level != "METRIC"
  and @requestId = "65a0c72b-4b66-47c3-a633-396084e8587e" |
sort @timestamp desc

Notes:

  • I filter by level != "METRIC". If you want to look up specific metrics, query our Timestream metrics store rather than CloudWatch Logs.

Getting auth errors for Amazon Music clients

Log Group: /aws/lambda/MusicFirefly-prod-graphql

fields @timestamp, @message,
  payload.context.deviceFamily as DeviceFamily,
  payload.context.deviceType as DeviceType,
  payload.context.error as AuthError,
  payload.principalId as Client
| sort @timestamp desc
| filter payload.principalId like "Unknown"
| filter level like "ALERT"
| filter message like "Returned Auth Policy"
| stats count(*) as Count by Client, DeviceFamily, DeviceType, AuthError
| limit 20

Graph of stratus timeout errors

Log Group: /aws/lambda/MusicFirefly-prod-graphql

fields @timestamp, @message
| sort @timestamp desc
| filter @message like "ERROR"
| filter @message like "stratus"
| stats count(payload.response.data.message) by bin(5min)

Getting latency stats

Our Grafana metrics are retained for only a couple of days. If you need metrics from further back, you can query CloudWatch like so:

fields @timestamp, @message, payload.metricValue
| sort @timestamp desc
| filter level = 'METRIC'
    and payload.dimensions.RequestTrace = 'Auth'
    and payload.metricName = 'Duration'
| stats pct(payload.metricValue, 99) as p99,
        pct(payload.metricValue, 90) as p90,
        pct(payload.metricValue, 50) as p50 by bin(5m)

Getting error logs from a single service

Because most/all HTTP requests are sent with Axios, we can filter on the Axios log line format, which includes the URL being requested. Copy the URL from one request to the service you care about and put it in the filter line below:

fields @timestamp, @message, @logStream, @log
| filter @message like '[Axios][Error] POST https://mis-q1t-vui-na-p-tcp.iad.amazon.com'
| sort @timestamp desc
| limit 20
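If you look up several services this way, the filter line can be generated from a copied request URL. This is a rough sketch in Python. axios_error_filter is a hypothetical helper, and trimming the URL down to scheme and host is an assumption based on the filter above, which matches on the host only.

```python
from urllib.parse import urlsplit


def axios_error_filter(request_url: str, method: str = "POST") -> str:
    """Build the filter line for Axios error logs of one service."""
    parts = urlsplit(request_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {request_url!r}")
    # Assumption: matching on scheme + host is enough to identify the service.
    origin = f"{parts.scheme}://{parts.netloc}"
    return f"| filter @message like '[Axios][Error] {method} {origin}'"


# Hypothetical URL, used only to show the shape of the output.
print(axios_error_filter("https://service.example.com/path/op?x=1"))
# | filter @message like '[Axios][Error] POST https://service.example.com'
```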

Getting TPS estimates for given auth types

Run the following query on both the graphql and graphqlCore log groups (to cover cross-region calls), then divide each total by the number of seconds in the query time range. For example, for a 1-day range, divide the totals by 60 * 60 * 24 = 86,400. Make sure to check each of the 3 regions.

fields @timestamp, @message
| filter message = 'GraphqlCore event'
| parse payload.headers.Authorization '* *' as authType, authKey
# If no auth is set, it MUST be IAM.
| display coalesce(authType, 'IAM') as authTypeFmt
| stats count(*) by authTypeFmt
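The division step can be sketched in Python as below. estimate_tps is a hypothetical helper, and the counts in the example are made up to show the arithmetic; they are not real traffic numbers.

```python
def estimate_tps(counts_by_auth_type: dict[str, int], range_seconds: int) -> dict[str, float]:
    """Convert per-auth-type request counts into average requests per second."""
    if range_seconds <= 0:
        raise ValueError("range_seconds must be positive")
    return {
        auth_type: count / range_seconds
        for auth_type, count in counts_by_auth_type.items()
    }


ONE_DAY_SECONDS = 60 * 60 * 24  # 86,400 seconds in a 1-day query range

# Hypothetical totals, as copied from the "stats count(*) by authTypeFmt" table.
print(estimate_tps({"Bearer": 864000, "IAM": 86400}, ONE_DAY_SECONDS))
# {'Bearer': 10.0, 'IAM': 1.0}
```

This gives the average rate over the whole range, so it will understate peak TPS. A shorter time range around the busiest period gives a closer estimate of the peak.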

Displaying count stats for a parsed field

In this case, we wanted to track any service APIs (the ‘target’) being called without the TransitiveAuth token attached, to confirm the token isn’t being dropped from any important HTTP requests. First we extract the target field, then we filter on the Axios request format to keep only outgoing HTTP requests, then we keep only the requests without the TransitiveAuth token, and lastly we group by target to show a table w/ counts.

fields @timestamp, @message, @logStream, @log
| parse @message '"X-Amz-Target":"*",' as target
| filter @message like '[Axios][Request] POST'
| filter @message not like 'x-amzn-transitive-authentication-token'
| stats count(*) by target
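To sanity-check the same logic against a handful of exported log lines, the steps can be approximated locally. This is a sketch in Python, not what Logs Insights runs internally: the regex stands in for the glob pattern in the parse line, `in` stands in for like, and count_targets_without_token is a hypothetical helper. Messages where the pattern does not match are counted under None here.

```python
import re
from collections import Counter

# Rough equivalent of: parse @message '"X-Amz-Target":"*",' as target
TARGET_RE = re.compile(r'"X-Amz-Target":"(.*?)",')
TOKEN_HEADER = "x-amzn-transitive-authentication-token"


def count_targets_without_token(messages):
    """Count outgoing Axios POST requests lacking the TA token, by target."""
    counts = Counter()
    for message in messages:
        # filter @message like '[Axios][Request] POST'
        if "[Axios][Request] POST" not in message:
            continue
        # filter @message not like 'x-amzn-transitive-authentication-token'
        if TOKEN_HEADER in message:
            continue
        match = TARGET_RE.search(message)
        # stats count(*) by target
        counts[match.group(1) if match else None] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    '[Axios][Request] POST https://svc.example {"X-Amz-Target":"Svc.OpA","a":"b"}',
    '[Axios][Request] POST https://svc.example {"X-Amz-Target":"Svc.OpA","x-amzn-transitive-authentication-token":"t"}',
    '[Axios][Response] POST https://svc.example {"X-Amz-Target":"Svc.OpB","a":"b"}',
]
print(count_targets_without_token(sample))
# Counter({'Svc.OpA': 1})
```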

[DMA] Transitive Auth Queries

Number of Service calls NOT using TransitiveAuth (grouped by API)

fields @timestamp, @message, @logStream, @log
| parse @message '"X-Amz-Target":"*",' as target
| filter @message like '[Axios][Request] POST'
| filter @message not like 'x-amzn-transitive-authentication-token'
| stats count(*) by target

Number of requests w/ TA Token passed in from client

fields @timestamp, @message, @logStream, @log
| filter @message like 'GraphqlCore event'
| parse @message '"x-amzn-transitive-authentication-token": "*",' as taToken
| stats count(*) by isPresent(taToken)

Number of MIS responses w/ TA Token (and without)

fields @timestamp, @message, @logStream, @log
| filter @message like '[Axios][Response] POST https://mis-q1t-vui-na-p-tcp.iad.amazon.com'
| filter @message not like 'musicRequestIdentityContext'
| filter @message not like 'profileIdentityDirectedId'
| parse @message '"transitiveAuthToken":"*"' as taToken
| stats count(*) by isPresent(taToken)

Note that where the token is not present in the above query, a placeholder token is being used.

Number of MIS responses w/ TA token (for IAM callers specifically)

fields @timestamp, @message, @logStream, @log
| filter @message like '[Axios][Response] POST https://mis-q1t-vui-na-p-tcp.iad.amazon.com'
| filter @message not like 'musicRequestIdentityContext'
| filter @message not like 'profileIdentityDirectedId'
| filter @message not like 'customerId'
| parse @message '"transitiveAuthToken":"*"' as taToken
| stats count(*) by isPresent(taToken)

Number of requests w/ Placeholder token for each auth scenario

fields @timestamp, @message, @logStream, @log
| filter @message like 'Generated Placeholder TA token for auth scenario'
| parse @message 'Generated Placeholder TA token for auth scenario *"' as authScenario
| stats count(*) by authScenario