ElastiCache CDK Runbook¶

This runbook provides detailed instructions for creating new ElastiCache clusters using the CDK infrastructure. It is designed for on-call engineers who need to add or modify ElastiCache resources.

Table of Contents¶

Architecture Overview
Pre-requisites
Step-by-Step Guide
Configuration Options
Regional Considerations
Deployment Process
Testing and Validation
Troubleshooting

Architecture Overview¶

Our ElastiCache infrastructure uses AWS CDK to define and deploy Redis/Valkey clusters. The architecture consists of:

ElastiCacheStack: Main stack that creates all required ElastiCache resources
NestedElasticacheStack: Creates individual cache clusters with specific configurations
CacheNetwork: Sets up networking resources (subnet groups, security groups)
CacheAuthToken: Manages authentication for clusters

Currently deployed clusters:

auth-valkey: Used for authentication services
cloudauth-valkey: Used for cloud authentication services
ratelimit-valkey: Used for rate limiting functionality

Pre-requisites¶

Before creating new ElastiCache clusters, ensure you have:

An estimate of the cache memory and network utilization for the new cluster. Ref: https://quip-amazon.com/JdIeAXIPWlbg/Valkey-Caches-node-calculation
The VPC IDs for the target environments (check vpc-config.ts)
A clear understanding of the specific requirements for the new cluster

Step-by-Step Guide¶

1. Define Cluster Configuration¶

First, add your new cluster configuration in cdk/lib/elasticache/cacheConfig.ts:

// For production environments
const defaultProdConfigs: { [key: string]: ClusterReplicationGroupConfig } = {
    // Existing configurations...
    
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
        MinReplicasCount: 3, // Adjust based on needs
        MinShardsCount: 2,   // Adjust based on needs
        MaxShardsCount: 10,  // Maximum capacity for scaling
        NodeType: NodeType.M7G_XLARGE, // Choose appropriate node type
        isClustered: true,   // true if using Redis Cluster mode
        family: CacheFamily.VALKEY8, // Choose valkey8 or redis7
        evictionType: EvictionType.VOLATILE_LRU, // Memory management policy
        isTransitEnabled: true,  // Enable in-transit encryption
        isRestEnabled: true,     // Enable at-rest encryption
        isAuthEnabled: true,     // Enable authentication
        timeout: '3600',         // Connection timeout in seconds
        engine: 'valkey'         // 'valkey' or 'redis'
    },
};

// For development environments
export const cacheClusterConfigDev: { [key: string]: ClusterReplicationGroupConfig } = {
    // Existing configurations...
    
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
        ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
        MinReplicasCount: 1,
        MinShardsCount: 1,
    }
};
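Before deploying, it can help to sanity-check a new entry. The following is a minimal sketch, not part of the existing codebase: `ClusterConfigSketch` mirrors only a subset of the fields shown above (it is not the real `ClusterReplicationGroupConfig` type), and `validateClusterConfig` is a hypothetical helper. The rules it checks reflect general ElastiCache constraints (0 to 5 replicas per shard, auth tokens requiring in-transit encryption, multiple shards requiring cluster mode).

```typescript
// Hypothetical shape mirroring a subset of the fields shown above; not the real type.
interface ClusterConfigSketch {
    MinReplicasCount: number;
    MinShardsCount: number;
    MaxShardsCount: number;
    isClustered: boolean;
    isTransitEnabled: boolean;
    isAuthEnabled: boolean;
}

// Returns a list of human-readable problems; an empty list means none were found.
function validateClusterConfig(cfg: ClusterConfigSketch): string[] {
    const problems: string[] = [];
    if (cfg.MinShardsCount < 1) {
        problems.push('MinShardsCount must be at least 1');
    }
    if (cfg.MinShardsCount > cfg.MaxShardsCount) {
        problems.push('MinShardsCount exceeds MaxShardsCount');
    }
    if (!cfg.isClustered && cfg.MinShardsCount > 1) {
        problems.push('more than one shard requires isClustered: true');
    }
    if (cfg.MinReplicasCount < 0 || cfg.MinReplicasCount > 5) {
        problems.push('MinReplicasCount must be between 0 and 5');
    }
    if (cfg.isAuthEnabled && !cfg.isTransitEnabled) {
        problems.push('auth tokens require in-transit encryption');
    }
    return problems;
}

// Same pattern as above: the dev entry spreads prod and overrides the sizing.
const prodSketch: ClusterConfigSketch = {
    MinReplicasCount: 3,
    MinShardsCount: 2,
    MaxShardsCount: 10,
    isClustered: true,
    isTransitEnabled: true,
    isAuthEnabled: true,
};
const devSketch: ClusterConfigSketch = { ...prodSketch, MinReplicasCount: 1, MinShardsCount: 1 };

console.log(validateClusterConfig(prodSketch)); // []
console.log(validateClusterConfig(devSketch));  // []
```

Because the dev entry spreads the prod entry, any invalid flag in prod carries over to dev, so validating both is worthwhile.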

2. Add Your Cluster to the ReplicationGroupName Enum¶

In cdk/lib/elasticache/cacheInterface.ts, add your new cluster name to the enum:

export enum ReplicationGroupName {
    RATELIMIT = 'ratelimit-valkey',
    AUTH = 'auth-valkey',
    CLOUDAUTH = 'cloudauth-valkey',
    YOUR_NEW_CLUSTER = 'your-new-cluster-valkey', // Add your new cluster
}
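The enum value becomes the replication group name, so it should satisfy the ElastiCache naming rules for replication group IDs: 1 to 40 characters, only letters, digits, and hyphens, starting with a letter, with no trailing hyphen and no two consecutive hyphens. The helper below is a hypothetical sketch for checking a candidate name before adding it; `isValidReplicationGroupId` does not exist in the codebase.

```typescript
// Hypothetical helper: checks a candidate name against the ElastiCache
// replication group ID rules described above.
function isValidReplicationGroupId(id: string): boolean {
    if (id.length < 1 || id.length > 40) {
        return false;
    }
    // Must start with a letter and contain only letters, digits, and hyphens.
    if (!/^[A-Za-z][A-Za-z0-9-]*$/.test(id)) {
        return false;
    }
    // No trailing hyphen and no two consecutive hyphens.
    if (/-$/.test(id) || /--/.test(id)) {
        return false;
    }
    return true;
}

console.log(isValidReplicationGroupId('your-new-cluster-valkey')); // true
console.log(isValidReplicationGroupId('1-starts-with-digit'));     // false
```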

3. Add Regional Configuration (if needed)¶

If your cluster requires region-specific configurations, update the cacheClusterConfigProd and cacheClusterAutoScalingConfigProd objects in cdk/lib/elasticache/cacheConfig.ts:

// Region-specific prod configurations
export const cacheClusterConfigProd: { [region: string]: { [key: string]: ClusterReplicationGroupConfig } } = {
    'us-east-1': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
            // Override specific parameters for this region if needed
        },
    },
    'us-west-2': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
            MinReplicasCount: 2,  // Example region-specific override
            MinShardsCount: 2,
        },
    },
    'eu-west-1': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
        },
    },
};
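The regional entries above rely on object spread: the defaults are copied first, and any property listed afterwards replaces the default value. The sketch below illustrates that behavior in isolation. The names `SizeConfigSketch`, `sizeDefaults`, `regionOverridesSketch`, and `resolveRegionConfig` are hypothetical and reduced to two fields for brevity.

```typescript
// Hypothetical, reduced config shape used only to illustrate override semantics.
interface SizeConfigSketch {
    MinReplicasCount: number;
    MinShardsCount: number;
}

const sizeDefaults: SizeConfigSketch = { MinReplicasCount: 3, MinShardsCount: 2 };

// Only regions that differ from the defaults need an entry here.
const regionOverridesSketch: { [region: string]: Partial<SizeConfigSketch> } = {
    'us-west-2': { MinReplicasCount: 2 },
};

// Later spreads win, so region-specific values replace the defaults.
function resolveRegionConfig(region: string): SizeConfigSketch {
    const overrides = regionOverridesSketch[region] || {};
    return { ...sizeDefaults, ...overrides };
}

console.log(resolveRegionConfig('us-west-2')); // { MinReplicasCount: 2, MinShardsCount: 2 }
console.log(resolveRegionConfig('eu-west-1')); // { MinReplicasCount: 3, MinShardsCount: 2 }
```

Order matters: placing the spread after the overrides would silently restore the defaults.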

4. Update ElastiCacheStack to Include the New Cluster¶

In cdk/lib/stacks/ElastiCacheStack.ts, add your new cluster to the cacheClusters array:

// List of all cache clusters
const cacheClusters = [
    { name: 'auth-valkey' },
    { name: 'cloudauth-valkey' },
    { name: 'ratelimit-valkey' },
    { name: 'your-new-cluster-valkey' }, // Add your new cluster here
];
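Since the names in this array must match the enum values from step 2, one way to avoid the two drifting apart is to derive the list from the enum. This is only a sketch of that idea: `ReplicationGroupNameSketch` stands in for the real enum, and whether the real stack can adopt this pattern is an assumption to verify against ElastiCacheStack.ts.

```typescript
// Stand-in for the real ReplicationGroupName enum (string-valued).
enum ReplicationGroupNameSketch {
    RATELIMIT = 'ratelimit-valkey',
    AUTH = 'auth-valkey',
    CLOUDAUTH = 'cloudauth-valkey',
}

// Build the { name } entries from the enum so a new member is picked up automatically.
const derivedCacheClusters = (Object.keys(ReplicationGroupNameSketch) as Array<
    keyof typeof ReplicationGroupNameSketch
>).map((key) => ({ name: ReplicationGroupNameSketch[key] as string }));

console.log(derivedCacheClusters.map((c) => c.name).join(', '));
// ratelimit-valkey, auth-valkey, cloudauth-valkey
```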

5. Verify VPC Configuration¶

Ensure the required VPC IDs are available in cdk/lib/elasticache/vpc-config.ts. If adding a new region, update the VPC mappings:

export const VPC_MAPPING: VpcMapping = {
  FireFly: {
    dev: {
      "us-east-1": "vpc-0403836a59c5b52b6",
      "us-west-2": "vpc-0bf609a339d704999",
      "eu-west-1": "vpc-0406c3422f20e3e8d",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
    prod: {
      "us-east-1": "vpc-0236dce4046db1108",
      "us-west-2": "vpc-073872f3c36c9e31a",
      "eu-west-1": "vpc-08cf128db50ac9e8f",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
  },
};
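A missing region entry is easier to diagnose when the lookup fails loudly. The helper below is a hypothetical sketch (`getVpcIdSketch` and `vpcMappingSketch` are illustrative names, and the mapping is trimmed to the dev and prod IDs for two regions from the example above); the real lookup in vpc-config.ts may differ.

```typescript
type StageSketch = 'dev' | 'prod';

// Trimmed copy of the mapping above, for illustration only.
const vpcMappingSketch: Record<StageSketch, { [region: string]: string }> = {
    dev: {
        'us-east-1': 'vpc-0403836a59c5b52b6',
        'us-west-2': 'vpc-0bf609a339d704999',
    },
    prod: {
        'us-east-1': 'vpc-0236dce4046db1108',
        'us-west-2': 'vpc-073872f3c36c9e31a',
    },
};

// Returns the VPC ID or throws with a message that names the missing entry.
function getVpcIdSketch(stage: StageSketch, region: string): string {
    const vpcId = vpcMappingSketch[stage][region];
    if (!vpcId) {
        throw new Error(`No VPC configured for ${stage}/${region}; update vpc-config.ts`);
    }
    return vpcId;
}

console.log(getVpcIdSketch('dev', 'us-east-1')); // vpc-0403836a59c5b52b6
```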

Configuration Options¶

Node Types¶

Choose the appropriate instance type based on workload requirements:

Node Type	vCPU	Memory	Use Case
cache.m7g.large	2	6.5 GB	General purpose
cache.m7g.xlarge	4	13 GB	General purpose
cache.r7g.large	2	13.1 GB	Memory optimized
cache.r7g.xlarge	4	26.3 GB	Memory optimized
cache.r7g.2xlarge	8	52.5 GB	Memory optimized
cache.r7g.16xlarge	64	420.0 GB	Memory optimized
cache.c7gn.large	2	3.2 GB	Compute optimized, network optimized
cache.c7gn.2xlarge	8	12.8 GB	Compute optimized, network optimized
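As a rough first pass at choosing between these node types, the shard count can be estimated from the dataset size and the memory per node. The function below is a simplified sketch, not the team's sizing method (use the node calculation document linked in the pre-requisites for that). It assumes 25% of node memory is reserved for overhead, and `estimateShardCount` is a hypothetical name.

```typescript
// Hypothetical estimate: shards = ceil(dataset / usable memory per node).
// Assumption: a fraction of each node's memory (25% here) is reserved for
// overhead such as replication buffers and snapshots.
function estimateShardCount(
    datasetGb: number,
    nodeMemoryGb: number,
    reservedFraction: number = 0.25
): number {
    if (datasetGb <= 0 || nodeMemoryGb <= 0) {
        throw new Error('datasetGb and nodeMemoryGb must be positive');
    }
    if (reservedFraction < 0 || reservedFraction >= 1) {
        throw new Error('reservedFraction must be in [0, 1)');
    }
    const usablePerNodeGb = nodeMemoryGb * (1 - reservedFraction);
    return Math.ceil(datasetGb / usablePerNodeGb);
}

// 100 GB dataset, using the memory figures from the table above.
console.log(estimateShardCount(100, 13));   // cache.m7g.xlarge -> 11
console.log(estimateShardCount(100, 26.3)); // cache.r7g.xlarge -> 6
```

This covers memory only. Network throughput and CPU can also drive the choice, which is why the compute and network optimized types are listed.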

Eviction Policies¶

Policy	Description	Use Case
volatile-lru	Evict keys with TTL using LRU	General purpose, when TTLs are set
allkeys-lru	Evict any key using LRU	When you want to use Redis as an LRU cache
volatile-lfu	Evict keys with TTL using LFU	Better than LRU when access frequency varies
allkeys-lfu	Evict any key using LFU	When some items are accessed more frequently
volatile-random	Evict random keys with TTL	When keys with a TTL are accessed uniformly, so no key is a better eviction candidate than another
volatile-ttl	Evict keys with shortest TTL	When shorter TTL indicates lower value
noeviction	Return errors when memory is full	When data loss is not acceptable
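The main distinction in the table is which keys a policy is allowed to evict. The sketch below encodes just that rule; it is an illustration, not engine code, and `isEvictionCandidate` is a hypothetical name.

```typescript
type EvictionPolicySketch =
    | 'volatile-lru'
    | 'allkeys-lru'
    | 'volatile-lfu'
    | 'allkeys-lfu'
    | 'volatile-random'
    | 'volatile-ttl'
    | 'noeviction';

// Returns whether a key may be evicted under the given policy.
function isEvictionCandidate(policy: EvictionPolicySketch, keyHasTtl: boolean): boolean {
    if (policy === 'noeviction') {
        return false; // nothing is evicted; writes fail once memory is full
    }
    if (policy.indexOf('allkeys-') === 0) {
        return true; // any key may be evicted
    }
    return keyHasTtl; // volatile-* policies only consider keys with a TTL
}

console.log(isEvictionCandidate('volatile-lru', false)); // false
console.log(isEvictionCandidate('allkeys-lfu', false));  // true
```

One consequence: with a volatile-* policy, a cache whose keys have no TTL has nothing to evict and behaves like noeviction when memory fills up.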

Regional Considerations¶

Each region may require different configurations based on:

Traffic patterns: Higher traffic regions may need more shards and replicas
Data residency requirements: Some clusters may need specific encryption settings
Performance requirements: Response time SLAs might dictate instance types
Cost optimization: Lower traffic regions can use smaller/fewer instances

Currently supported regions:

us-east-1 (N. Virginia)
us-west-2 (Oregon)
eu-west-1 (Ireland)

To add a new region, update both the VPC configuration and region-specific cache settings.

Deployment Process¶

Development Environment¶

Deploy to the dev environment first to validate the configuration:

cd cdk
yarn deployElastiCache:iad   # us-east-1
yarn deployElastiCache:pdx   # us-west-2
yarn deployElastiCache:dub   # eu-west-1

Production Environment¶

After testing in dev, submit an MR.

Include [cache-deployment] in the commit message to trigger the dedicated cache pipeline. Ref: https://quip-amazon.com/61r8AblahGLI/Metalfly-Cache-Pipeline-Setup

For details, refer to the CI/CD pipelines defined in the pipelines/ directory.

Testing and Validation¶

After deployment, verify that:

Cluster is accessible: Test connectivity from appropriate services
Authentication works: If auth is enabled, verify credentials work
Performance meets expectations: Check throughput and latency
Autoscaling functions correctly: Monitor CloudWatch metrics during load tests

Connectivity Testing¶

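You can test connectivity using the Redis CLI (redis-cli). The -c flag enables cluster mode so that the CLI follows slot redirects:

# With auth (in-transit encryption enabled)
redis-cli -h <endpoint without port> -c -a <auth-token> --tls

# Without auth (add --tls if in-transit encryption is enabled on the cluster)
redis-cli -h <endpoint without port> -c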
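Endpoints are often copied as host:port, while the commands above expect the host alone. The helper below is a small hypothetical sketch that strips the port and assembles the same redis-cli arguments; `buildRedisCliArgs` and the example endpoint are illustrative, not part of the repository.

```typescript
// Hypothetical helper: builds the redis-cli arguments used above from an
// endpoint that may include a ":port" suffix.
function buildRedisCliArgs(endpoint: string, useTls: boolean, authToken?: string): string[] {
    const colon = endpoint.lastIndexOf(':');
    const host = colon === -1 ? endpoint : endpoint.substring(0, colon);
    const args = ['-h', host, '-c']; // -c enables cluster mode
    if (useTls) {
        args.push('--tls');
    }
    if (authToken) {
        args.push('-a', authToken);
    }
    return args;
}

// Example with a made-up endpoint and token.
console.log(buildRedisCliArgs('cache.example.internal:6379', true, 'my-token').join(' '));
// -h cache.example.internal -c --tls -a my-token
```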

Common CloudWatch Metrics to Monitor¶

EngineCPUUtilization: Should stay below target (default: 70%)
DatabaseMemoryUsageCountedForEvictPercentage: Should stay below target (default: 75%)
CurrConnections: Monitor for unexpected connection patterns
ReplicationLag: Should be minimal in normal operations
GetTypeCmds and SetTypeCmds: Monitor throughput
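During load tests it can be convenient to compare sampled values against the targets above in one pass. The following is a minimal sketch under the assumption that metric values have already been fetched as percentages; `metricTargetsSketch` and `findBreaches` are hypothetical names, and the thresholds are the defaults listed above.

```typescript
// Default targets from the list above (percent).
const metricTargetsSketch: { [metric: string]: number } = {
    EngineCPUUtilization: 70,
    DatabaseMemoryUsageCountedForEvictPercentage: 75,
};

// Returns a message for every sampled metric that exceeds its target.
function findBreaches(latest: { [metric: string]: number }): string[] {
    const breaches: string[] = [];
    for (const metric of Object.keys(metricTargetsSketch)) {
        const target = metricTargetsSketch[metric];
        const value = latest[metric];
        if (value !== undefined && target !== undefined && value > target) {
            breaches.push(`${metric}: ${value}% exceeds target ${target}%`);
        }
    }
    return breaches;
}

console.log(findBreaches({ EngineCPUUtilization: 82, DatabaseMemoryUsageCountedForEvictPercentage: 40 }));
// [ 'EngineCPUUtilization: 82% exceeds target 70%' ]
```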

Troubleshooting¶

Common Issues¶

Deployment Failures
- Issue: Stack creation fails due to parameter errors
- Solution: Verify all required parameters are correctly set in configurations
Connectivity Issues
- Issue: Services can’t connect to the cluster
- Solution: Check security groups, subnet routing, and network ACLs
Authentication Errors
- Issue: AUTH failures when connecting
- Solution: Verify auth token in Secrets Manager and client configurations
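Cluster Mode Errors
- Issue: Commands fail with “MOVED” or “CROSSSLOT” errors
- Solution: For “MOVED”, ensure the client is using a cluster-aware driver that follows slot redirects. For “CROSSSLOT”, ensure all keys in a multi-key command map to the same hash slot, for example by giving them a shared hash tag such as {user1}:profile and {user1}:session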
Performance Issues
- Issue: High latency or throughput limitations
- Solution: Check instance type, shard count, and connection patterns
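When debugging “CROSSSLOT” errors, it helps to know which slot each key maps to. In cluster mode the slot is CRC16(key) mod 16384, and if the key contains a non-empty {...} section, only the text inside the first such section is hashed. The sketch below reimplements that rule for illustration; it assumes ASCII keys for brevity, and `crc16Xmodem` and `hashSlot` are names chosen here, not library functions.

```typescript
// CRC16 (XMODEM variant: polynomial 0x1021, initial value 0), as used for cluster key slots.
function crc16Xmodem(text: string): number {
    let crc = 0;
    for (let i = 0; i < text.length; i++) {
        const byte = text.charCodeAt(i);
        if (byte > 127) {
            throw new Error('this sketch only supports ASCII keys');
        }
        crc ^= byte << 8;
        for (let bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
        }
    }
    return crc;
}

// Computes the cluster hash slot, honoring the first non-empty {hash tag}.
function hashSlot(key: string): number {
    let hashed = key;
    const open = key.indexOf('{');
    if (open !== -1) {
        const close = key.indexOf('}', open + 1);
        if (close > open + 1) {
            hashed = key.substring(open + 1, close);
        }
    }
    return crc16Xmodem(hashed) % 16384;
}

console.log(hashSlot('foo')); // 12182
console.log(hashSlot('{user1}:profile') === hashSlot('{user1}:session')); // true
```

Keys that share a hash tag land in the same slot, so multi-key commands over them do not raise “CROSSSLOT”. The trade-off is that a heavily used tag concentrates load on one shard.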

Support Escalation¶

If you encounter issues that can’t be resolved:

Check the internal Cache Oncall Run-book documentation.
Open a service troubleshooting ticket in the AWS Support console.
Contact the platform team responsible for ElastiCache infrastructure. Slack: https://amazon.enterprise.slack.com/archives/C018P0RTJ1W
