ElastiCache CDK Runbook
This runbook provides detailed instructions for creating new ElastiCache clusters using the CDK infrastructure. It’s designed for oncall engineers who need to add or modify ElastiCache resources.
Architecture Overview
Our ElastiCache infrastructure uses AWS CDK to define and deploy Redis/Valkey clusters. The architecture consists of:
ElastiCacheStack: Main stack that creates all required ElastiCache resources
NestedElasticacheStack: Creates individual cache clusters with specific configurations
CacheNetwork: Sets up networking resources (subnet groups, security groups)
CacheAuthToken: Manages authentication for clusters
Currently deployed clusters:
auth-valkey: Used for authentication services
cloudauth-valkey: Used for cloud authentication services
ratelimit-valkey: Used for rate limiting functionality
Pre-requisites
Before creating new ElastiCache clusters, ensure you have:
Calculated the cache memory and network utilization. Ref: https://quip-amazon.com/JdIeAXIPWlbg/Valkey-Caches-node-calculation
VPC IDs for the target environments (check vpc-config.ts)
Understanding of the specific requirements for the new cluster
Step-by-Step Guide
1. Define Cluster Configuration
First, add your new cluster configuration in cdk/lib/elasticache/cacheConfig.ts:
// For production environments
const defaultProdConfigs: { [key: string]: ClusterReplicationGroupConfig } = {
  // Existing configurations...
  [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
    MinReplicasCount: 3, // Adjust based on needs
    MinShardsCount: 2, // Adjust based on needs
    MaxShardsCount: 10, // Maximum capacity for scaling
    NodeType: NodeType.M7G_XLARGE, // Choose appropriate node type
    isClustered: true, // true to enable cluster mode (sharding)
    family: CacheFamily.VALKEY8, // Choose valkey8 or redis7
    evictionType: EvictionType.VOLATILE_LRU, // Memory management policy
    isTransitEnabled: true, // Enable in-transit encryption
    isRestEnabled: true, // Enable at-rest encryption
    isAuthEnabled: true, // Enable authentication
    timeout: '3600', // Idle client connection timeout in seconds
    engine: 'valkey' // 'valkey' or 'redis'
  },
};
// For development environments
export const cacheClusterConfigDev: { [key: string]: ClusterReplicationGroupConfig } = {
  // Existing configurations...
  [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
    ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
    MinReplicasCount: 1,
    MinShardsCount: 1,
  }
};
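The fields used in the example above imply a config shape roughly like the following. This is a sketch reconstructed from that example, not the actual definition in cacheInterface.ts: the plain string types for NodeType, family, and evictionType stand in for the real enums, and validateConfig is a hypothetical helper for catching common mistakes before a deployment.

```typescript
// Hypothetical reconstruction: field names come from the example above,
// but the real types in cacheInterface.ts may differ.
interface ClusterReplicationGroupConfig {
  MinReplicasCount: number;
  MinShardsCount: number;
  MaxShardsCount: number;
  NodeType: string; // e.g. 'cache.m7g.xlarge' (the real code uses an enum)
  isClustered: boolean;
  family: string; // e.g. 'valkey8' or 'redis7'
  evictionType: string; // e.g. 'volatile-lru'
  isTransitEnabled: boolean;
  isRestEnabled: boolean;
  isAuthEnabled: boolean;
  timeout: string;
  engine: 'valkey' | 'redis';
}

// Returns a list of human-readable problems; an empty list means no issue was found.
function validateConfig(c: ClusterReplicationGroupConfig): string[] {
  const problems: string[] = [];
  if (c.MinShardsCount < 1) {
    problems.push('MinShardsCount must be at least 1');
  }
  if (c.MaxShardsCount < c.MinShardsCount) {
    problems.push('MaxShardsCount must be >= MinShardsCount');
  }
  if (!c.isClustered && c.MinShardsCount > 1) {
    problems.push('cluster mode disabled supports a single shard only');
  }
  if (c.isAuthEnabled && !c.isTransitEnabled) {
    problems.push('auth tokens require in-transit encryption');
  }
  if (!/^\d+$/.test(c.timeout)) {
    problems.push('timeout must be a non-negative integer string');
  }
  return problems;
}
```

Running a check like this in a unit test is cheaper than discovering the same mistake as a failed stack deployment.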
2. Add Your Cluster to the ReplicationGroupName Enum
In cdk/lib/elasticache/cacheInterface.ts, add your new cluster name to the enum:
export enum ReplicationGroupName {
  RATELIMIT = 'ratelimit-valkey',
  AUTH = 'auth-valkey',
  CLOUDAUTH = 'cloudauth-valkey',
  YOUR_NEW_CLUSTER = 'your-new-cluster-valkey', // Add your new cluster
}
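Because the enum and the config maps live in separate files, it is easy to add a name without a matching configuration. A minimal sketch of a consistency check follows; findMissingConfigs is a hypothetical helper, and the enum is repeated here only to keep the example self-contained.

```typescript
enum ReplicationGroupName {
  RATELIMIT = 'ratelimit-valkey',
  AUTH = 'auth-valkey',
  CLOUDAUTH = 'cloudauth-valkey',
  YOUR_NEW_CLUSTER = 'your-new-cluster-valkey',
}

// Hypothetical helper: returns every enum value that has no entry in `configs`.
function findMissingConfigs(configs: { [key: string]: unknown }): string[] {
  return Object.keys(ReplicationGroupName)
    .map((key) => ReplicationGroupName[key as keyof typeof ReplicationGroupName] as string)
    .filter((name) => !(name in configs));
}

// Example: a config map where the new cluster was forgotten.
const incompleteConfigs = {
  'ratelimit-valkey': {},
  'auth-valkey': {},
  'cloudauth-valkey': {},
};
console.log(findMissingConfigs(incompleteConfigs));
```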
3. Add Regional Configuration (if needed)
If your cluster requires region-specific configurations, update the cacheClusterConfigProd and cacheClusterAutoScalingConfigProd objects in cdk/lib/elasticache/cacheConfig.ts:
// Region-specific prod configurations
export const cacheClusterConfigProd: { [region: string]: { [key: string]: ClusterReplicationGroupConfig } } = {
  'us-east-1': {
    // Existing configurations...
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
      ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
      // Override specific parameters for this region if needed
    },
  },
  'us-west-2': {
    // Existing configurations...
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
      ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
      MinReplicasCount: 2, // Example region-specific override
      MinShardsCount: 2,
    },
  },
  'eu-west-1': {
    // Existing configurations...
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
      ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
    },
  },
};
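The spread-then-override pattern above can be illustrated with a reduced example. The sketch below assumes a lookup that fails loudly when a region or cluster has no entry; resolveConfig and the trimmed-down SizingConfig type are hypothetical and only model two of the real fields.

```typescript
// Trimmed-down stand-in for ClusterReplicationGroupConfig (assumption).
interface SizingConfig {
  MinReplicasCount: number;
  MinShardsCount: number;
}
type ConfigMap = { [name: string]: SizingConfig };

const defaults: ConfigMap = {
  'your-new-cluster-valkey': { MinReplicasCount: 3, MinShardsCount: 2 },
};

const byRegion: { [region: string]: ConfigMap } = {
  'us-east-1': {
    'your-new-cluster-valkey': { ...defaults['your-new-cluster-valkey'] },
  },
  'us-west-2': {
    // Later properties win, so MinReplicasCount overrides the spread default.
    'your-new-cluster-valkey': { ...defaults['your-new-cluster-valkey'], MinReplicasCount: 2 },
  },
};

// Hypothetical lookup: throws instead of silently deploying with undefined settings.
function resolveConfig(region: string, name: string): SizingConfig {
  const regional = byRegion[region];
  const config = regional ? regional[name] : undefined;
  if (!config) {
    throw new Error(`No cache config for ${name} in ${region}`);
  }
  return config;
}
```

Note that the order matters: placing the spread after the overrides would restore the default values.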
4. Update ElastiCacheStack to Include the New Cluster
In cdk/lib/stacks/ElastiCacheStack.ts, add your new cluster to the cacheClusters array:
// List of all cache clusters
const cacheClusters = [
  { name: 'auth-valkey' },
  { name: 'cloudauth-valkey' },
  { name: 'ratelimit-valkey' },
  { name: 'your-new-cluster-valkey' }, // Add your new cluster here
];
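If the hard-coded list and the enum drift apart, a cluster can be silently skipped. One option, assuming the stack only needs each cluster's name, is to derive the list from the enum. This is a sketch of an alternative, not how ElastiCacheStack.ts currently works; the enum is repeated only to keep the example self-contained.

```typescript
enum ReplicationGroupName {
  RATELIMIT = 'ratelimit-valkey',
  AUTH = 'auth-valkey',
  CLOUDAUTH = 'cloudauth-valkey',
  YOUR_NEW_CLUSTER = 'your-new-cluster-valkey',
}

// Build one { name } entry per enum member so a new member is picked up automatically.
const cacheClusters: { name: string }[] = Object.keys(ReplicationGroupName).map((key) => ({
  name: ReplicationGroupName[key as keyof typeof ReplicationGroupName],
}));

console.log(cacheClusters.map((cluster) => cluster.name).join(', '));
```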
5. Verify VPC Configuration
Ensure the required VPC IDs are available in cdk/lib/elasticache/vpc-config.ts. If adding a new region, update the VPC mappings:
export const VPC_MAPPING: VpcMapping = {
  FireFly: {
    dev: {
      "us-east-1": "vpc-0403836a59c5b52b6",
      "us-west-2": "vpc-0bf609a339d704999",
      "eu-west-1": "vpc-0406c3422f20e3e8d",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
    prod: {
      "us-east-1": "vpc-0236dce4046db1108",
      "us-west-2": "vpc-073872f3c36c9e31a",
      "eu-west-1": "vpc-08cf128db50ac9e8f",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
  },
};
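A missing VPC entry is easiest to diagnose when the lookup fails with a clear message. The sketch below assumes the mapping shape shown above; the VpcMapping type is reconstructed from that example, and getVpcId is a hypothetical helper rather than an existing function in vpc-config.ts.

```typescript
type Stage = 'dev' | 'prod';
// Reconstructed from the example above (assumption): project -> stage -> region -> VPC ID.
type VpcMapping = { [project: string]: Record<Stage, { [region: string]: string }> };

const VPC_MAPPING: VpcMapping = {
  FireFly: {
    dev: { 'us-east-1': 'vpc-0403836a59c5b52b6' },
    prod: { 'us-east-1': 'vpc-0236dce4046db1108' },
  },
};

// Hypothetical helper: returns the VPC ID or throws a descriptive error.
function getVpcId(mapping: VpcMapping, project: string, stage: Stage, region: string): string {
  const id = mapping[project]?.[stage]?.[region];
  if (!id) {
    throw new Error(`No VPC configured for ${project}/${stage}/${region}; update vpc-config.ts`);
  }
  return id;
}
```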
Configuration Options
Node Types
Choose the appropriate instance type based on workload requirements:
| Node Type | vCPU | Memory | Use Case |
|---|---|---|---|
| cache.m7g.large | 2 | 6.5 GB | General purpose |
| cache.m7g.xlarge | 4 | 13 GB | General purpose |
| cache.r7g.large | 2 | 13.1 GB | Memory optimized |
| cache.r7g.xlarge | 4 | 26.3 GB | Memory optimized |
| cache.r7g.2xlarge | 8 | 52.5 GB | Memory optimized |
| cache.r7g.16xlarge | 64 | 420.0 GB | Memory optimized |
| cache.c7gn.large | 2 | 3.2 GB | Compute optimized, network optimized |
| cache.c7gn.2xlarge | 8 | 12.8 GB | Compute optimized, network optimized |
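As a rough illustration of how node memory translates into a shard count, the sketch below divides the dataset size by the usable memory per node. The 25% reserved memory and the 75% fill target are assumptions chosen for the example, and estimateShards is a hypothetical helper; the node calculation document linked in the pre-requisites remains the authoritative method.

```typescript
// Rough shard estimate. Both default percentages are assumptions for illustration.
function estimateShards(
  datasetGb: number,
  nodeMemoryGb: number,
  reservedPercent: number = 25, // memory held back for overhead such as failover and backups
  targetFill: number = 0.75, // aim to stay below this fraction of the remaining memory
): number {
  const usablePerNodeGb = nodeMemoryGb * (1 - reservedPercent / 100) * targetFill;
  return Math.max(1, Math.ceil(datasetGb / usablePerNodeGb));
}

// 40 GB of data on cache.r7g.xlarge (26.3 GB): about 14.8 GB usable per node.
console.log(estimateShards(40, 26.3)); // 3
```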
Eviction Policies
| Policy | Description | Use Case |
|---|---|---|
| volatile-lru | Evict keys with TTL using LRU | General purpose, when TTLs are set |
| allkeys-lru | Evict any key using LRU | When you want to use Redis as LRU cache |
| volatile-lfu | Evict keys with TTL using LFU | Better than LRU when access frequency varies |
| allkeys-lfu | Evict any key using LFU | When some items are accessed more frequently |
| volatile-random | Evict random keys with TTL | When uniform distribution is desired |
| volatile-ttl | Evict keys with shortest TTL | When shorter TTL indicates lower value |
| noeviction | Return errors when memory is full | When data loss is not acceptable |
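The main difference between the policy families is which keys are eligible for eviction at all. The sketch below models only that eligibility rule, not the LRU, LFU, or random selection that the engine applies afterwards; KeyInfo and evictionCandidates are hypothetical names used for illustration.

```typescript
type EvictionPolicy =
  | 'volatile-lru' | 'allkeys-lru' | 'volatile-lfu' | 'allkeys-lfu'
  | 'volatile-random' | 'volatile-ttl' | 'noeviction';

interface KeyInfo {
  name: string;
  ttlSeconds?: number; // undefined means the key never expires
}

// Returns the names of keys the engine is allowed to evict under the given policy.
function evictionCandidates(policy: EvictionPolicy, keys: KeyInfo[]): string[] {
  if (policy === 'noeviction') {
    return [];
  }
  const volatileOnly = policy.indexOf('volatile-') === 0;
  const pool = volatileOnly ? keys.filter((k) => k.ttlSeconds !== undefined) : keys;
  return pool.map((k) => k.name);
}

const sampleKeys: KeyInfo[] = [{ name: 'session:1', ttlSeconds: 60 }, { name: 'config' }];
console.log(evictionCandidates('volatile-lru', sampleKeys)); // [ 'session:1' ]
console.log(evictionCandidates('allkeys-lru', sampleKeys)); // [ 'session:1', 'config' ]
```

One consequence: with a volatile-* policy and no keys that carry a TTL, there is nothing to evict, so writes fail when memory is full just as they do under noeviction.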
Regional Considerations
Each region may require different configurations based on:
Traffic patterns: Higher traffic regions may need more shards and replicas
Data residency requirements: Some clusters may need specific encryption settings
Performance requirements: Response time SLAs might dictate instance types
Cost optimization: Lower traffic regions can use smaller/fewer instances
Currently supported regions:
us-east-1 (N. Virginia)
us-west-2 (Oregon)
eu-west-1 (Ireland)
To add a new region, update both the VPC configuration and region-specific cache settings.
Deployment Process
Development Environment
Deploy to the dev environment first to validate the configuration:
cd cdk
yarn deployElastiCache:iad
yarn deployElastiCache:pdx
yarn deployElastiCache:dub
Production Environment
After testing in dev, submit an MR:
Include [cache-deployment] in the commit message to trigger the dedicated cache pipeline. Ref: https://quip-amazon.com/61r8AblahGLI/Metalfly-Cache-Pipeline-Setup
For details, refer to the CI/CD pipelines defined in the pipelines/ directory.
Testing and Validation
After deployment, verify that:
Cluster is accessible: Test connectivity from appropriate services
Authentication works: If auth is enabled, verify credentials work
Performance meets expectations: Check throughput and latency
Autoscaling functions correctly: Monitor CloudWatch metrics during load tests
Connectivity Testing
You can test connectivity using Redis CLI:
# With auth
redis-cli -h <endpoint without port> -c -a <auth-token> --tls
# Without auth
redis-cli -h <endpoint without port> -c
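When the connectivity check is scripted, the same flags can be assembled programmatically. The sketch below strips a trailing port from the endpoint and builds the argument list for the two commands above; stripPort, buildRedisCliArgs, and the sample endpoint are hypothetical.

```typescript
interface ConnectOptions {
  endpoint: string; // may include a trailing ":port"
  authToken?: string; // when set, TLS is enabled as in the "With auth" command
}

// Removes a trailing ":<port>" so the value can be passed to -h.
function stripPort(endpoint: string): string {
  return endpoint.replace(/:\d+$/, '');
}

// Builds the redis-cli arguments: -c for cluster mode, -a and --tls when auth is used.
function buildRedisCliArgs(options: ConnectOptions): string[] {
  const args = ['-h', stripPort(options.endpoint), '-c'];
  if (options.authToken) {
    args.push('-a', options.authToken, '--tls');
  }
  return args;
}

// Hypothetical endpoint, for illustration only.
console.log(buildRedisCliArgs({ endpoint: 'example-cache.internal:6379' }).join(' '));
```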
Common CloudWatch Metrics to Monitor
EngineCPUUtilization: Should stay below target (default: 70%)
DatabaseMemoryUsageCountedForEvictPercentage: Should stay below target (default: 75%)
CurrConnections: Monitor for unexpected connection patterns
ReplicationLag: Should be minimal in normal operations
GetTypeCmds and SetTypeCmds: Monitor throughput
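The two percentage targets above lend themselves to a simple automated check during load tests. The sketch below compares the latest datapoints against those targets; findBreaches is a hypothetical helper, and fetching the datapoints from CloudWatch is left out.

```typescript
// Targets taken from the list above; adjust if the autoscaling config uses other values.
const metricTargets: { [metric: string]: number } = {
  EngineCPUUtilization: 70,
  DatabaseMemoryUsageCountedForEvictPercentage: 75,
};

// Returns the names of metrics whose latest value exceeds its target.
function findBreaches(latest: { [metric: string]: number }): string[] {
  return Object.keys(metricTargets).filter(
    (metric) => latest[metric] !== undefined && latest[metric] > metricTargets[metric],
  );
}

console.log(findBreaches({
  EngineCPUUtilization: 82,
  DatabaseMemoryUsageCountedForEvictPercentage: 60,
})); // [ 'EngineCPUUtilization' ]
```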
Troubleshooting
Common Issues
Deployment Failures
Issue: Stack creation fails due to parameter errors
Solution: Verify all required parameters are correctly set in configurations
Connectivity Issues
Issue: Services can’t connect to the cluster
Solution: Check security groups, subnet routing, and network ACLs
Authentication Errors
Issue: AUTH failures when connecting
Solution: Verify auth token in Secrets Manager and client configurations
Cluster Mode Errors
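Issue: Commands fail with “CROSSSLOT” errors
Solution: Ensure the client uses a cluster-aware driver, and that multi-key commands only touch keys in the same hash slot (for example, by using hash tags such as {user1}:cart and {user1}:profile)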
Performance Issues
Issue: High latency or throughput limitations
Solution: Check instance type, shard count, and connection patterns
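For the CROSSSLOT issue above, it helps to see how a cluster maps keys to slots: the key (or the part inside the first non-empty {...} hash tag) is hashed with CRC16 and taken modulo 16384, and a multi-key command is rejected when its keys land in different slots. The sketch below follows that documented scheme; it is meant for debugging key names, not as a replacement for a cluster-aware client.

```typescript
// CRC16 (XMODEM variant: polynomial 0x1021, initial value 0), as used for cluster key hashing.
function crc16(bytes: Uint8Array): number {
  let crc = 0;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i] << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// Computes the hash slot (0..16383) for a key, honouring hash tags.
function hashSlot(key: string): number {
  let hashed = key;
  const open = key.indexOf('{');
  if (open !== -1) {
    const close = key.indexOf('}', open + 1);
    // Only a non-empty tag counts; "{}" means the whole key is hashed.
    if (close > open + 1) {
      hashed = key.slice(open + 1, close);
    }
  }
  return crc16(new TextEncoder().encode(hashed)) % 16384;
}

// Keys that share a hash tag always share a slot, so multi-key commands on them are allowed.
console.log(hashSlot('{user1}:cart') === hashSlot('{user1}:profile')); // true
```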
Support Escalation
If you encounter issues that can’t be resolved:
Check the internal Cache Oncall Run-book documentation.
Open a service troubleshooting ticket in the AWS Support console.
Contact the platform team responsible for ElastiCache infrastructure. Slack: https://amazon.enterprise.slack.com/archives/C018P0RTJ1W
