ElastiCache CDK Runbook

This runbook provides detailed instructions for creating new ElastiCache clusters using the CDK infrastructure. It’s designed for oncall engineers who need to add or modify ElastiCache resources.

Table of Contents

  1. Architecture Overview

  2. Pre-requisites

  3. Step-by-Step Guide

  4. Configuration Options

  5. Regional Considerations

  6. Deployment Process

  7. Testing and Validation

  8. Troubleshooting

Architecture Overview

Our ElastiCache infrastructure uses AWS CDK to define and deploy Redis/Valkey clusters. The architecture consists of:

  • ElastiCacheStack: Main stack that creates all required ElastiCache resources

  • NestedElasticacheStack: Creates individual cache clusters with specific configurations

  • CacheNetwork: Sets up networking resources (subnet groups, security groups)

  • CacheAuthToken: Manages authentication for clusters

Currently deployed clusters:

  • auth-valkey: Used for authentication services

  • cloudauth-valkey: Used for cloud authentication services

  • ratelimit-valkey: Used for rate limiting functionality

Pre-requisites

Before creating new ElastiCache clusters, ensure you have:

  1. Calculated the cache memory and network utilization. Ref: https://quip-amazon.com/JdIeAXIPWlbg/Valkey-Caches-node-calculation

  2. VPC IDs for the target environments (check vpc-config.ts)

  3. Understanding of the specific requirements for the new cluster

Step-by-Step Guide

1. Define Cluster Configuration

First, add your new cluster configuration in cdk/lib/elasticache/cacheConfig.ts. The ReplicationGroupName.YOUR_NEW_CLUSTER key used here is the enum member you add in step 2:

// For production environments
const defaultProdConfigs: { [key: string]: ClusterReplicationGroupConfig } = {
    // Existing configurations...
    
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
        MinReplicasCount: 3, // Adjust based on needs
        MinShardsCount: 2,   // Adjust based on needs
        MaxShardsCount: 10,  // Maximum capacity for scaling
        NodeType: NodeType.M7G_XLARGE, // Choose appropriate node type
        isClustered: true,   // true to enable cluster mode (data sharded across multiple shards)
        family: CacheFamily.VALKEY8, // Choose valkey8 or redis7
        evictionType: EvictionType.VOLATILE_LRU, // Memory management policy
        isTransitEnabled: true,  // Enable in-transit encryption
        isRestEnabled: true,     // Enable at-rest encryption
        isAuthEnabled: true,     // Enable authentication
        timeout: '3600',         // Idle client connection timeout in seconds
        engine: 'valkey'         // 'valkey' or 'redis'
    },
};

// For development environments
export const cacheClusterConfigDev: { [key: string]: ClusterReplicationGroupConfig } = {
    // Existing configurations...
    
    [ReplicationGroupName.YOUR_NEW_CLUSTER]: {
        ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
        MinReplicasCount: 1,
        MinShardsCount: 1,
    }
};
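Before committing a new entry, a quick sanity check can catch mistakes that would otherwise surface only at deploy time. The sketch below is illustrative only: ClusterConfigSketch and validateClusterConfig are hypothetical names that mirror a subset of the fields above, not the real ClusterReplicationGroupConfig type. It relies on two ElastiCache rules: an AUTH token requires in-transit encryption, and a cluster with cluster mode disabled has exactly one shard.

```typescript
// Hypothetical subset of the fields shown above; the real type lives in the repo.
interface ClusterConfigSketch {
    MinReplicasCount: number;
    MinShardsCount: number;
    MaxShardsCount: number;
    isClustered: boolean;
    isTransitEnabled: boolean;
    isAuthEnabled: boolean;
}

// Returns a list of human-readable problems; an empty list means nothing was flagged.
function validateClusterConfig(cfg: ClusterConfigSketch): string[] {
    const problems: string[] = [];
    if (cfg.MinShardsCount < 1) {
        problems.push('MinShardsCount must be at least 1');
    }
    if (cfg.MinShardsCount > cfg.MaxShardsCount) {
        problems.push('MinShardsCount exceeds MaxShardsCount');
    }
    if (!cfg.isClustered && cfg.MinShardsCount > 1) {
        problems.push('more than one shard requires isClustered: true');
    }
    if (cfg.isAuthEnabled && !cfg.isTransitEnabled) {
        problems.push('isAuthEnabled requires isTransitEnabled');
    }
    return problems;
}
```

Running a check like this in a unit test over every entry in the config maps would flag an inconsistent combination before a stack update is attempted.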

2. Add Your Cluster to the ReplicationGroupName Enum

In cdk/lib/elasticache/cacheInterface.ts, add your new cluster name to the enum:

export enum ReplicationGroupName {
    RATELIMIT = 'ratelimit-valkey',
    AUTH = 'auth-valkey',
    CLOUDAUTH = 'cloudauth-valkey',
    YOUR_NEW_CLUSTER = 'your-new-cluster-valkey', // Add your new cluster
}
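The enum value becomes part of the cluster's name, so it is worth checking it against the ElastiCache replication group ID rules (1 to 40 characters, letters, digits and hyphens only, starting with a letter, no trailing hyphen, no consecutive hyphens). The helper below is a minimal sketch with a hypothetical name. It assumes the enum value is used as the ID unchanged; if the stack adds a prefix or suffix, the length budget is smaller.

```typescript
// Checks a candidate name against the ElastiCache replication group ID constraints.
// Assumption: the enum value is used verbatim as the replication group ID.
function isValidReplicationGroupId(id: string): boolean {
    if (id.length < 1 || id.length > 40) return false;          // length limit
    if (!/^[a-zA-Z][a-zA-Z0-9-]*$/.test(id)) return false;      // charset, leading letter
    if (/-$/.test(id) || id.indexOf('--') !== -1) return false; // hyphen placement
    return true;
}
```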

3. Add Regional Configuration (if needed)

If your cluster requires region-specific configurations, update the cacheClusterConfigProd and cacheClusterAutoScalingConfigProd objects in cdk/lib/elasticache/cacheConfig.ts:

// Region-specific prod configurations
export const cacheClusterConfigProd: { [region: string]: { [key: string]: ClusterReplicationGroupConfig } } = {
    'us-east-1': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
            // Override specific parameters for this region if needed
        },
    },
    'us-west-2': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
            MinReplicasCount: 2,  // Example region-specific override
            MinShardsCount: 2,
        },
    },
    'eu-west-1': {
        // Existing configurations...
        [ReplicationGroupName.YOUR_NEW_CLUSTER]: { 
            ...defaultProdConfigs[ReplicationGroupName.YOUR_NEW_CLUSTER],
        },
    },
};
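The regional entries rely on object spread, where keys written after the spread override the defaults. The sketch below illustrates that behaviour together with one possible lookup that falls back to the defaults when a region has no entry. ConfigSketch and resolveProdConfig are hypothetical names, and the fallback is an assumption for illustration; the real stack may instead require an explicit entry per region.

```typescript
// Hypothetical, reduced config shape used only for this illustration.
type ConfigSketch = { MinReplicasCount: number; MinShardsCount: number };
type ClusterConfigs = { [cluster: string]: ConfigSketch };
type RegionalConfigs = { [region: string]: ClusterConfigs };

// Returns the region-specific config if present, otherwise the default one.
function resolveProdConfig(
    regional: RegionalConfigs,
    defaults: ClusterConfigs,
    region: string,
    cluster: string
): ConfigSketch {
    const regionEntry = regional[region];
    const resolved = (regionEntry && regionEntry[cluster]) || defaults[cluster];
    if (!resolved) {
        throw new Error('No config for ' + cluster + ' in ' + region);
    }
    return resolved;
}

const sketchDefaults: ClusterConfigs = {
    'your-new-cluster-valkey': { MinReplicasCount: 3, MinShardsCount: 2 },
};
const sketchRegional: RegionalConfigs = {
    'us-west-2': {
        'your-new-cluster-valkey': {
            ...sketchDefaults['your-new-cluster-valkey'],
            MinReplicasCount: 2, // written after the spread, so it overrides the default of 3
        },
    },
};
```

Keys placed before the spread would be overwritten by the defaults, so overrides should always follow the spread line.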

4. Update ElastiCacheStack to Include the New Cluster

In cdk/lib/stacks/ElastiCacheStack.ts, add your new cluster to the cacheClusters array:

// List of all cache clusters
const cacheClusters = [
    { name: 'auth-valkey' },
    { name: 'cloudauth-valkey' },
    { name: 'ratelimit-valkey' },
    { name: 'your-new-cluster-valkey' }, // Add your new cluster here
];
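Because the cluster name is repeated as a string here and in the ReplicationGroupName enum, the two lists can drift apart. A small check along these lines could catch a cluster that was added to the enum but not to the stack. This is a sketch; findUnregisteredClusters is a hypothetical helper, not something that exists in the repo.

```typescript
// Returns enum values that have no matching entry in the cacheClusters array.
function findUnregisteredClusters(
    enumValues: string[],
    clusters: { name: string }[]
): string[] {
    const registered = clusters.map((c) => c.name);
    return enumValues.filter((value) => registered.indexOf(value) === -1);
}

// Example: the new cluster was added to the enum but not to the array.
const missingClusters = findUnregisteredClusters(
    ['ratelimit-valkey', 'auth-valkey', 'cloudauth-valkey', 'your-new-cluster-valkey'],
    [{ name: 'auth-valkey' }, { name: 'cloudauth-valkey' }, { name: 'ratelimit-valkey' }]
);
// missingClusters contains only 'your-new-cluster-valkey'
```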

5. Verify VPC Configuration

Ensure the required VPC IDs are available in cdk/lib/elasticache/vpc-config.ts. If adding a new region, update the VPC mappings:

export const VPC_MAPPING: VpcMapping = {
  FireFly: {
    dev: {
      "us-east-1": "vpc-0403836a59c5b52b6",
      "us-west-2": "vpc-0bf609a339d704999",
      "eu-west-1": "vpc-0406c3422f20e3e8d",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
    prod: {
      "us-east-1": "vpc-0236dce4046db1108",
      "us-west-2": "vpc-073872f3c36c9e31a",
      "eu-west-1": "vpc-08cf128db50ac9e8f",
      "new-region": "vpc-xxxxxxxxxxxxxxxxx", // Add new region if needed
    },
  },
};
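A lookup that fails loudly on a missing region or a leftover placeholder makes VPC mistakes visible at synth time rather than during deployment. The sketch below assumes a project/stage/region nesting like the one shown above; VpcMappingSketch and lookupVpcId are hypothetical names, and the pattern accepts the two VPC ID lengths AWS issues (8 or 17 hex characters).

```typescript
// Hypothetical shape mirroring the nesting of VPC_MAPPING above.
type VpcMappingSketch = {
    [project: string]: { [stage: string]: { [region: string]: string } };
};

// Returns the VPC ID, or throws if it is missing or not a well-formed ID.
function lookupVpcId(
    mapping: VpcMappingSketch,
    project: string,
    stage: string,
    region: string
): string {
    const vpcId = mapping[project]?.[stage]?.[region];
    if (!vpcId) {
        throw new Error('No VPC configured for ' + project + '/' + stage + '/' + region);
    }
    // Rejects placeholders such as vpc-xxxxxxxxxxxxxxxxx.
    if (!/^vpc-([0-9a-f]{8}|[0-9a-f]{17})$/.test(vpcId)) {
        throw new Error('Malformed or placeholder VPC ID: ' + vpcId);
    }
    return vpcId;
}
```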

Configuration Options

Node Types

Choose the appropriate instance type based on workload requirements:

Node Type            vCPU   Memory     Use Case
cache.m7g.large      2      6.5 GB     General purpose
cache.m7g.xlarge     4      13 GB      General purpose
cache.r7g.large      2      13.1 GB    Memory optimized
cache.r7g.xlarge     4      26.3 GB    Memory optimized
cache.r7g.2xlarge    8      52.5 GB    Memory optimized
cache.r7g.16xlarge   64     420.0 GB   Memory optimized
cache.c7gn.large     2      3.2 GB     Compute optimized, network optimized
cache.c7gn.2xlarge   8      12.8 GB    Compute optimized, network optimized
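As a rough first pass at sizing, the shard count can be estimated from the dataset size and the per-node memory in the table. The sketch below is not the team's sizing method (use the node calculation document linked in the pre-requisites for that). It assumes 25% of node memory is reserved for overhead and that memory usage should stay under a 75% fill target; both fractions are assumptions exposed as parameters, and estimateShards is a hypothetical name.

```typescript
// Estimates the number of shards needed to hold a dataset.
// reservedFraction and targetFill are assumptions; adjust to your parameter group and alarms.
function estimateShards(
    datasetGb: number,
    nodeMemoryGb: number,
    reservedFraction: number = 0.25,
    targetFill: number = 0.75
): number {
    const usablePerShardGb = nodeMemoryGb * (1 - reservedFraction) * targetFill;
    return Math.max(1, Math.ceil(datasetGb / usablePerShardGb));
}
```

For example, a 40 GB dataset on cache.m7g.xlarge (13 GB) gives about 7.3 GB usable per shard under these assumptions, so estimateShards(40, 13) returns 6. Network throughput and CPU can be the binding constraint instead of memory, which this estimate does not cover.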

Eviction Policies

Policy            Description                         Use Case
volatile-lru      Evict keys with TTL using LRU       General purpose, when TTLs are set
allkeys-lru       Evict any key using LRU             When you want to use Redis as LRU cache
volatile-lfu      Evict keys with TTL using LFU       Better than LRU when access frequency varies
allkeys-lfu       Evict any key using LFU             When some items are accessed more frequently
volatile-random   Evict random keys with TTL          When uniform distribution is desired
volatile-ttl      Evict keys with shortest TTL        When shorter TTL indicates lower value
noeviction        Return errors when memory is full   When data loss is not acceptable
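One consequence of the table is worth spelling out: the volatile-* policies only consider keys that have a TTL, so a cluster whose keys never expire behaves like noeviction under memory pressure and writes start failing. The sketch below encodes that rule; the EvictionPolicy type and canEvict are illustrative names, not the repo's EvictionType enum.

```typescript
// The policies listed in the table above, as a string union for illustration.
type EvictionPolicy =
    | 'volatile-lru' | 'allkeys-lru' | 'volatile-lfu' | 'allkeys-lfu'
    | 'volatile-random' | 'volatile-ttl' | 'noeviction';

// Returns true if the policy has any candidate key it could evict when memory is full.
function canEvict(policy: EvictionPolicy, keysWithTtl: number, totalKeys: number): boolean {
    if (policy === 'noeviction') return false;
    if (policy.indexOf('volatile-') === 0) return keysWithTtl > 0; // only TTL keys qualify
    return totalKeys > 0;                                          // allkeys-* policies
}
```

If a workload writes keys without a TTL, an allkeys-* policy is usually the safer choice for a pure cache.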

Regional Considerations

Each region may require different configurations based on:

  1. Traffic patterns: Higher traffic regions may need more shards and replicas

  2. Data residency requirements: Some clusters may need specific encryption settings

  3. Performance requirements: Response time SLAs might dictate instance types

  4. Cost optimization: Lower traffic regions can use smaller/fewer instances

Current supported regions:

  • us-east-1 (N. Virginia)

  • us-west-2 (Oregon)

  • eu-west-1 (Ireland)

To add a new region, update both the VPC configuration and region-specific cache settings.

Deployment Process

Development Environment

Deploy to the dev environment first to validate the configuration:

cd cdk
yarn deployElastiCache:iad   # us-east-1 (N. Virginia)
yarn deployElastiCache:pdx   # us-west-2 (Oregon)
yarn deployElastiCache:dub   # eu-west-1 (Ireland)

Production Environment

After testing in dev, submit an MR:

  1. Include [cache-deployment] in the commit message to trigger the dedicated cache pipeline. Ref: https://quip-amazon.com/61r8AblahGLI/Metalfly-Cache-Pipeline-Setup

For details, refer to the CI/CD pipelines defined in the pipelines/ directory.
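A missing tag means the cache pipeline never starts, which is easy to overlook. A pre-push check could be as small as the sketch below. It assumes the pipeline matches the literal, case-sensitive string [cache-deployment] anywhere in the message; the actual matching rule is defined by the pipeline setup and may differ. triggersCachePipeline is a hypothetical helper.

```typescript
// Assumed trigger string, taken from the step above.
const CACHE_DEPLOY_TAG = '[cache-deployment]';

// Returns true if the commit message contains the trigger tag (exact, case-sensitive).
function triggersCachePipeline(commitMessage: string): boolean {
    return commitMessage.indexOf(CACHE_DEPLOY_TAG) !== -1;
}
```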

Testing and Validation

After deployment, verify that:

  1. Cluster is accessible: Test connectivity from appropriate services

  2. Authentication works: If auth is enabled, verify credentials work

  3. Performance meets expectations: Check throughput and latency

  4. Autoscaling functions correctly: Monitor CloudWatch metrics during load tests

Connectivity Testing

You can test connectivity using Redis CLI:

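# With auth and in-transit encryption (-c enables cluster mode, following MOVED/ASK redirects)
redis-cli -h <endpoint without port> -c -a <auth-token> --tls

# Without auth or in-transit encryption
redis-cli -h <endpoint without port> -c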
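Which flags are needed follows from the cluster's configuration: --tls when in-transit encryption is enabled and -a when auth is enabled. The sketch below builds the argument list for a one-shot PING from those two settings. ConnectionOptions and buildRedisCliArgs are hypothetical names, and appending PING (instead of opening an interactive session) is a choice made for this illustration.

```typescript
// Hypothetical options mirroring isTransitEnabled / isAuthEnabled from the cluster config.
interface ConnectionOptions {
    host: string;       // endpoint without port
    tls: boolean;       // true when in-transit encryption is enabled
    authToken?: string; // set only when auth is enabled
}

// Builds redis-cli arguments for a single PING against a cluster-mode endpoint.
function buildRedisCliArgs(opts: ConnectionOptions): string[] {
    const args = ['-h', opts.host, '-c'];
    if (opts.tls) {
        args.push('--tls');
    }
    if (opts.authToken !== undefined) {
        args.push('-a', opts.authToken);
    }
    args.push('PING');
    return args;
}
```

A healthy node answers PING with PONG. redis-cli also reads the password from the REDISCLI_AUTH environment variable, which keeps the token out of shell history and process listings.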

Common CloudWatch Metrics to Monitor

  • EngineCPUUtilization: Should stay below target (default: 70%)

  • DatabaseMemoryUsageCountedForEvictPercentage: Should stay below target (default: 75%)

  • CurrConnections: Monitor for unexpected connection patterns

  • ReplicationLag: Should be minimal in normal operations

  • GetTypeCmds and SetTypeCmds: Monitor throughput
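The two percentage targets above can be turned into a simple pass/fail check when reviewing load-test results. The sketch below compares sampled values against the defaults listed in this section. METRIC_TARGETS and findBreaches are hypothetical names, and fetching the samples from CloudWatch is out of scope here.

```typescript
// Default targets taken from the list above (percent).
const METRIC_TARGETS: { [metric: string]: number } = {
    EngineCPUUtilization: 70,
    DatabaseMemoryUsageCountedForEvictPercentage: 75,
};

// Returns the names of metrics whose sampled value exceeds its target.
// Metrics without a sample are skipped rather than treated as breaches.
function findBreaches(samples: { [metric: string]: number }): string[] {
    return Object.keys(METRIC_TARGETS).filter(
        (metric) => samples[metric] !== undefined && samples[metric] > METRIC_TARGETS[metric]
    );
}
```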

Troubleshooting

Common Issues

  1. Deployment Failures

    • Issue: Stack creation fails due to parameter errors

    • Solution: Verify all required parameters are correctly set in configurations

  2. Connectivity Issues

    • Issue: Services can’t connect to the cluster

    • Solution: Check security groups, subnet routing, and network ACLs

  3. Authentication Errors

    • Issue: AUTH failures when connecting

    • Solution: Verify auth token in Secrets Manager and client configurations

  4. Cluster Mode Errors

    • Issue: Commands fail with “CROSSSLOT” errors

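    • Solution: Use a cluster-aware driver, and make sure all keys in a multi-key command or transaction hash to the same slot, for example by using hash tags such as {user1}:profile and {user1}:settings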

  5. Performance Issues

    • Issue: High latency or throughput limitations

    • Solution: Check instance type, shard count, and connection patterns
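For the CROSSSLOT case, it helps to see how keys map to slots. In cluster mode a key's slot is CRC16(key) mod 16384, and if the key contains a non-empty {...} section, only the text inside the first such braces is hashed. The sketch below reproduces that calculation so you can check whether two keys share a slot. It assumes ASCII keys for brevity (real clients hash the raw bytes), and the function names are illustrative.

```typescript
// CRC16-CCITT, XMODEM variant (polynomial 0x1021, initial value 0).
function crc16(bytes: number[]): number {
    let crc = 0;
    for (let i = 0; i < bytes.length; i++) {
        crc ^= (bytes[i] & 0xff) << 8;
        for (let bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) !== 0 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
        }
    }
    return crc;
}

// Returns the hash tag if the key has a non-empty {...} section, otherwise the whole key.
function hashTagOrKey(key: string): string {
    const open = key.indexOf('{');
    if (open === -1) return key;
    const close = key.indexOf('}', open + 1);
    if (close === -1 || close === open + 1) return key; // no closing brace, or empty tag
    return key.slice(open + 1, close);
}

// Computes the cluster slot (0..16383) for a key. Assumes ASCII characters only.
function keySlot(key: string): number {
    const effective = hashTagOrKey(key);
    const bytes: number[] = [];
    for (let i = 0; i < effective.length; i++) {
        bytes.push(effective.charCodeAt(i));
    }
    return crc16(bytes) % 16384;
}
```

With this, keySlot('{user1}:profile') and keySlot('{user1}:settings') are equal, so a multi-key command over both keys is accepted. Keys without a shared tag usually land in different slots and trigger CROSSSLOT.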

Support Escalation

If you encounter issues that can’t be resolved:

  1. Check the internal Cache Oncall Run-book.

  2. Open a service troubleshooting ticket in the AWS Support console.

  3. Contact the platform team responsible for ElastiCache infrastructure. Slack: https://amazon.enterprise.slack.com/archives/C018P0RTJ1W