Building Resilient Microservices: Patterns and Practices for Production Systems
Microservices architecture has become a de facto standard for building scalable, maintainable applications. However, the distributed nature of microservices introduces new challenges around reliability, consistency, and fault tolerance. Building truly resilient microservices requires implementing proven patterns and practices that help systems handle failures gracefully and maintain availability.
This comprehensive guide explores the essential patterns, tools, and practices for building microservices that can withstand the inevitable failures in distributed systems. We'll cover everything from circuit breakers and retry mechanisms to observability and chaos engineering.
Understanding Resilience in Distributed Systems
The Fallacies of Distributed Computing
Before diving into resilience patterns, it's crucial to understand the fundamental challenges of distributed systems. The "Fallacies of Distributed Computing" highlight assumptions that developers often make but shouldn't:
- The network is reliable - Networks fail, packets get lost, and latency varies
- Latency is zero - Network calls have inherent latency that affects performance
- Bandwidth is infinite - Network capacity is limited and shared
- The network is secure - Security must be built into every layer
- Topology doesn't change - Network topology evolves constantly
- There is one administrator - Multiple teams manage different parts of the system
- Transport cost is zero - Network operations have real costs
- The network is homogeneous - Different protocols, formats, and systems coexist
Understanding these fallacies helps us design systems that expect and handle failures gracefully.
Defining Resilience
Resilience in microservices encompasses several key characteristics:
- Fault Tolerance: The ability to continue operating when components fail
- Self-Healing: Automatic recovery from transient failures
- Graceful Degradation: Maintaining core functionality when non-critical services fail
- Adaptive Capacity: Learning from failures and improving over time
- Observability: Understanding system behavior and health in real-time
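Of these characteristics, graceful degradation is the one most often wired directly into request handling. A minimal sketch, assuming a hypothetical recommendation service and an in-process cache, falls back to the last known value when the non-critical dependency fails:

```typescript
// Graceful degradation sketch: serve the last known (or a safe default)
// response when a non-critical dependency fails. Names are hypothetical.
type Recommendations = string[];

const lastKnown = new Map<string, Recommendations>();

async function getRecommendations(
  userId: string,
  fetchLive: (id: string) => Promise<Recommendations>
): Promise<Recommendations> {
  try {
    const live = await fetchLive(userId);
    lastKnown.set(userId, live); // refresh the fallback on every success
    return live;
  } catch {
    // Degrade instead of failing the whole request:
    // return the cached value, or an empty list as the safe default.
    return lastKnown.get(userId) ?? [];
  }
}
```

The page still renders without recommendations; only the non-critical slot degrades.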
Core Resilience Patterns
Circuit Breaker Pattern
The circuit breaker pattern prevents cascading failures by monitoring service calls and "opening" when failure rates exceed thresholds:
interface CircuitBreakerConfig {
  failureThreshold: number;
  recoveryTimeout: number;
  monitoringPeriod: number;
}

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failureCount = 0;
  private lastFailureTime = 0;
  private successCount = 0;

  constructor(private config: CircuitBreakerConfig) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldAttemptReset()) {
        this.state = 'HALF_OPEN';
        this.successCount = 0; // start a fresh probe window
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= 3) { // Require multiple successes before closing
        this.state = 'CLOSED';
        this.successCount = 0;
      }
    }
  }

  private onFailure(): void {
    const now = Date.now();
    // Failures outside the monitoring period start a new failure window
    if (now - this.lastFailureTime > this.config.monitoringPeriod) {
      this.failureCount = 0;
    }
    this.failureCount++;
    this.lastFailureTime = now;
    // A failure during a HALF_OPEN probe reopens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.config.failureThreshold) {
      this.state = 'OPEN';
    }
  }

  private shouldAttemptReset(): boolean {
    return Date.now() - this.lastFailureTime >= this.config.recoveryTimeout;
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Usage example
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  recoveryTimeout: 30000, // 30 seconds
  monitoringPeriod: 60000 // 1 minute
});

const userService = {
  async getUser(id: string) {
    return circuitBreaker.execute(async () => {
      const response = await fetch(`/api/users/${id}`);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return response.json();
    });
  }
};
Retry Pattern with Exponential Backoff
Implement intelligent retry mechanisms that avoid overwhelming failing services:
interface RetryConfig {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitter: boolean;
}

class RetryPolicy {
  constructor(private config: RetryConfig) {}

  async execute<T>(
    operation: () => Promise<T>,
    isRetryable: (error: Error) => boolean = () => true
  ): Promise<T> {
    let lastError: Error;
    for (let attempt = 1; attempt <= this.config.maxAttempts; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;
        if (attempt === this.config.maxAttempts || !isRetryable(lastError)) {
          throw lastError;
        }
        const delay = this.calculateDelay(attempt);
        await this.sleep(delay);
      }
    }
    throw lastError!;
  }

  private calculateDelay(attempt: number): number {
    const exponentialDelay = this.config.baseDelay *
      Math.pow(this.config.backoffMultiplier, attempt - 1);
    let delay = Math.min(exponentialDelay, this.config.maxDelay);
    if (this.config.jitter) {
      // Add random jitter to prevent thundering herd
      delay = delay * (0.5 + Math.random() * 0.5);
    }
    return delay;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage with different retry strategies
const networkRetry = new RetryPolicy({
  maxAttempts: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2,
  jitter: true
});

const databaseRetry = new RetryPolicy({
  maxAttempts: 5,
  baseDelay: 500,
  maxDelay: 5000,
  backoffMultiplier: 1.5,
  jitter: true
});

// Retry only on specific errors
const isRetryableError = (error: Error): boolean => {
  if (error.message.includes('ECONNRESET')) return true;
  if (error.message.includes('timeout')) return true;
  if (error.message.includes('503')) return true;
  return false;
};

const result = await networkRetry.execute(
  () => fetch('/api/data').then(r => r.json()),
  isRetryableError
);
Bulkhead Pattern
Isolate critical resources to prevent failures from spreading:
class ResourcePool<T> {
  private available: T[] = [];
  private inUse = new Set<T>();
  private waiting: Array<{
    resolve: (resource: T) => void;
    reject: (error: Error) => void;
    timeout: NodeJS.Timeout;
  }> = [];

  constructor(
    private resources: T[],
    private maxWaitTime = 5000
  ) {
    this.available = [...resources];
  }

  async acquire(): Promise<T> {
    if (this.available.length > 0) {
      const resource = this.available.pop()!;
      this.inUse.add(resource);
      return resource;
    }
    return new Promise((resolve, reject) => {
      const timeout = setTimeout(() => {
        const index = this.waiting.findIndex(w => w.resolve === resolve);
        if (index >= 0) {
          this.waiting.splice(index, 1);
        }
        reject(new Error('Resource acquisition timeout'));
      }, this.maxWaitTime);
      this.waiting.push({ resolve, reject, timeout });
    });
  }

  release(resource: T): void {
    if (!this.inUse.has(resource)) {
      throw new Error('Resource not in use');
    }
    this.inUse.delete(resource);
    if (this.waiting.length > 0) {
      const waiter = this.waiting.shift()!;
      clearTimeout(waiter.timeout);
      this.inUse.add(resource);
      waiter.resolve(resource);
    } else {
      this.available.push(resource);
    }
  }

  async withResource<R>(fn: (resource: T) => Promise<R>): Promise<R> {
    const resource = await this.acquire();
    try {
      return await fn(resource);
    } finally {
      this.release(resource);
    }
  }

  getStats() {
    return {
      total: this.resources.length,
      available: this.available.length,
      inUse: this.inUse.size,
      waiting: this.waiting.length
    };
  }
}

// Database connection pool example
interface DatabaseConnection {
  query(sql: string): Promise<any>;
  close(): Promise<void>;
}

class DatabaseService {
  private criticalPool: ResourcePool<DatabaseConnection>;
  private generalPool: ResourcePool<DatabaseConnection>;

  constructor(
    criticalConnections: DatabaseConnection[],
    generalConnections: DatabaseConnection[]
  ) {
    this.criticalPool = new ResourcePool(criticalConnections);
    this.generalPool = new ResourcePool(generalConnections);
  }

  async executeCriticalQuery(sql: string): Promise<any> {
    return this.criticalPool.withResource(async (conn) => {
      return conn.query(sql);
    });
  }

  async executeGeneralQuery(sql: string): Promise<any> {
    return this.generalPool.withResource(async (conn) => {
      return conn.query(sql);
    });
  }

  getPoolStats() {
    return {
      critical: this.criticalPool.getStats(),
      general: this.generalPool.getStats()
    };
  }
}
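Pools isolate physical resources such as connections; when the scarce resource is simply concurrency, a counting semaphore gives a lighter-weight bulkhead. A sketch (separate from the pool above, and simplified to hand permits directly to waiters):

```typescript
// Bulkhead via counting semaphore: cap the number of in-flight calls to a
// dependency so one slow downstream cannot absorb every worker.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;

  constructor(permits: number) {
    this.available = permits;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>(resolve => this.queue.push(resolve));
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the permit directly to the next waiter
    } else {
      this.available++;
    }
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}
```

For example, `const paymentBulkhead = new Semaphore(10);` keeps at most ten concurrent calls to a payments dependency regardless of overall request volume.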
Timeout and Rate Limiting
Comprehensive Timeout Management
Implement timeouts at multiple levels to prevent resource exhaustion:
class TimeoutManager {
  static async withTimeout<T>(
    promise: Promise<T>,
    timeoutMs: number,
    timeoutMessage = 'Operation timed out'
  ): Promise<T> {
    let timer: NodeJS.Timeout | undefined;
    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(timeoutMessage)), timeoutMs);
    });
    try {
      return await Promise.race([promise, timeoutPromise]);
    } finally {
      // Clear the timer so it cannot keep the process alive after the race settles
      clearTimeout(timer);
    }
  }

  static async withRetryAndTimeout<T>(
    operation: () => Promise<T>,
    retryConfig: RetryConfig,
    timeoutMs: number
  ): Promise<T> {
    const retryPolicy = new RetryPolicy(retryConfig);
    return this.withTimeout(
      retryPolicy.execute(operation),
      timeoutMs
    );
  }
}

// HTTP client with comprehensive timeout handling
class ResilientHttpClient {
  private circuitBreakers = new Map<string, CircuitBreaker>();

  async request(url: string, options: RequestInit = {}): Promise<Response> {
    const domain = new URL(url).hostname;
    const circuitBreaker = this.getCircuitBreaker(domain);
    return circuitBreaker.execute(async () => {
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), 10000);
      try {
        const response = await fetch(url, {
          ...options,
          signal: controller.signal
        });
        if (!response.ok) {
          throw new Error(`HTTP ${response.status}: ${response.statusText}`);
        }
        return response;
      } finally {
        clearTimeout(timeoutId);
      }
    });
  }

  private getCircuitBreaker(domain: string): CircuitBreaker {
    if (!this.circuitBreakers.has(domain)) {
      this.circuitBreakers.set(domain, new CircuitBreaker({
        failureThreshold: 5,
        recoveryTimeout: 30000,
        monitoringPeriod: 60000
      }));
    }
    return this.circuitBreakers.get(domain)!;
  }
}
Rate Limiting Implementation
Protect services from being overwhelmed:
interface RateLimitConfig {
  windowSizeMs: number;
  maxRequests: number;
}

class SlidingWindowRateLimit {
  private requests: number[] = [];

  constructor(private config: RateLimitConfig) {}

  isAllowed(): boolean {
    const now = Date.now();
    const windowStart = now - this.config.windowSizeMs;
    // Remove old requests outside the window
    this.requests = this.requests.filter(time => time > windowStart);
    if (this.requests.length < this.config.maxRequests) {
      this.requests.push(now);
      return true;
    }
    return false;
  }

  getStats() {
    const now = Date.now();
    const windowStart = now - this.config.windowSizeMs;
    const currentRequests = this.requests.filter(time => time > windowStart);
    return {
      currentRequests: currentRequests.length,
      maxRequests: this.config.maxRequests,
      windowSizeMs: this.config.windowSizeMs,
      // Guard against Math.min() on an empty array returning Infinity
      resetTime: currentRequests.length > 0
        ? Math.min(...currentRequests) + this.config.windowSizeMs
        : now
    };
  }
}

// Token bucket rate limiter
class TokenBucketRateLimit {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  isAllowed(tokensRequested = 1): boolean {
    this.refill();
    if (this.tokens >= tokensRequested) {
      this.tokens -= tokensRequested;
      return true;
    }
    return false;
  }

  private refill(): void {
    const now = Date.now();
    const timePassed = (now - this.lastRefill) / 1000;
    const tokensToAdd = timePassed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  getStats() {
    this.refill();
    return {
      availableTokens: Math.floor(this.tokens),
      capacity: this.capacity,
      refillRate: this.refillRate
    };
  }
}
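Wired into a request path, either limiter gates work before it is accepted rather than after it has queued. A standalone sketch (with a minimal token bucket re-declared inline so the snippet runs on its own; the 429-style error is a convention):

```typescript
// Fail fast when the bucket is empty instead of queueing work that would
// overwhelm the downstream service. Minimal inline bucket for illustration.
class Bucket {
  private tokens: number;
  private last = Date.now();

  constructor(private capacity: number, private ratePerSec: number) {
    this.tokens = capacity;
  }

  take(): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.ratePerSec
    );
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new Bucket(2, 1); // burst of 2, refill 1 token/second

function guarded<T>(op: () => T): T {
  if (!bucket.take()) {
    // Reject up front; the caller can retry later with backoff
    throw new Error('429: rate limit exceeded');
  }
  return op();
}
```

Rejecting at admission keeps queues short and gives callers an unambiguous signal to back off.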
Health Checks and Monitoring
Comprehensive Health Check System
Implement multi-level health checks for better observability:
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';

interface HealthCheck {
  name: string;
  check(): Promise<HealthCheckResult>;
}

interface HealthCheckResult {
  status: HealthStatus;
  message?: string;
  details?: Record<string, any>;
  duration: number;
}

class HealthCheckManager {
  private checks = new Map<string, HealthCheck>();
  private cache = new Map<string, { result: HealthCheckResult; timestamp: number }>();
  private readonly cacheTimeout = 30000; // 30 seconds

  register(check: HealthCheck): void {
    this.checks.set(check.name, check);
  }

  async runCheck(name: string): Promise<HealthCheckResult> {
    const check = this.checks.get(name);
    if (!check) {
      throw new Error(`Health check '${name}' not found`);
    }
    const cached = this.cache.get(name);
    if (cached && Date.now() - cached.timestamp < this.cacheTimeout) {
      return cached.result;
    }
    const startTime = Date.now();
    try {
      const result = await TimeoutManager.withTimeout(
        check.check(),
        5000,
        'Health check timeout'
      );
      result.duration = Date.now() - startTime;
      this.cache.set(name, { result, timestamp: Date.now() });
      return result;
    } catch (error) {
      const result: HealthCheckResult = {
        status: 'unhealthy',
        message: error instanceof Error ? error.message : 'Unknown error',
        duration: Date.now() - startTime
      };
      this.cache.set(name, { result, timestamp: Date.now() });
      return result;
    }
  }

  async runAllChecks(): Promise<Record<string, HealthCheckResult>> {
    const results: Record<string, HealthCheckResult> = {};
    await Promise.allSettled(
      Array.from(this.checks.keys()).map(async (name) => {
        try {
          results[name] = await this.runCheck(name);
        } catch (error) {
          results[name] = {
            status: 'unhealthy',
            message: error instanceof Error ? error.message : 'Unknown error',
            duration: 0
          };
        }
      })
    );
    return results;
  }

  getOverallStatus(results: Record<string, HealthCheckResult>): HealthStatus {
    const statuses = Object.values(results).map(r => r.status);
    if (statuses.every(s => s === 'healthy')) {
      return 'healthy';
    }
    if (statuses.some(s => s === 'unhealthy')) {
      return 'unhealthy';
    }
    return 'degraded';
  }
}
// Example health checks
class DatabaseHealthCheck implements HealthCheck {
  name = 'database';

  constructor(private db: DatabaseConnection) {}

  async check(): Promise<HealthCheckResult> {
    try {
      await this.db.query('SELECT 1');
      return {
        status: 'healthy',
        message: 'Database connection successful',
        duration: 0
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `Database connection failed: ${error}`,
        duration: 0
      };
    }
  }
}

class ExternalServiceHealthCheck implements HealthCheck {
  name = 'external-api';

  constructor(private httpClient: ResilientHttpClient, private url: string) {}

  async check(): Promise<HealthCheckResult> {
    try {
      const response = await this.httpClient.request(`${this.url}/health`);
      return {
        status: 'healthy',
        message: 'External service is available',
        details: { statusCode: response.status },
        duration: 0
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `External service unavailable: ${error}`,
        duration: 0
      };
    }
  }
}
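A common way to expose these checks is a readiness endpoint whose HTTP status reflects the aggregate result. One conventional mapping (the 200/207/503 choice here is an assumption, not a standard; `HealthStatus` is re-declared so the snippet stands alone):

```typescript
// Map an aggregated health status to an HTTP response code for /health.
// Re-declared type so this snippet is self-contained.
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';

function statusToHttpCode(status: HealthStatus): number {
  switch (status) {
    case 'healthy':
      return 200; // keep this instance in rotation
    case 'degraded':
      return 207; // still serving, but some non-critical checks are failing
    case 'unhealthy':
      return 503; // load balancers should stop routing traffic here
  }
}
```

Some load balancers treat any non-2xx response as unhealthy, so whether `degraded` maps to 207 or 200 depends on how aggressively you want instances pulled from rotation.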
Observability and Metrics
Structured Logging
Implement comprehensive logging for distributed systems:
interface LogContext {
  traceId?: string;
  spanId?: string;
  userId?: string;
  requestId?: string;
  service: string;
  version: string;
}

type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  context: LogContext;
  metadata?: Record<string, any>;
  error?: {
    name: string;
    message: string;
    stack?: string;
  };
}

class StructuredLogger {
  constructor(private baseContext: LogContext) {}

  private createLogEntry(
    level: LogLevel,
    message: string,
    metadata?: Record<string, any>,
    error?: Error
  ): LogEntry {
    return {
      timestamp: new Date().toISOString(),
      level,
      message,
      context: { ...this.baseContext },
      metadata,
      error: error ? {
        name: error.name,
        message: error.message,
        stack: error.stack
      } : undefined
    };
  }

  debug(message: string, metadata?: Record<string, any>): void {
    this.log(this.createLogEntry('debug', message, metadata));
  }

  info(message: string, metadata?: Record<string, any>): void {
    this.log(this.createLogEntry('info', message, metadata));
  }

  warn(message: string, metadata?: Record<string, any>): void {
    this.log(this.createLogEntry('warn', message, metadata));
  }

  error(message: string, error?: Error, metadata?: Record<string, any>): void {
    this.log(this.createLogEntry('error', message, metadata, error));
  }

  fatal(message: string, error?: Error, metadata?: Record<string, any>): void {
    this.log(this.createLogEntry('fatal', message, metadata, error));
  }

  withContext(additionalContext: Partial<LogContext>): StructuredLogger {
    return new StructuredLogger({
      ...this.baseContext,
      ...additionalContext
    });
  }

  private log(entry: LogEntry): void {
    // In production, send to logging service (ELK, Splunk, etc.)
    console.log(JSON.stringify(entry));
  }
}
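Structured entries only correlate across services if a trace id travels with the request. A minimal propagation sketch; the `x-trace-id` header name is a local convention (production systems typically use the W3C `traceparent` header instead):

```typescript
// Reuse the caller's trace id if present; otherwise start a new trace.
// Each hop forwards the header so log entries can be joined by traceId.
import { randomUUID } from 'crypto';

function withTraceHeaders(
  incoming: Record<string, string>
): Record<string, string> {
  const traceId = incoming['x-trace-id'] ?? randomUUID();
  return { ...incoming, 'x-trace-id': traceId };
}
```

The resulting trace id is what you would place into `LogContext.traceId` on a per-request child logger via `withContext`.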
// Metrics collection
interface MetricPoint {
  name: string;
  value: number;
  timestamp: number;
  tags: Record<string, string>;
}

class MetricsCollector {
  private metrics: MetricPoint[] = [];
  private counters = new Map<string, number>();
  private gauges = new Map<string, number>();
  private histograms = new Map<string, number[]>();

  counter(name: string, tags: Record<string, string> = {}): void {
    const key = `${name}:${JSON.stringify(tags)}`;
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
    this.addMetric({
      name: `${name}.count`,
      value: this.counters.get(key)!,
      timestamp: Date.now(),
      tags
    });
  }

  gauge(name: string, value: number, tags: Record<string, string> = {}): void {
    const key = `${name}:${JSON.stringify(tags)}`;
    this.gauges.set(key, value);
    this.addMetric({
      name,
      value,
      timestamp: Date.now(),
      tags
    });
  }

  histogram(name: string, value: number, tags: Record<string, string> = {}): void {
    const key = `${name}:${JSON.stringify(tags)}`;
    const values = this.histograms.get(key) || [];
    values.push(value);
    this.histograms.set(key, values);
    // Calculate percentiles
    const sorted = [...values].sort((a, b) => a - b);
    const p50 = this.percentile(sorted, 0.5);
    const p95 = this.percentile(sorted, 0.95);
    const p99 = this.percentile(sorted, 0.99);
    this.addMetric({ name: `${name}.p50`, value: p50, timestamp: Date.now(), tags });
    this.addMetric({ name: `${name}.p95`, value: p95, timestamp: Date.now(), tags });
    this.addMetric({ name: `${name}.p99`, value: p99, timestamp: Date.now(), tags });
  }

  private percentile(sorted: number[], p: number): number {
    const index = Math.ceil(sorted.length * p) - 1;
    return sorted[Math.max(0, index)];
  }

  private addMetric(metric: MetricPoint): void {
    this.metrics.push(metric);
    // In production, send to metrics service (Prometheus, DataDog, etc.)
    if (this.metrics.length > 1000) {
      this.flush();
    }
  }

  flush(): void {
    // Send metrics to external service
    console.log(`Flushing ${this.metrics.length} metrics`);
    this.metrics.length = 0;
  }
}
Deployment and Infrastructure Patterns
Blue-Green Deployment Strategy
Implement zero-downtime deployments:
interface DeploymentEnvironment {
  name: string;
  version: string;
  healthEndpoint: string;
  instances: string[];
}

// Minimal load balancer abstraction used by the deployment below
interface LoadBalancer {
  setTrafficSplit(split: Record<string, number>): Promise<void>;
}

class BlueGreenDeployment {
  constructor(
    private logger: StructuredLogger,
    private healthChecker: HealthCheckManager,
    private loadBalancer: LoadBalancer
  ) {}

  async deploy(
    currentEnv: DeploymentEnvironment,
    newEnv: DeploymentEnvironment
  ): Promise<void> {
    this.logger.info('Starting blue-green deployment', {
      from: currentEnv.version,
      to: newEnv.version
    });
    try {
      // Step 1: Deploy to inactive environment
      await this.deployToEnvironment(newEnv);
      // Step 2: Health check new environment
      await this.waitForHealthy(newEnv);
      // Step 3: Run smoke tests
      await this.runSmokeTests(newEnv);
      // Step 4: Switch traffic gradually
      await this.switchTraffic(currentEnv, newEnv);
      // Step 5: Monitor and validate
      await this.monitorDeployment(newEnv);
      this.logger.info('Blue-green deployment completed successfully');
    } catch (error) {
      this.logger.error('Deployment failed, rolling back', error as Error);
      await this.rollback(currentEnv, newEnv);
      throw error;
    }
  }

  private async deployToEnvironment(env: DeploymentEnvironment): Promise<void> {
    this.logger.info(`Deploying to ${env.name}`, { version: env.version });
    // Deploy application to all instances
    await Promise.all(
      env.instances.map(instance => this.deployToInstance(instance, env.version))
    );
  }

  private async waitForHealthy(env: DeploymentEnvironment): Promise<void> {
    const maxAttempts = 30;
    const delay = 10000; // 10 seconds
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        const healthy = await this.checkEnvironmentHealth(env);
        if (healthy) {
          this.logger.info(`Environment ${env.name} is healthy`);
          return;
        }
      } catch (error) {
        this.logger.warn(`Health check attempt ${attempt} failed`, { error });
      }
      if (attempt < maxAttempts) {
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
    throw new Error(`Environment ${env.name} failed to become healthy`);
  }

  private async switchTraffic(
    oldEnv: DeploymentEnvironment,
    newEnv: DeploymentEnvironment
  ): Promise<void> {
    const steps = [10, 25, 50, 75, 100]; // Percentage of traffic to new environment
    for (const percentage of steps) {
      this.logger.info(`Switching ${percentage}% traffic to new environment`);
      await this.loadBalancer.setTrafficSplit({
        [oldEnv.name]: 100 - percentage,
        [newEnv.name]: percentage
      });
      // Wait and monitor
      await new Promise(resolve => setTimeout(resolve, 60000)); // 1 minute
      const metrics = await this.collectMetrics(newEnv);
      if (metrics.errorRate > 0.01) { // 1% error rate threshold
        throw new Error(`High error rate detected: ${metrics.errorRate}`);
      }
    }
  }

  private async rollback(
    oldEnv: DeploymentEnvironment,
    newEnv: DeploymentEnvironment
  ): Promise<void> {
    this.logger.info('Rolling back deployment');
    await this.loadBalancer.setTrafficSplit({
      [oldEnv.name]: 100,
      [newEnv.name]: 0
    });
    this.logger.info('Rollback completed');
  }

  // Implementation stubs
  private async deployToInstance(instance: string, version: string): Promise<void> {
    // Implementation depends on deployment platform
  }

  private async checkEnvironmentHealth(env: DeploymentEnvironment): Promise<boolean> {
    // Check health of all instances
    return true;
  }

  private async runSmokeTests(env: DeploymentEnvironment): Promise<void> {
    // Run critical path tests
  }

  private async monitorDeployment(env: DeploymentEnvironment): Promise<void> {
    // Monitor key metrics for specified duration
  }

  private async collectMetrics(env: DeploymentEnvironment): Promise<{ errorRate: number }> {
    // Collect and analyze metrics
    return { errorRate: 0 };
  }
}
Chaos Engineering
Implementing Chaos Testing
Proactively test system resilience:
interface ChaosExperiment {
  name: string;
  description: string;
  execute(): Promise<void>;
  cleanup(): Promise<void>;
}

class NetworkLatencyExperiment implements ChaosExperiment {
  name = 'network-latency';
  description = 'Introduces network latency to test timeout handling';

  constructor(
    private targetService: string,
    private latencyMs: number,
    private duration: number
  ) {}

  async execute(): Promise<void> {
    console.log(`Introducing ${this.latencyMs}ms latency to ${this.targetService}`);
    // Implementation would use tools like tc (traffic control) or toxiproxy
  }

  async cleanup(): Promise<void> {
    console.log(`Removing latency from ${this.targetService}`);
    // Remove the latency injection
  }
}

class ServiceFailureExperiment implements ChaosExperiment {
  name = 'service-failure';
  description = 'Simulates complete service failure';

  constructor(
    private targetService: string,
    private duration: number
  ) {}

  async execute(): Promise<void> {
    console.log(`Stopping ${this.targetService} for ${this.duration}ms`);
    // Implementation would stop the service or block traffic
  }

  async cleanup(): Promise<void> {
    console.log(`Restoring ${this.targetService}`);
    // Restore the service
  }
}
class ChaosEngineer {
  constructor(
    private logger: StructuredLogger,
    private metrics: MetricsCollector
  ) {}

  async runExperiment(
    experiment: ChaosExperiment,
    monitoringDuration: number = 300000 // 5 minutes
  ): Promise<void> {
    this.logger.info('Starting chaos experiment', {
      experiment: experiment.name,
      description: experiment.description
    });
    const startTime = Date.now();
    try {
      // Collect baseline metrics
      const baseline = await this.collectBaselineMetrics();
      // Execute the experiment
      await experiment.execute();
      // Monitor system behavior
      const results = await this.monitorSystem(monitoringDuration);
      // Analyze results
      const analysis = this.analyzeResults(baseline, results);
      this.logger.info('Chaos experiment completed', {
        experiment: experiment.name,
        duration: Date.now() - startTime,
        analysis
      });
    } catch (error) {
      this.logger.error('Chaos experiment failed', error as Error, {
        experiment: experiment.name
      });
      throw error;
    } finally {
      // Always cleanup
      await experiment.cleanup();
    }
  }

  private async collectBaselineMetrics(): Promise<Record<string, number>> {
    // Collect system metrics before experiment
    return {
      responseTime: 100,
      errorRate: 0.001,
      throughput: 1000
    };
  }

  private async monitorSystem(duration: number): Promise<Record<string, number>> {
    // Monitor system during experiment
    return new Promise(resolve => {
      setTimeout(() => {
        resolve({
          responseTime: 150,
          errorRate: 0.005,
          throughput: 950
        });
      }, duration);
    });
  }

  private analyzeResults(
    baseline: Record<string, number>,
    results: Record<string, number>
  ): Record<string, any> {
    return {
      responseTimeIncrease: ((results.responseTime - baseline.responseTime) / baseline.responseTime) * 100,
      errorRateIncrease: ((results.errorRate - baseline.errorRate) / baseline.errorRate) * 100,
      throughputDecrease: ((baseline.throughput - results.throughput) / baseline.throughput) * 100
    };
  }
}
Conclusion
Building resilient microservices requires a comprehensive approach that addresses failures at every level of the system. The patterns and practices outlined in this guide provide a foundation for creating systems that can handle the inevitable failures in distributed environments.
Key takeaways for building resilient microservices:
- Embrace Failure: Design systems that expect and handle failures gracefully
- Implement Defense in Depth: Use multiple resilience patterns together
- Monitor Everything: Comprehensive observability is crucial for understanding system behavior
- Test Resilience: Use chaos engineering to validate your resilience mechanisms
- Automate Recovery: Implement self-healing capabilities where possible
- Plan for Degradation: Design graceful degradation paths for non-critical functionality
Resilience is not a destination but a journey. Continuously evaluate and improve your systems' ability to handle failures. Regular chaos engineering exercises, thorough monitoring, and post-incident reviews help identify weaknesses and improve overall system resilience.
Remember that the goal is not to prevent all failures—that's impossible in distributed systems. Instead, focus on building systems that can detect, isolate, and recover from failures quickly while maintaining acceptable service levels for your users. The investment in resilience patterns pays dividends in reduced downtime, improved user experience, and increased confidence in your system's reliability.