SAP BTP Resilience Reference

Overview

Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience.

Key Resources

Resource                                  Description
Developing Resilient Apps on SAP BTP      Patterns and examples
Route Multi-Region Traffic                GitHub implementation
Architecting Multi-Region Resiliency      Discovery Center reference

Cloud Foundry Resilience

Availability Zones

Automatic Distribution:

  • Applications spread across multiple AZs
  • No manual configuration required
  • Platform handles placement

During AZ Failure:

  • ~1/3 of instances become unavailable (in a 3-zone deployment)
  • Remaining instances handle increased load
  • Cloud Foundry reschedules to healthy zones

Best Practice: Configure enough instances that the survivors can absorb the load when a zone fails:

Minimum instances = Normal load instances × 1.5

For example, an app that needs 4 instances under normal load should run at least 6; losing one of three zones then still leaves about 4 healthy instances.

Instance Configuration

# manifest.yml
applications:
  - name: my-app
    instances: 3  # At least 3 for HA
    memory: 512M
    health-check-type: http
    health-check-http-endpoint: /health

Health Checks

// Express health endpoint
// (checkDatabase/checkMessaging are assumed async helpers returning booleans)
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    messaging: await checkMessaging()
  };
  const up = Object.values(checks).every(Boolean);
  res.status(up ? 200 : 503).json({ status: up ? 'UP' : 'DOWN', checks });
});

Kyma Resilience

Istio Service Mesh

Features:

  • Automatic retries
  • Circuit breakers
  • Timeouts
  • Load balancing

Configuration

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
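
The DestinationRule above covers connection pooling and circuit breaking (outlier detection). The retries and timeouts from the feature list are configured on a VirtualService instead. A minimal sketch, with the resource name and values illustrative:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
      timeout: 5s                    # overall per-request deadline
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure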

Pod Distribution

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app

ABAP Resilience

Built-in Features

  • Automatic workload distribution
  • Work process management
  • HANA failover support
  • Session management

Elastic Scaling

Automatic response to load:

  • Scales between 1 ACU and the configured maximum
  • Adjusts in 0.5 ACU increments
  • Scaling decisions are metrics-based

Resilience Patterns

Circuit Breaker

Purpose: Prevent cascading failures

States:

  1. Closed: Normal operation
  2. Open: Fail fast, skip calls
  3. Half-Open: Test recovery

Implementation (CAP - Node.js):

Note: The opossum library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.

const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                 // fail calls slower than 3 s
  errorThresholdPercentage: 50,  // open the circuit at a 50% failure rate
  resetTimeout: 30000            // try half-open after 30 s
});

// Serve cached data while the circuit is open
breaker.fallback(() => getCachedData());

const result = await breaker.fire(serviceParams);
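
opossum also emits state-change events, which can feed the monitoring metrics discussed later (event names per the opossum documentation):

// Surface breaker state changes to logs/monitoring
breaker.on('open', () => console.warn('Circuit opened - failing fast'));
breaker.on('halfOpen', () => console.info('Circuit half-open - testing recovery'));
breaker.on('close', () => console.info('Circuit closed - normal operation'));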

Retry with Exponential Backoff

Purpose: Handle transient failures

async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
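
A usage sketch (fetchOrders is a placeholder for any async call that may fail transiently):

// Three attempts total, with 1 s and 2 s delays between them
const orders = await retryWithBackoff(() => fetchOrders(), 3);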

Bulkhead

Purpose: Isolate failures

const Semaphore = require('semaphore'); // third-party package; same caveats as opossum above

const dbPool = Semaphore(10);  // Max 10 concurrent DB calls
const apiPool = Semaphore(20); // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}

Timeout

Purpose: Prevent hanging requests

const timeout = (promise, ms) => {
  let timer;
  return Promise.race([
    promise,
    new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('Timeout')), ms);
    })
  ]).finally(() => clearTimeout(timer)); // don't leak the timer when the call wins
};

const result = await timeout(fetchData(), 5000);

Graceful Degradation

Purpose: Provide reduced functionality instead of failing

async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };

    // Final fallback
    return getBasicDetails(id);
  }
}

Multi-Region Architecture

Active-Passive

Region A (Primary)     Region B (Standby)
     ↓                      ↓
  Active                 Standby
     ↓                      ↓
  HANA Cloud           HANA Cloud (Replica)

Failover: Manual or automated switch
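
The switch itself is usually DNS- or load-balancer-driven (see the Route Multi-Region Traffic reference above). As a minimal client-side illustration of the same idea (region URLs are placeholders; assumes Node 18+ global fetch and reuses the timeout helper from above):

const REGIONS = [
  'https://my-app.eu10.example.com',  // primary
  'https://my-app.us10.example.com'   // standby
];

async function callWithFailover(path) {
  let lastError;
  for (const base of REGIONS) {
    try {
      return await timeout(fetch(base + path), 5000);
    } catch (error) {
      lastError = error; // region unreachable or slow: try the next
    }
  }
  throw lastError;
}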

Active-Active

        Global Load Balancer
              ↓
    ┌─────────┴─────────┐
    ↓                   ↓
Region A              Region B
    ↓                   ↓
HANA Cloud           HANA Cloud
    ↓                   ↓
    └───── Replication ─┘

Use Case: Highest availability requirements

Monitoring for Resilience

Key Metrics

Metric                  Threshold  Action
Error rate              > 1%       Alert, investigate
Latency (p99)           > 2 s      Scale, optimize
Circuit breaker trips   Any        Review dependencies
Retry rate              > 5%       Check downstream services
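
A minimal sketch of how the error_rate value might be produced in the Express app above (counter names and the /metrics endpoint are illustrative; production setups typically use a metrics library):

let totalRequests = 0;
let failedRequests = 0;

// Count every response; treat 5xx responses as errors
app.use((req, res, next) => {
  totalRequests++;
  res.on('finish', () => {
    if (res.statusCode >= 500) failedRequests++;
  });
  next();
});

app.get('/metrics', (req, res) => {
  res.json({ error_rate: totalRequests ? failedRequests / totalRequests : 0 });
});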

Alerting

# SAP Alert Notification example
conditions:
  - name: high-error-rate
    propertyKey: error_rate
    predicate: GREATER_THAN
    propertyValue: "0.01"

actions:
  - name: page-oncall
    type: EMAIL
    properties:
      destination: oncall@example.com

Best Practices

Design

  1. Assume failure - Everything can fail
  2. Design for graceful degradation
  3. Implement health checks
  4. Use async where possible
  5. Plan for data consistency

Implementation

  1. Set timeouts on all external calls
  2. Implement retries with backoff
  3. Use circuit breakers for dependencies
  4. Cache aggressively where appropriate
  5. Log and monitor all failures

Operations

  1. Run chaos engineering tests
  2. Practice disaster recovery
  3. Monitor SLIs/SLOs
  4. Automate failover where possible

Source Documentation