# SAP BTP Resilience Reference ## Overview Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience. ## Key Resources | Resource | Description | |----------|-------------| | Developing Resilient Apps on SAP BTP | Patterns and examples | | Route Multi-Region Traffic | GitHub implementation | | Architecting Multi-Region Resiliency | Discovery Center reference | ## Cloud Foundry Resilience ### Availability Zones **Automatic Distribution:** - Applications spread across multiple AZs - No manual configuration required - Platform handles placement **During AZ Failure:** - ~1/3 instances become unavailable (3-zone deployment) - Remaining instances handle increased load - Cloud Foundry reschedules to healthy zones **Best Practice:** Configure sufficient instances to handle load during zone failures: ``` Minimum instances = Normal load instances × 1.5 ``` ### Instance Configuration ```yaml # manifest.yml applications: - name: my-app instances: 3 # At least 3 for HA memory: 512M health-check-type: http health-check-http-endpoint: /health ``` ### Health Checks ```javascript // Express health endpoint app.get('/health', (req, res) => { const health = { status: 'UP', checks: { database: checkDatabase(), messaging: checkMessaging() } }; res.status(200).json(health); }); ``` ## Kyma Resilience ### Istio Service Mesh **Features:** - Automatic retries - Circuit breakers - Timeouts - Load balancing ### Configuration ```yaml apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: my-app-dr spec: host: my-app trafficPolicy: connectionPool: tcp: maxConnections: 100 http: h2UpgradePolicy: UPGRADE http1MaxPendingRequests: 100 http2MaxRequests: 1000 outlierDetection: consecutive5xxErrors: 5 interval: 30s baseEjectionTime: 30s maxEjectionPercent: 50 ``` ### Pod Distribution ```yaml apiVersion: apps/v1 kind: Deployment spec: replicas: 3 template: spec: topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: app: my-app ``` ## ABAP Resilience ### Built-in Features - Automatic workload distribution - Work process management - HANA failover support - Session management ### Elastic Scaling Automatic response to load: - Scale between 1 ACU and configured max - 0.5 ACU increments - Metrics-based decisions ## Resilience Patterns ### Circuit Breaker **Purpose**: Prevent cascading failures **States:** 1. **Closed**: Normal operation 2. **Open**: Fail fast, skip calls 3. **Half-Open**: Test recovery **Implementation (CAP - Node.js):** > **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling. ```javascript const CircuitBreaker = require('opossum'); const breaker = new CircuitBreaker(callRemoteService, { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 }); breaker.fallback(() => getCachedData()); const result = await breaker.fire(serviceParams); ``` ### Retry with Exponential Backoff **Purpose**: Handle transient failures ```javascript async function retryWithBackoff(fn, maxRetries = 3) { for (let i = 0; i < maxRetries; i++) { try { return await fn(); } catch (error) { if (i === maxRetries - 1) throw error; const delay = Math.pow(2, i) * 1000; await new Promise(r => setTimeout(r, delay)); } } } ``` ### Bulkhead **Purpose**: Isolate failures ```javascript const Semaphore = require('semaphore'); const dbPool = Semaphore(10); // Max 10 concurrent DB calls const apiPool = Semaphore(20); // Max 20 concurrent API calls async function callDatabase() { return new Promise((resolve, reject) => { dbPool.take(() => { performDbCall() .then(resolve) .catch(reject) .finally(() => dbPool.leave()); }); }); } ``` ### Timeout **Purpose**: Prevent hanging requests ```javascript const timeout = (promise, ms) => { return Promise.race([ promise, new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), ms) ) ]); }; const result = await timeout(fetchData(), 5000); ``` ### Graceful Degradation **Purpose**: Provide reduced functionality instead of failing ```javascript async function getProductDetails(id) { try { // Try full data return await getFromPrimaryService(id); } catch (error) { // Fallback to cached/reduced data const cached = await getFromCache(id); if (cached) return { ...cached, _degraded: true }; // Final fallback return getBasicDetails(id); } } ``` ## Multi-Region Architecture ### Active-Passive ``` Region A (Primary) Region B (Standby) ↓ ↓ Active Standby ↓ ↓ HANA Cloud HANA Cloud (Replica) ``` **Failover**: Manual or automated switch ### Active-Active ``` Global Load Balancer ↓ ┌─────────┴─────────┐ ↓ ↓ Region A Region B ↓ ↓ HANA Cloud HANA Cloud ↓ ↓ └───── Replication ─┘ ``` **Use Case**: Highest availability requirements ## Monitoring for Resilience ### Key Metrics | Metric | Threshold | Action | |--------|-----------|--------| | Error rate | > 1% | Alert, investigate | | Latency p99 | > 2s | Scale, optimize | | Circuit breaker trips | Any | Review dependencies | | Retry rate | > 5% | Check downstream services | ### Alerting ```yaml # SAP Alert Notification example conditions: - name: high-error-rate propertyKey: error_rate predicate: GREATER_THAN propertyValue: "0.01" actions: - name: page-oncall type: EMAIL properties: destination: oncall@example.com ``` ## Best Practices ### Design 1. **Assume failure** - Everything can fail 2. **Design for graceful degradation** 3. **Implement health checks** 4. **Use async where possible** 5. **Plan for data consistency** ### Implementation 1. **Set timeouts** on all external calls 2. **Implement retries** with backoff 3. **Use circuit breakers** for dependencies 4. **Cache aggressively** where appropriate 5. **Log and monitor** all failures ### Operations 1. **Run chaos engineering** tests 2. **Practice disaster recovery** 3. **Monitor SLIs/SLOs** 4. **Automate failover** where possible ## Source Documentation - Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md) - SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)