# SAP BTP Resilience Reference

## Overview

Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience.

## Key Resources

| Resource | Description |
|----------|-------------|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |
## Cloud Foundry Resilience

### Availability Zones

**Automatic Distribution:**
- Applications spread across multiple AZs
- No manual configuration required
- Platform handles placement

**During AZ Failure:**
- ~1/3 of instances become unavailable (3-zone deployment)
- Remaining instances handle the increased load
- Cloud Foundry reschedules to healthy zones

**Best Practice:**
Configure sufficient instances to handle the load during a zone failure:

```
Minimum instances = Normal load instances × 1.5
```
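For example, if 4 instances handle the normal load, run at least 6 (4 × 1.5): with 2 instances per zone, losing one of three zones still leaves 4 healthy instances to carry the full load.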
### Instance Configuration

```yaml
# manifest.yml
applications:
- name: my-app
  instances: 3                           # At least 3 for HA
  memory: 512M
  health-check-type: http
  health-check-http-endpoint: /health
```
### Health Checks

```javascript
// Express health endpoint
// checkDatabase() and checkMessaging() are app-specific probes, assumed to
// resolve to 'UP' or 'DOWN'
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    messaging: await checkMessaging()
  };
  const healthy = Object.values(checks).every(c => c === 'UP');
  res.status(healthy ? 200 : 503).json({ status: healthy ? 'UP' : 'DOWN', checks });
});
```
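Cloud Foundry polls this endpoint (wired up via `health-check-http-endpoint` in the manifest above) and restarts instances whose health checks fail.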
## Kyma Resilience

### Istio Service Mesh

**Features:**
- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing

### Configuration

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
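The DestinationRule above covers connection pooling and outlier detection (circuit breaking). The retries and timeouts listed under Features are typically configured on a VirtualService; a minimal sketch, assuming the same `my-app` service with default routing:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app
    timeout: 5s                # overall request deadline
    retries:
      attempts: 3              # retry transient failures up to 3 times
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure,retriable-status-codes
```

Keep the overall `timeout` large enough to accommodate the configured retry attempts.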
### Pod Distribution

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app   # must match the labels on the pod template
```
## ABAP Resilience

### Built-in Features

- Automatic workload distribution
- Work process management
- HANA failover support
- Session management

### Elastic Scaling

Automatic response to load:
- Scale between 1 ACU and configured max
- 0.5 ACU increments
- Metrics-based decisions
## Resilience Patterns

### Circuit Breaker

**Purpose**: Prevent cascading failures

**States:**
1. **Closed**: Normal operation
2. **Open**: Fail fast, skip calls
3. **Half-Open**: Test recovery

**Implementation (CAP - Node.js):**

> **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.

```javascript
const CircuitBreaker = require('opossum');

// callRemoteService is the guarded call; getCachedData supplies fallback data
const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                 // treat calls slower than 3 s as failures
  errorThresholdPercentage: 50,  // open the circuit when 50% of calls fail
  resetTimeout: 30000            // attempt a half-open probe after 30 s
});

breaker.fallback(() => getCachedData());

const result = await breaker.fire(serviceParams);
```
### Retry with Exponential Backoff

**Purpose**: Handle transient failures

```javascript
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;  // 1 s, 2 s, 4 s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```
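A usage sketch, with `fetchExchangeRates` standing in for any remote call that can fail transiently:

```javascript
const rates = await retryWithBackoff(() => fetchExchangeRates());
```

If many clients retry in lockstep, consider adding random jitter to the computed delay.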
### Bulkhead

**Purpose**: Isolate failures
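> **Note**: Like `opossum` above, the `semaphore` package used below is a third-party community library rather than an SAP-supported component; review its maintenance status before production use.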
```javascript
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);   // Max 10 concurrent DB calls
const apiPool = Semaphore(20);  // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}
```
### Timeout

**Purpose**: Prevent hanging requests

```javascript
const timeout = (promise, ms) => {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), ms)
    )
  ]);
};

const result = await timeout(fetchData(), 5000);
```
### Graceful Degradation

**Purpose**: Provide reduced functionality instead of failing

```javascript
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };

    // Final fallback
    return getBasicDetails(id);
  }
}
```
## Multi-Region Architecture

### Active-Passive

```
Region A (Primary)      Region B (Standby)
        ↓                       ↓
      Active                 Standby
        ↓                       ↓
   HANA Cloud          HANA Cloud (Replica)
```

**Failover**: Manual or automated switch

### Active-Active

```
         Global Load Balancer
                  ↓
        ┌─────────┴─────────┐
        ↓                   ↓
    Region A            Region B
        ↓                   ↓
   HANA Cloud          HANA Cloud
        ↓                   ↓
        └──── Replication ──┘
```

**Use Case**: Highest availability requirements
## Monitoring for Resilience

### Key Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |

### Alerting

```yaml
# SAP Alert Notification example
conditions:
- name: high-error-rate
  propertyKey: error_rate
  predicate: GREATER_THAN
  propertyValue: "0.01"

actions:
- name: page-oncall
  type: EMAIL
  properties:
    destination: oncall@example.com
```
## Best Practices

### Design

1. **Assume failure** - Everything can fail
2. **Design for graceful degradation**
3. **Implement health checks**
4. **Use async where possible**
5. **Plan for data consistency**

### Implementation

1. **Set timeouts** on all external calls
2. **Implement retries** with backoff
3. **Use circuit breakers** for dependencies
4. **Cache aggressively** where appropriate
5. **Log and monitor** all failures

### Operations

1. **Run chaos engineering** tests
2. **Practice disaster recovery**
3. **Monitor SLIs/SLOs**
4. **Automate failover** where possible

## Source Documentation

- Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md)
- SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)