Initial commit

Zhongwei Li
2025-11-30 08:55:02 +08:00
commit 6942e32e6b
26 changed files with 7173 additions and 0 deletions

references/resilience.md

# SAP BTP Resilience Reference
## Overview
Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience.
## Key Resources
| Resource | Description |
|----------|-------------|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |
## Cloud Foundry Resilience
### Availability Zones
**Automatic Distribution:**

- Applications are spread across multiple availability zones (AZs)
- No manual configuration required
- The platform handles placement

**During AZ Failure:**

- Roughly one third of instances become unavailable (in a 3-zone deployment)
- Remaining instances handle the increased load
- Cloud Foundry reschedules the lost instances to healthy zones

**Best Practice:**

Configure enough instances to carry the load during a zone failure. With three zones, losing one leaves two thirds of the instances running, so the normal-load count is scaled by 3/2:
```
Minimum instances = Normal load instances × 1.5
```
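The sizing rule above can be generalized to other zone counts. The following is a minimal sketch that assumes instances are spread evenly across zones; `minInstances`, `zones`, and `failedZones` are illustrative names, not platform settings.

```javascript
// Sketch: instances needed so the surviving zones can carry the normal load.
// Assumes an even spread of instances across zones (an assumption, not a guarantee).
function minInstances(normalLoadInstances, zones = 3, failedZones = 1) {
  if (failedZones >= zones) {
    throw new RangeError('At least one zone must survive');
  }
  // After the failure only (zones - failedZones) / zones of the instances remain,
  // so scale the normal-load count by zones / (zones - failedZones).
  return Math.ceil((normalLoadInstances * zones) / (zones - failedZones));
}

console.log(minInstances(4)); // 6
console.log(minInstances(3)); // 5
```

With the defaults (three zones, one failure) this reduces to the 1.5 factor, rounded up to a whole instance.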
### Instance Configuration
```yaml
# manifest.yml
applications:
  - name: my-app
    instances: 3 # At least 3 for HA
    memory: 512M
    health-check-type: http
    health-check-http-endpoint: /health
```
### Health Checks
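```javascript
// Express health endpoint
// checkDatabase() and checkMessaging() are application-specific helpers,
// assumed to return (or resolve to) true when the dependency is reachable.
app.get('/health', async (req, res) => {
  const [database, messaging] = await Promise.all([
    Promise.resolve().then(checkDatabase).catch(() => false),
    Promise.resolve().then(checkMessaging).catch(() => false)
  ]);
  const healthy = Boolean(database && messaging);
  const health = {
    status: healthy ? 'UP' : 'DOWN',
    checks: { database, messaging }
  };
  // Return 503 so the platform health check can detect an unhealthy instance
  res.status(healthy ? 200 : 503).json(health);
});
```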
## Kyma Resilience
### Istio Service Mesh
**Features:**
- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing
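### Configuration
The `DestinationRule` below covers connection pool limits and outlier detection (Istio's form of circuit breaking). Retries and timeouts are configured separately on a `VirtualService`.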
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
### Pod Distribution
```yaml
# Fragment: metadata, selector, pod labels, and containers are omitted
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
```
## ABAP Resilience
### Built-in Features
- Automatic workload distribution
- Work process management
- HANA failover support
- Session management
### Elastic Scaling
Automatic response to load:
- Scale between 1 ACU and configured max
- 0.5 ACU increments
- Metrics-based decisions
## Resilience Patterns
### Circuit Breaker
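**Purpose**: Prevent cascading failures by cutting off calls to a failing dependency

**States:**

1. **Closed**: Normal operation; calls pass through
2. **Open**: Fail fast; calls are skipped
3. **Half-Open**: Trial calls test whether the dependency has recovered

**Implementation (CAP - Node.js):**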
> **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.
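```javascript
const CircuitBreaker = require('opossum');

// callRemoteService and getCachedData are application-specific functions
const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                // Fail calls that take longer than 3 s
  errorThresholdPercentage: 50, // Open the circuit at 50% failed calls
  resetTimeout: 30000           // Move to half-open after 30 s
});
breaker.fallback(() => getCachedData());

// Inside an async function:
const result = await breaker.fire(serviceParams);
```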
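To make the three states concrete, here is a minimal hand-rolled sketch of the state machine. It is not how `opossum` is implemented: it trips on a count of consecutive failures rather than an error percentage, and `SimpleBreaker`, `failureThreshold`, and the injectable `now` clock are illustrative names.

```javascript
// Minimal circuit breaker sketch: illustrates the Closed/Open/Half-Open states only.
class SimpleBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;              // injectable clock, handy for tests
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'HALF_OPEN';  // reset timeout elapsed: allow a trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED';     // success closes the circuit again
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';     // trip (or re-trip) the breaker
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

This sketch does not limit the number of concurrent trial calls in the half-open state and has no fallback or per-call timeout, which is part of what a maintained library adds.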
### Retry with Exponential Backoff
**Purpose**: Handle transient failures
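```javascript
// maxRetries is the total number of attempts, including the first call
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Last attempt failed: give up and surface the error
      if (i === maxRetries - 1) throw error;
      // Wait 1 s, 2 s, 4 s, ... before the next attempt
      const delay = Math.pow(2, i) * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```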
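When many clients retry on the same schedule, their retries can arrive in synchronized bursts. A common variation caps the delay and randomizes it ("full jitter"). The sketch below is one way to do that; `backoffDelay`, `retryWithJitter`, and their options (`baseMs`, `maxMs`, `random`, `sleep`, `maxAttempts`) are illustrative names, and the injectable `random` and `sleep` exist only to make the behavior easy to test.

```javascript
// Sketch: exponential backoff with a cap and full jitter.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling); // uniform in [0, ceiling)
}

async function retryWithJitter(fn, { maxAttempts = 3, sleep, ...delayOptions } = {}) {
  const wait = sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // out of attempts
      await wait(backoffDelay(attempt, delayOptions));
    }
  }
}
```

In practice it usually also makes sense to retry only errors that are likely transient (for example timeouts or HTTP 503) rather than every failure.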
### Bulkhead
**Purpose**: Isolate failures
```javascript
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);  // Max 10 concurrent DB calls
const apiPool = Semaphore(20); // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}
```
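The same isolation can be sketched without an external package, which also makes the mechanism visible: each dependency gets its own fixed number of slots, and callers beyond that limit wait in a queue. `createBulkhead` and `limit` are illustrative names, not part of any library.

```javascript
// Sketch of a dependency-free bulkhead: at most `limit` tasks run at once.
function createBulkhead(limit) {
  let active = 0;
  const waiting = [];

  const release = () => {
    const next = waiting.shift();
    if (next) next();   // hand the slot directly to the next waiting caller
    else active -= 1;   // nobody waiting: free the slot
  };

  return async function run(task) {
    if (active < limit) {
      active += 1;
    } else {
      // Queue until release() hands over a slot (active stays unchanged)
      await new Promise(resolve => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}

const runDbCall = createBulkhead(10);  // separate pool per dependency
const runApiCall = createBulkhead(20);
```

A call such as `runDbCall(() => performDbCall())` then shares the ten database slots without affecting the API pool. The queue in this sketch is unbounded; a production version would typically cap it or reject when full.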
### Timeout
**Purpose**: Prevent hanging requests
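```javascript
const timeout = (promise, ms) => {
  let timer;
  const expiry = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  // Clear the timer once either promise settles so it does not linger
  return Promise.race([promise, expiry]).finally(() => clearTimeout(timer));
};

// Inside an async function:
const result = await timeout(fetchData(), 5000);
```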
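Racing against a timer only stops the caller from waiting; the underlying operation keeps running. Where the operation accepts an `AbortSignal`, it can be cancelled as well. The following is a minimal sketch, with `withAbortTimeout` as an illustrative name; it only helps if the operation actually honors the signal.

```javascript
// Sketch: cancel the underlying operation when the deadline passes.
// `operation` receives an AbortSignal and is expected to reject once it is aborted.
async function withAbortTimeout(operation, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer); // avoid a lingering timer after completion
  }
}

// Usage, assuming a runtime with a global fetch (for example Node.js 18+):
// const response = await withAbortTimeout(signal => fetch(url, { signal }), 5000);
```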
### Graceful Degradation
**Purpose**: Provide reduced functionality instead of failing
```javascript
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };
    // Final fallback
    return getBasicDetails(id);
  }
}
```
## Multi-Region Architecture
### Active-Passive
```
Region A (Primary)      Region B (Standby)
        ↓                       ↓
     Active                  Standby
        ↓                       ↓
   HANA Cloud          HANA Cloud (Replica)
```
**Failover**: Manual or automated switch
### Active-Active
```
        Global Load Balancer
      ┌─────────┴─────────┐
      ↓                   ↓
   Region A            Region B
      ↓                   ↓
  HANA Cloud          HANA Cloud
      ↓                   ↓
      └─── Replication ───┘
```
**Use Case**: Highest availability requirements
## Monitoring for Resilience
### Key Metrics
| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |
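
As an illustration of how the thresholds in the table could be evaluated, the sketch below computes error rate, retry rate, and p99 latency over a window of request samples. The sample shape (`durationMs`, `failed`, `retried`) and the function names are assumptions for this example; circuit breaker trips are event-based and are not covered here.

```javascript
// Sketch: check a window of request samples against the thresholds above.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return 0;
  // Nearest-rank method on an ascending array
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(0, rank - 1)];
}

function evaluateWindow(samples) {
  const total = samples.length;
  if (total === 0) return { errorRate: 0, retryRate: 0, p99Ms: 0, alerts: [] };

  const errorRate = samples.filter(s => s.failed).length / total;
  const retryRate = samples.filter(s => s.retried).length / total;
  const durations = samples.map(s => s.durationMs).sort((a, b) => a - b);
  const p99Ms = percentile(durations, 99);

  const alerts = [];
  if (errorRate > 0.01) alerts.push('error-rate');   // > 1%
  if (p99Ms > 2000) alerts.push('latency-p99');      // > 2 s
  if (retryRate > 0.05) alerts.push('retry-rate');   // > 5%
  return { errorRate, retryRate, p99Ms, alerts };
}
```

In a real deployment these values would normally come from the monitoring backend rather than being computed in the application.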
### Alerting
```yaml
# SAP Alert Notification example
conditions:
  - name: high-error-rate
    propertyKey: error_rate
    predicate: GREATER_THAN
    propertyValue: "0.01"
actions:
  - name: page-oncall
    type: EMAIL
    properties:
      destination: oncall@example.com
```
## Best Practices
### Design
1. **Assume failure** - Everything can fail
2. **Design for graceful degradation**
3. **Implement health checks**
4. **Use async where possible**
5. **Plan for data consistency**
### Implementation
1. **Set timeouts** on all external calls
2. **Implement retries** with backoff
3. **Use circuit breakers** for dependencies
4. **Cache aggressively** where appropriate
5. **Log and monitor** all failures
### Operations
1. **Run chaos engineering** tests
2. **Practice disaster recovery**
3. **Monitor SLIs/SLOs**
4. **Automate failover** where possible
## Source Documentation
- Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md)
- SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)