2025-11-30 08:55:02 +08:00

# SAP BTP Resilience Reference
## Overview
Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience.
## Key Resources
| Resource | Description |
|----------|-------------|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |
## Cloud Foundry Resilience
### Availability Zones
**Automatic Distribution:**
- Applications spread across multiple AZs
- No manual configuration required
- Platform handles placement
**During AZ Failure:**
- ~1/3 of instances become unavailable (3-zone deployment)
- Remaining instances handle the increased load
- Cloud Foundry reschedules the lost instances to healthy zones
**Best Practice:**
Configure enough instances that the surviving two-thirds can carry the full load when one of three zones fails:
```
Minimum instances = Normal load instances × 1.5
```
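The 1.5 factor follows from losing one of three zones: the surviving two zones hold 2/3 of the instances, so the total must be 3/2 of the normal-load count. A minimal sketch of that arithmetic, generalized to other zone counts; `minInstances` and its parameters are hypothetical names used only for illustration:

```javascript
// Hypothetical helper: instances needed so the surviving zones can carry the full load.
function minInstances(normalLoadInstances, zones = 3, failedZones = 1) {
  if (failedZones >= zones) {
    throw new RangeError('failedZones must be smaller than zones');
  }
  // The surviving fraction is (zones - failedZones) / zones, so divide by it.
  const required = (normalLoadInstances * zones) / (zones - failedZones);
  return Math.ceil(required); // Instances are whole numbers, so round up
}

console.log(minInstances(4)); // 6
console.log(minInstances(3)); // 5 (4.5 rounded up)
```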
### Instance Configuration
```yaml
# manifest.yml
applications:
  - name: my-app
    instances: 3  # At least 3 for HA
    memory: 512M
    health-check-type: http
    health-check-http-endpoint: /health
```
### Health Checks
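```javascript
// Express health endpoint
// Assumes checkDatabase() and checkMessaging() return (or resolve to)
// a truthy value when the dependency is healthy.
app.get('/health', async (req, res) => {
  let checks;
  try {
    checks = {
      database: await checkDatabase(),
      messaging: await checkMessaging()
    };
  } catch (error) {
    // A check that throws counts as a failed health check
    return res.status(503).json({ status: 'DOWN', error: error.message });
  }
  const healthy = Object.values(checks).every(Boolean);
  // Report 503 when unhealthy so the platform health check can react
  res.status(healthy ? 200 : 503).json({ status: healthy ? 'UP' : 'DOWN', checks });
});
```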
## Kyma Resilience
### Istio Service Mesh
**Features:**
- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing
### Configuration
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
### Pod Distribution
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
```
## ABAP Resilience
### Built-in Features
- Automatic workload distribution
- Work process management
- HANA failover support
- Session management
### Elastic Scaling
Automatic response to load:
- Scale between 1 ACU and configured max
- 0.5 ACU increments
- Metrics-based decisions
## Resilience Patterns
### Circuit Breaker
**Purpose**: Prevent cascading failures
**States:**
1. **Closed**: Normal operation
2. **Open**: Fail fast, skip calls
3. **Half-Open**: Test recovery
**Implementation (CAP - Node.js):**
> **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.
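```javascript
const CircuitBreaker = require('opossum');

// callRemoteService and getCachedData are application-specific functions
const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                 // Fail calls that take longer than 3 s
  errorThresholdPercentage: 50,  // Open the circuit when 50% of calls fail
  resetTimeout: 30000            // Attempt a half-open trial after 30 s
});

// Serve cached data when the call fails or the circuit is open
breaker.fallback(() => getCachedData());

// Inside an async function:
const result = await breaker.fire(serviceParams);
```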
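To make the three states concrete, here is a minimal, dependency-free sketch of the state machine. It is an illustration only, not how `opossum` is implemented: `SimpleCircuitBreaker`, `failureThreshold`, and the injectable `now` clock are hypothetical names, and it counts consecutive failures rather than an error percentage.

```javascript
// Minimal circuit breaker sketch (illustrative; not the opossum implementation).
class SimpleCircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;          // Injectable clock, which keeps the class testable
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit is open'); // Fail fast, skip the call
      }
      this.state = 'HALF_OPEN';             // Let a trial call through
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED';                // Success closes the circuit
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

This sketch does not limit the half-open state to a single concurrent trial call, which a production library would normally handle.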
### Retry with Exponential Backoff
**Purpose**: Handle transient failures
```javascript
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```
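When many clients retry on the same schedule, their retries can arrive in synchronized bursts. One common variation, sketched here as an assumption rather than as part of the original guidance, caps the delay and randomizes it ("full jitter"). `retryWithJitter` and the `baseMs`, `maxDelayMs`, `random`, and `sleep` options are hypothetical names introduced for illustration:

```javascript
// Variation on retryWithBackoff: capped exponential backoff with full jitter.
// All option names are illustrative; random and sleep are injectable for testing.
async function retryWithJitter(fn, {
  maxRetries = 3,
  baseMs = 1000,
  maxDelayMs = 30000,
  random = Math.random,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
} = {}) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error; // Out of attempts
      // Cap the exponential ceiling, then pick a random delay below it
      const ceiling = Math.min(maxDelayMs, baseMs * 2 ** attempt);
      await sleep(Math.floor(random() * ceiling));
    }
  }
}
```

Retrying is only safe for idempotent operations and transient errors; a 4xx validation error, for example, will fail the same way on every attempt.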
### Bulkhead
**Purpose**: Isolate failures
```javascript
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);  // Max 10 concurrent DB calls
const apiPool = Semaphore(20); // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}
```
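The same isolation can be sketched without a dependency, which also shows what the semaphore does internally. This is a minimal illustration, not a replacement for a maintained library; the `Bulkhead` class and its `run` method are hypothetical names:

```javascript
// Dependency-free bulkhead sketch: at most `limit` tasks run concurrently,
// the rest wait in a FIFO queue.
class Bulkhead {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiting = [];
  }

  async run(task) {
    if (this.active >= this.limit) {
      // Wait until a finishing task hands its slot over
      await new Promise((resolve) => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next();       // Transfer the slot directly to the next waiter
      else this.active -= 1;  // Nobody is waiting, so free the slot
    }
  }
}

// Separate pools keep a slow database from starving API calls
const dbBulkhead = new Bulkhead(10);
const apiBulkhead = new Bulkhead(20);
```

Handing the slot directly to the next waiter, instead of decrementing and re-incrementing the counter, prevents a newly arriving call from slipping in ahead of the queue and exceeding the limit.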
### Timeout
**Purpose**: Prevent hanging requests
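```javascript
const timeout = (promise, ms) => {
  let timer;
  const expiry = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  // Clear the timer once either promise settles so it cannot keep the process alive
  return Promise.race([promise, expiry]).finally(() => clearTimeout(timer));
};

// Inside an async function:
const result = await timeout(fetchData(), 5000);
```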
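The `Promise.race` helper above stops the caller from waiting, but the underlying operation keeps running. Where the operation accepts an `AbortSignal` (for example the global `fetch` in Node.js 18 or later), it can be cancelled as well. A minimal sketch, assuming a runtime that supports `AbortController` with an abort reason; `withAbortTimeout` is a hypothetical helper name:

```javascript
// Sketch: give the operation an AbortSignal so it can stop work on timeout.
async function withAbortTimeout(operation, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(new Error('Timeout')), ms);
  try {
    // The operation must honor the signal for cancellation to take effect
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Example usage (the URL is a placeholder):
// const response = await withAbortTimeout(
//   (signal) => fetch('https://example.com/api', { signal }),
//   5000
// );
```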
### Graceful Degradation
**Purpose**: Provide reduced functionality instead of failing
```javascript
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };
    // Final fallback
    return getBasicDetails(id);
  }
}
```
## Multi-Region Architecture
### Active-Passive
```
Region A (Primary)      Region B (Standby)
        ↓                       ↓
     Active                  Standby
        ↓                       ↓
   HANA Cloud          HANA Cloud (Replica)
```
**Failover**: Manual or automated switch
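An automated switch ultimately comes down to a routing decision based on region health. The sketch below shows only that decision, under the assumption that health results are already available from some probe; `selectActiveRegion` and the `name`/`healthy` fields are hypothetical and do not correspond to any SAP BTP API:

```javascript
// Sketch: choose the first healthy region from a priority-ordered list.
// The list order encodes the priority: primary first, standby after it.
function selectActiveRegion(regions) {
  const target = regions.find((region) => region.healthy);
  if (!target) {
    throw new Error('No healthy region available');
  }
  return target.name;
}

console.log(selectActiveRegion([
  { name: 'region-a', healthy: false }, // Primary is down
  { name: 'region-b', healthy: true }   // Standby takes over
])); // region-b
```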
### Active-Active
```
          Global Load Balancer
        ┌─────────┴─────────┐
        ↓                   ↓
    Region A            Region B
        ↓                   ↓
   HANA Cloud          HANA Cloud
        ↓                   ↓
        └─── Replication ───┘
```
**Use Case**: Highest availability requirements
## Monitoring for Resilience
### Key Metrics
| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |
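The table can be encoded directly as data and checked against a metrics sample. A minimal sketch; the metric keys (`errorRate`, `latencyP99Ms`, `circuitBreakerTrips`, `retryRate`) and the `evaluateMetrics` function are hypothetical names, with rates expressed as fractions and latency in milliseconds:

```javascript
// Thresholds from the table above; a metric is breached when it exceeds its limit.
const thresholds = [
  { metric: 'errorRate', limit: 0.01, action: 'Alert, investigate' },
  { metric: 'latencyP99Ms', limit: 2000, action: 'Scale, optimize' },
  { metric: 'circuitBreakerTrips', limit: 0, action: 'Review dependencies' },
  { metric: 'retryRate', limit: 0.05, action: 'Check downstream services' }
];

// Returns the breached metrics together with the recommended action.
function evaluateMetrics(sample) {
  return thresholds
    .filter(({ metric, limit }) => sample[metric] > limit)
    .map(({ metric, action }) => ({ metric, value: sample[metric], action }));
}

const breaches = evaluateMetrics({
  errorRate: 0.02,
  latencyP99Ms: 1500,
  circuitBreakerTrips: 1,
  retryRate: 0.05
});
console.log(breaches.map((b) => b.metric)); // [ 'errorRate', 'circuitBreakerTrips' ]
```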
### Alerting
```yaml
# SAP Alert Notification example
conditions:
  - name: high-error-rate
    propertyKey: error_rate
    predicate: GREATER_THAN
    propertyValue: "0.01"
actions:
  - name: page-oncall
    type: EMAIL
    properties:
      destination: oncall@example.com
```
## Best Practices
### Design
1. **Assume failure** - Everything can fail
2. **Design for graceful degradation**
3. **Implement health checks**
4. **Use async where possible**
5. **Plan for data consistency**
### Implementation
1. **Set timeouts** on all external calls
2. **Implement retries** with backoff
3. **Use circuit breakers** for dependencies
4. **Cache aggressively** where appropriate
5. **Log and monitor** all failures
### Operations
1. **Run chaos engineering** tests
2. **Practice disaster recovery**
3. **Monitor SLIs/SLOs**
4. **Automate failover** where possible
## Source Documentation
- Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md)
- SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)