# SAP BTP Resilience Reference

## Overview

Building resilient applications ensures stability, high availability, and graceful degradation during failures. SAP BTP provides patterns and services to achieve enterprise-grade resilience.

## Key Resources

| Resource | Description |
|----------|-------------|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |
## Cloud Foundry Resilience

### Availability Zones

**Automatic Distribution:**
- Applications spread across multiple AZs
- No manual configuration required
- Platform handles placement

**During AZ Failure:**
- ~1/3 of instances become unavailable (3-zone deployment)
- Remaining instances handle the increased load
- Cloud Foundry reschedules to healthy zones

**Best Practice:**
Configure sufficient instances to handle the load during a zone failure:

```
Minimum instances = Normal load instances × 1.5
```
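For example, if 4 instances handle the normal load, run at least 6 (4 × 1.5): with 2 instances per zone, losing one of three zones still leaves 4 healthy instances to carry the full load.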
### Instance Configuration

```yaml
# manifest.yml
applications:
- name: my-app
  instances: 3                           # At least 3 for HA
  memory: 512M
  health-check-type: http
  health-check-http-endpoint: /health
```
### Health Checks

```javascript
// Express health endpoint
// checkDatabase() and checkMessaging() are app-specific probes, assumed to
// resolve to 'UP' or 'DOWN'
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    messaging: await checkMessaging()
  };
  const healthy = Object.values(checks).every(c => c === 'UP');
  res.status(healthy ? 200 : 503).json({ status: healthy ? 'UP' : 'DOWN', checks });
});
```
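Cloud Foundry polls this endpoint (wired up via `health-check-http-endpoint` in the manifest above) and restarts instances whose health checks fail.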
## Kyma Resilience

### Istio Service Mesh

**Features:**
- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing

### Configuration

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
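The DestinationRule above covers connection pooling and outlier detection (circuit breaking). The retries and timeouts listed under Features are typically configured on a VirtualService; a minimal sketch, assuming the same `my-app` service with default routing:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app
    timeout: 5s                # overall request deadline
    retries:
      attempts: 3              # retry transient failures up to 3 times
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure,retriable-status-codes
```

Keep the overall `timeout` large enough to accommodate the configured retry attempts.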
### Pod Distribution

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app   # must match the labels on the pod template
```
## ABAP Resilience

### Built-in Features

- Automatic workload distribution
- Work process management
- HANA failover support
- Session management

### Elastic Scaling

Automatic response to load:
- Scale between 1 ACU and configured max
- 0.5 ACU increments
- Metrics-based decisions
## Resilience Patterns

### Circuit Breaker

**Purpose**: Prevent cascading failures

**States:**
1. **Closed**: Normal operation
2. **Open**: Fail fast, skip calls
3. **Half-Open**: Test recovery

**Implementation (CAP - Node.js):**

> **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.

```javascript
const CircuitBreaker = require('opossum');

// callRemoteService is the guarded call; getCachedData supplies fallback data
const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                 // treat calls slower than 3 s as failures
  errorThresholdPercentage: 50,  // open the circuit when 50% of calls fail
  resetTimeout: 30000            // attempt a half-open probe after 30 s
});

breaker.fallback(() => getCachedData());

const result = await breaker.fire(serviceParams);
```
### Retry with Exponential Backoff

**Purpose**: Handle transient failures

```javascript
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;  // 1 s, 2 s, 4 s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```
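A usage sketch, with `fetchExchangeRates` standing in for any remote call that can fail transiently:

```javascript
const rates = await retryWithBackoff(() => fetchExchangeRates());
```

If many clients retry in lockstep, consider adding random jitter to the computed delay.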
### Bulkhead

**Purpose**: Isolate failures
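> **Note**: Like `opossum` above, the `semaphore` package used below is a third-party community library rather than an SAP-supported component; review its maintenance status before production use.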
```javascript
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);   // Max 10 concurrent DB calls
const apiPool = Semaphore(20);  // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}
```
### Timeout

**Purpose**: Prevent hanging requests

```javascript
const timeout = (promise, ms) => {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), ms)
    )
  ]);
};

const result = await timeout(fetchData(), 5000);
```
### Graceful Degradation

**Purpose**: Provide reduced functionality instead of failing

```javascript
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };

    // Final fallback
    return getBasicDetails(id);
  }
}
```
## Multi-Region Architecture

### Active-Passive

```
Region A (Primary)      Region B (Standby)
        ↓                       ↓
      Active                 Standby
        ↓                       ↓
   HANA Cloud          HANA Cloud (Replica)
```

**Failover**: Manual or automated switch

### Active-Active

```
         Global Load Balancer
                  ↓
        ┌─────────┴─────────┐
        ↓                   ↓
    Region A            Region B
        ↓                   ↓
   HANA Cloud          HANA Cloud
        ↓                   ↓
        └──── Replication ──┘
```

**Use Case**: Highest availability requirements
## Monitoring for Resilience

### Key Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |

### Alerting

```yaml
# SAP Alert Notification example
conditions:
- name: high-error-rate
  propertyKey: error_rate
  predicate: GREATER_THAN
  propertyValue: "0.01"

actions:
- name: page-oncall
  type: EMAIL
  properties:
    destination: oncall@example.com
```
## Best Practices

### Design

1. **Assume failure** - Everything can fail
2. **Design for graceful degradation**
3. **Implement health checks**
4. **Use async where possible**
5. **Plan for data consistency**

### Implementation

1. **Set timeouts** on all external calls
2. **Implement retries** with backoff
3. **Use circuit breakers** for dependencies
4. **Cache aggressively** where appropriate
5. **Log and monitor** all failures

### Operations

1. **Run chaos engineering** tests
2. **Practice disaster recovery**
3. **Monitor SLIs/SLOs**
4. **Automate failover** where possible

## Source Documentation

- Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md)
- SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)