# SAP BTP Resilience Reference

## Overview

Resilient applications stay stable and highly available, and degrade gracefully instead of failing outright when components fail. SAP BTP provides patterns and services for building enterprise-grade resilience.

## Key Resources

| Resource | Description |
|----------|-------------|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |

## Cloud Foundry Resilience

### Availability Zones

**Automatic Distribution:**

- Application instances are spread across multiple AZs
- No manual configuration is required
- The platform handles placement

**During an AZ Failure:**

- Roughly one-third of the instances become unavailable (in a 3-zone deployment)
- The remaining instances handle the increased load
- Cloud Foundry reschedules the lost instances to healthy zones

**Best Practice:**

Configure enough instances to carry the normal load even while one zone is down. In a 3-zone deployment, losing one zone leaves two-thirds of the instances, so provision 1.5 times the instances needed for normal load, rounded up:

```
Minimum instances = Normal load instances × 1.5
```

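The 1.5 factor comes from the three-zone case. A minimal sketch of the same calculation, assuming the platform spreads instances as evenly as possible across zones (an illustrative assumption, not a documented placement guarantee). `minInstancesForZoneFailure` is a hypothetical helper that searches for the smallest count that still leaves the normal-load count running after the fullest zone is lost:

```javascript
// Smallest instance count that still leaves `normalInstances` running after the
// zone holding the most instances fails. Assumes instances are spread as evenly
// as possible across `zones` (an illustrative assumption, not a CF guarantee).
function minInstancesForZoneFailure(normalInstances, zones = 3) {
  if (zones < 2) throw new RangeError('Surviving a zone failure needs at least 2 zones');
  let total = normalInstances;
  // With even spreading, the fullest zone holds ceil(total / zones) instances
  while (total - Math.ceil(total / zones) < normalInstances) total++;
  return total;
}

console.log(minInstancesForZoneFailure(2)); // 3
console.log(minInstancesForZoneFailure(3)); // 5 (3 × 1.5 = 4.5, rounded up)
console.log(minInstancesForZoneFailure(4)); // 6
```

For three zones this reproduces the 1.5 rule; with two zones the same logic doubles the normal-load count.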
### Instance Configuration

```yaml
# manifest.yml
applications:
  - name: my-app
    instances: 3 # At least 3 for HA
    memory: 512M
    health-check-type: http
    health-check-http-endpoint: /health
```

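### Health Checks

Cloud Foundry restarts an instance whose health check fails. Reporting each dependency's state in the response body, while returning HTTP 200 as long as the process itself can serve requests, keeps a database or messaging outage from restarting otherwise healthy instances.

```javascript
// Express health endpoint. checkDatabase() and checkMessaging() are
// application-specific and assumed to return promises resolving to true/false.
app.get('/health', async (req, res) => {
  const [dbOk, msgOk] = await Promise.all([
    checkDatabase().catch(() => false),
    checkMessaging().catch(() => false)
  ]);
  const health = {
    status: dbOk && msgOk ? 'UP' : 'DEGRADED',
    checks: {
      database: dbOk ? 'UP' : 'DOWN',
      messaging: msgOk ? 'UP' : 'DOWN'
    }
  };
  // Keep 200 so a downstream outage does not make Cloud Foundry restart
  // otherwise healthy instances; dependency status is reported in the body.
  res.status(200).json(health);
});
```
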
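A dependency check that hangs can stall the health endpoint itself. One way to guard against that is to bound each check with a timeout and map failures to `'DOWN'`. This is a minimal sketch; `checkWithTimeout` and its 500 ms default are illustrative assumptions, not SAP or Cloud Foundry defaults:

```javascript
// Resolves to 'UP' if checkFn resolves truthy within `ms`, otherwise 'DOWN'
// (falsy result, rejection, synchronous throw, or timeout). checkFn is any
// caller-supplied function, e.g. a hypothetical database ping.
async function checkWithTimeout(checkFn, ms = 500) {
  let timer;
  const timedOut = new Promise(resolve => {
    timer = setTimeout(() => resolve('DOWN'), ms);
  });
  const check = Promise.resolve()
    .then(() => checkFn()) // turns synchronous throws into rejections
    .then(ok => (ok ? 'UP' : 'DOWN'), () => 'DOWN');
  try {
    return await Promise.race([check, timedOut]);
  } finally {
    clearTimeout(timer); // do not keep the timer alive after the race settles
  }
}
```

Inside a handler this could be used as `await checkWithTimeout(checkDatabase)`, where `checkDatabase` is the application's own check function.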
## Kyma Resilience

### Istio Service Mesh

**Features:**

- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing

### Configuration

In Istio, circuit breaking is configured on a `DestinationRule` (connection-pool limits and outlier detection), while retries and timeouts are set on a `VirtualService`. The rule below limits connections and pending requests to `my-app` and temporarily ejects an endpoint from the load-balancing pool after 5 consecutive 5xx errors, ejecting at most 50% of endpoints at once.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

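### Pod Distribution

`maxSkew: 1` keeps the number of replicas per zone within one of each other; with `whenUnsatisfiable: DoNotSchedule`, a pod that would break that rule stays Pending instead of being scheduled.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app # must match the selector and the spread constraint
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: my-app:latest # placeholder image
```
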
## ABAP Resilience

### Built-in Features

- Automatic workload distribution
- Work process management
- HANA failover support
- Session management

### Elastic Scaling

The ABAP environment responds to load automatically:

- Scales between 1 ACU (ABAP Compute Unit) and a configured maximum
- Adjusts capacity in 0.5 ACU increments
- Bases scaling decisions on system metrics

## Resilience Patterns

### Circuit Breaker

**Purpose**: Prevent cascading failures by stopping calls to a dependency that keeps failing

**States:**

1. **Closed**: Normal operation; calls pass through and failures are tracked
2. **Open**: Fail fast; calls are rejected without reaching the dependency
3. **Half-Open**: After a reset timeout, a trial call tests whether the dependency has recovered

**Implementation (CAP - Node.js):**

> **Note**: The `opossum` library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.

```javascript
const CircuitBreaker = require('opossum');

// callRemoteService, getCachedData and serviceParams are application-specific
const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,                // count calls slower than 3 s as failures
  errorThresholdPercentage: 50, // open the circuit when 50% of calls fail
  resetTimeout: 30000           // after 30 s, let a trial call through (half-open)
});

// Served instead of an error while the circuit is open or when a call fails
breaker.fallback(() => getCachedData());

// Inside an async function (or an ES module with top-level await)
const result = await breaker.fire(serviceParams);
```

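To make the three states concrete without a third-party dependency, here is a minimal circuit breaker sketch. Unlike opossum's failure percentage, it opens after a fixed number of consecutive failures; the class name and options are hypothetical, and the injectable clock exists only to make the logic easy to test:

```javascript
// Minimal circuit breaker sketch (not a port of any library).
// States: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN.
class SimpleCircuitBreaker {
  constructor(action, { failureThreshold = 5, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;        // injectable clock
    this.state = 'CLOSED';
    this.failures = 0;     // consecutive failures
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'HALF_OPEN'; // reset timeout elapsed: allow a trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED';    // any success closes the circuit
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';    // trial failed or threshold reached
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

In this sketch every call that arrives while the circuit is half-open is let through; production libraries typically restrict how many trial calls run.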
### Retry with Exponential Backoff

**Purpose**: Handle transient failures. Retry only operations that are safe to repeat (idempotent).

```javascript
// maxAttempts counts the first call, so the defaults make up to 3 attempts
// with waits of 1 s and 2 s between them
async function retryWithBackoff(fn, maxAttempts = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // no attempts left
      const delay = baseDelayMs * 2 ** attempt;      // 1 s, 2 s, 4 s, ...
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

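Many clients retrying on the same schedule can hit a recovering service at the same moments. A common refinement is "full jitter", where each wait is a random value between zero and the exponential delay, combined with a delay cap and a check that skips errors not worth retrying. A sketch under those assumptions; `retryWithJitter` and `isRetryable` are hypothetical names:

```javascript
// Exponential backoff with full jitter: each wait is random in [0, cap), where
// cap doubles per attempt up to maxDelayMs. isRetryable lets the caller stop
// early, e.g. for validation errors that will never succeed.
async function retryWithJitter(fn, {
  maxAttempts = 3,
  baseDelayMs = 1000,
  maxDelayMs = 10000,
  isRetryable = () => true
} = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      const delay = Math.random() * cap;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

The randomness trades predictable timing for fewer synchronized retry bursts against the downstream service.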
### Bulkhead

**Purpose**: Isolate failures by capping the resources each dependency can consume

```javascript
// `semaphore` is a third-party npm package: Semaphore(n) allows n holders at once
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);  // Max 10 concurrent DB calls
const apiPool = Semaphore(20); // Max 20 concurrent API calls (used the same way)

// performDbCall is application-specific and returns a promise
function callDatabase(...args) {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall(...args)
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave()); // always release the slot
    });
  });
}
```

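The same idea can be expressed without a third-party package: a small class that limits concurrency per dependency and rejects calls once its wait queue is full, so callers fail fast instead of piling up behind a slow dependency. A sketch; `Bulkhead` and its parameters are hypothetical:

```javascript
// At most `maxConcurrent` tasks run at once, at most `maxQueued` wait, and any
// further call is rejected immediately.
class Bulkhead {
  constructor(maxConcurrent, maxQueued = Infinity) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
    this.active = 0;
    this.queue = [];
  }

  async run(task) {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueued) {
        throw new Error('Bulkhead full: rejecting call');
      }
      await new Promise(resolve => this.queue.push(resolve)); // wait for a slot
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next();   // hand the slot directly to the next waiter
      else this.active--; // nobody waiting: free the slot
    }
  }
}
```

Usage could look like `const dbBulkhead = new Bulkhead(10, 50); await dbBulkhead.run(() => performDbCall());`, with one instance per dependency so that exhausting one does not starve the others.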
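### Timeout

**Purpose**: Prevent hanging requests

```javascript
// Rejects if `promise` has not settled within `ms`. The timer is always cleared,
// but the underlying operation is not cancelled; it keeps running.
const timeout = (promise, ms) => {
  let timer;
  return Promise.race([
    promise,
    new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error(`Timeout after ${ms} ms`)), ms);
    })
  ]).finally(() => clearTimeout(timer));
};

// fetchData is application-specific; call from an async function
const result = await timeout(fetchData(), 5000);
```
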
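The `Promise.race` approach stops waiting but does not stop the underlying work. When an operation accepts an `AbortSignal`, as the global `fetch` in Node.js 18+ does, the timeout can cancel it instead. A minimal sketch; `withAbortTimeout` is a hypothetical helper:

```javascript
// Aborts the operation's signal after `ms`. This only cancels work if the
// operation honours the signal; `operation` is a hypothetical
// (signal) => Promise supplied by the caller.
async function withAbortTimeout(operation, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer); // no abort once the operation has settled
  }
}

// Usage with the global fetch of Node.js 18+ (url is a placeholder):
// const response = await withAbortTimeout(signal => fetch(url, { signal }), 5000);
```

Recent Node.js versions also provide `AbortSignal.timeout(ms)`, which creates a signal that aborts on its own after the given time.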
### Graceful Degradation

**Purpose**: Provide reduced functionality instead of failing

```javascript
// getFromPrimaryService, getFromCache and getBasicDetails are application-specific
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fall back to cached data; a cache error is treated as a miss
    const cached = await getFromCache(id).catch(() => null);
    if (cached) return { ...cached, _degraded: true };

    // Final fallback: minimal data
    return getBasicDetails(id);
  }
}
```

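The same fallback chain can be written once and reused: try each source in order, skip empty results such as cache misses, and mark anything other than the primary result as degraded. A sketch; `firstAvailable` and the source functions passed to it are hypothetical:

```javascript
// Returns the first non-null result from `sources` (functions returning
// promises). Results from any source but the first get `_degraded: true`.
async function firstAvailable(sources) {
  let lastError;
  for (let i = 0; i < sources.length; i++) {
    try {
      const value = await sources[i]();
      if (value == null) continue; // e.g. cache miss: try the next source
      return i === 0 ? value : { ...value, _degraded: true };
    } catch (error) {
      lastError = error;           // remember it and fall through
    }
  }
  throw lastError ?? new Error('No source returned a value');
}
```

For example, `firstAvailable([() => getFromPrimaryService(id), () => getFromCache(id), () => getBasicDetails(id)])`. In this variant every fallback result, including the final basic details, carries the `_degraded` flag.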
## Multi-Region Architecture

### Active-Passive

```
Region A (Primary)      Region B (Standby)
        ↓                       ↓
      Active                 Standby
        ↓                       ↓
    HANA Cloud         HANA Cloud (Replica)
```

**Failover**: Traffic is switched from Region A to Region B, manually or automatically, when the primary region fails.

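For the automated case, one common approach is to switch only after several consecutive failed health probes, so that a single blip does not trigger failover. A minimal sketch of that decision logic under stated assumptions: region names and probe results are hypothetical inputs, and the actual rerouting mechanism (DNS, global load balancer) is not shown:

```javascript
// Decides which region should receive traffic. Switches to the standby after
// `threshold` consecutive failed probes of the primary; never fails back on its own.
class FailoverController {
  constructor(primary, standby, threshold = 3) {
    this.primary = primary;
    this.standby = standby;
    this.threshold = threshold;
    this.consecutiveFailures = 0;
    this.active = primary;
  }

  // Record one health probe of the primary; returns the region to route to.
  recordPrimaryProbe(healthy) {
    this.consecutiveFailures = healthy ? 0 : this.consecutiveFailures + 1;
    if (this.active === this.primary && this.consecutiveFailures >= this.threshold) {
      this.active = this.standby;
    }
    return this.active;
  }
}
```

The sketch deliberately does not fail back automatically; returning to the primary usually involves checking the data replication state first.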
### Active-Active

```
          Global Load Balancer
                   ↓
         ┌─────────┴─────────┐
         ↓                   ↓
     Region A            Region B
         ↓                   ↓
    HANA Cloud          HANA Cloud
         ↓                   ↓
         └─── Replication ───┘
```

**Use Case**: Highest availability requirements

## Monitoring for Resilience

### Key Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2 s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |

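A sketch of evaluating one monitoring window against the thresholds in the table above. The sample shape (`ok`, `latencyMs`, `retries`), the definition of retry rate as the share of requests that needed at least one retry, and the nearest-rank p99 are assumptions for illustration; in practice these values usually come from a monitoring service rather than hand-written code:

```javascript
// Returns the actions triggered by one window of request samples.
// Sample shape { ok, latencyMs, retries } is a hypothetical format.
function evaluateWindow(samples, circuitBreakerTrips = 0) {
  const n = samples.length;
  if (n === 0) return [];
  const errorRate = samples.filter(s => !s.ok).length / n;
  const retryRate = samples.filter(s => s.retries > 0).length / n;
  const latencies = samples.map(s => s.latencyMs).sort((a, b) => a - b);
  const p99 = latencies[Math.ceil(0.99 * n) - 1]; // nearest-rank percentile

  const alerts = [];
  if (errorRate > 0.01) alerts.push('Error rate > 1%: alert, investigate');
  if (p99 > 2000) alerts.push('Latency p99 > 2 s: scale, optimize');
  if (circuitBreakerTrips > 0) alerts.push('Circuit breaker tripped: review dependencies');
  if (retryRate > 0.05) alerts.push('Retry rate > 5%: check downstream services');
  return alerts;
}
```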
### Alerting

```yaml
# SAP Alert Notification example (simplified). Property keys must match fields
# of the events your application sends, and a subscription (not shown) must
# link the condition to the action before any alert is delivered.
conditions:
  - name: high-error-rate
    propertyKey: error_rate
    predicate: GREATER_THAN
    propertyValue: "0.01"

actions:
  - name: page-oncall
    type: EMAIL
    properties:
      destination: oncall@example.com
```

## Best Practices

### Design

1. **Assume failure**: anything can fail
2. **Design for graceful degradation**
3. **Implement health checks**
4. **Use asynchronous communication** (such as messaging) where possible
5. **Plan for data consistency**

### Implementation

1. **Set timeouts** on all external calls
2. **Implement retries** with backoff
3. **Use circuit breakers** for dependencies
4. **Cache aggressively** where appropriate
5. **Log and monitor** all failures

### Operations

1. **Run chaos engineering** tests
2. **Practice disaster recovery**
3. **Monitor SLIs/SLOs**
4. **Automate failover** where possible

## Source Documentation

- Developing Resilient Applications: [https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md](https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md)
- SAP BTP Resilience Guide: [https://help.sap.com/docs/btp/best-practices/developing-resilient-apps](https://help.sap.com/docs/btp/best-practices/developing-resilient-apps)