SAP BTP Resilience Reference
Overview
Resilient applications remain stable and available, and degrade gracefully when individual components fail. SAP BTP provides patterns and services that support this across its Cloud Foundry, Kyma, and ABAP environments.
Key Resources
| Resource | Description |
|---|---|
| Developing Resilient Apps on SAP BTP | Patterns and examples |
| Route Multi-Region Traffic | GitHub implementation |
| Architecting Multi-Region Resiliency | Discovery Center reference |
Cloud Foundry Resilience
Availability Zones
Automatic Distribution:
- Applications spread across multiple AZs
- No manual configuration required
- Platform handles placement
During AZ Failure:
- Roughly one third of instances become unavailable (in a 3-zone deployment)
- Remaining instances handle the increased load
- Cloud Foundry reschedules the lost instances in healthy zones
Best Practice: Configure sufficient instances to handle load during zone failures:
Minimum instances = Normal load instances × 1.5
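The 1.5 factor follows from a three-zone deployment with evenly spread instances: when one zone fails, two thirds of the instances remain, and those need to carry the normal load. A minimal sketch of that arithmetic, generalized to any zone count; the function name `minInstancesForZoneFailure` is hypothetical and not part of any SAP tooling:

```javascript
// Hypothetical helper: instances needed so that losing one zone
// still leaves enough capacity for the normal load.
// Assumes instances are spread evenly across zones.
function minInstancesForZoneFailure(normalLoadInstances, zones = 3) {
  if (zones < 2) {
    throw new Error('Surviving a zone failure requires at least 2 zones');
  }
  // After one zone fails, (zones - 1) / zones of the instances remain,
  // so the total must be normalLoad * zones / (zones - 1), rounded up.
  return Math.ceil((normalLoadInstances * zones) / (zones - 1));
}

console.log(minInstancesForZoneFailure(4)); // 6 (4 × 1.5)
console.log(minInstancesForZoneFailure(3)); // 5 (4.5 rounded up)
```

Rounding up matters for small deployments: three normal-load instances need five in total, not four.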
Instance Configuration
# manifest.yml
applications:
  - name: my-app
    instances: 3  # At least 3 for HA
    memory: 512M
    health-check-type: http
    health-check-http-endpoint: /health
Health Checks
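// Express health endpoint
// Assumes checkDatabase() and checkMessaging() return (or resolve to) true when healthy.
app.get('/health', async (req, res) => {
  try {
    const checks = {
      database: await checkDatabase(),
      messaging: await checkMessaging()
    };
    const healthy = Object.values(checks).every(Boolean);
    // Answer 503 when a check fails so the platform health check can detect it
    res.status(healthy ? 200 : 503).json({ status: healthy ? 'UP' : 'DOWN', checks });
  } catch (error) {
    res.status(503).json({ status: 'DOWN' });
  }
});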
Kyma Resilience
Istio Service Mesh
Features:
- Automatic retries
- Circuit breakers
- Timeouts
- Load balancing
Configuration
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app-dr
spec:
  host: my-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Pod Distribution
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
ABAP Resilience
Built-in Features
- Automatic workload distribution
- Work process management
- HANA failover support
- Session management
Elastic Scaling
Automatic response to load:
- Scale between 1 ACU and configured max
- 0.5 ACU increments
- Metrics-based decisions
Resilience Patterns
Circuit Breaker
Purpose: Prevent cascading failures
States:
- Closed: Normal operation
- Open: Fail fast, skip calls
- Half-Open: Test recovery
Implementation (CAP - Node.js):
Note: The opossum library shown below is a third-party community package, not SAP-supported. Evaluate its maintenance status, compatibility with your CAP/Node.js versions, and security posture before production use. For Java applications, SAP Cloud SDK integrates with Resilience4j as the official resilience tooling.
const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callRemoteService, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => getCachedData());

const result = await breaker.fire(serviceParams);
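To make the three states concrete without a dependency, the transitions can be sketched as a small class. This is an illustration only, not how opossum is implemented: unlike the `errorThresholdPercentage` option above, it opens after a number of consecutive failures, and it does not limit concurrent trial calls in the half-open state. The class name and options are hypothetical.

```javascript
// Minimal circuit breaker illustrating the Closed / Open / Half-Open states.
class SimpleCircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now; // injectable clock, useful for tests
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'HALF_OPEN'; // reset timeout elapsed: allow a trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED'; // success closes the circuit again
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN'; // trial failed or threshold reached
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

The injectable `now` clock is a design choice for testability: the open-to-half-open transition can be exercised without waiting for the real reset timeout.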
Retry with Exponential Backoff
Purpose: Handle transient failures
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
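The delay in `retryWithBackoff` doubles after each failed attempt (1 s, then 2 s with the default of three attempts). When many clients retry at the same moment, fixed delays can keep their retries synchronized; capping the delay and adding random jitter is a common mitigation. A sketch of such a delay calculation follows; `backoffDelay` and its option names are illustrative and not part of the original example.

```javascript
// Delay in milliseconds before retry number `attempt` (0-based), capped at maxMs.
// With jitter enabled, a random delay in [0, cappedDelay) is returned ("full jitter").
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, jitter = false, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  return jitter ? Math.floor(random() * exponential) : exponential;
}

// Without jitter this matches the two waits of retryWithBackoff with maxRetries = 3.
console.log([0, 1].map((i) => backoffDelay(i))); // [ 1000, 2000 ]
```

Inside `retryWithBackoff`, the line computing `delay` could call `backoffDelay(i, { jitter: true })` instead.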
Bulkhead
Purpose: Isolate failures
const Semaphore = require('semaphore');

const dbPool = Semaphore(10);  // Max 10 concurrent DB calls
const apiPool = Semaphore(20); // Max 20 concurrent API calls

async function callDatabase() {
  return new Promise((resolve, reject) => {
    dbPool.take(() => {
      performDbCall()
        .then(resolve)
        .catch(reject)
        .finally(() => dbPool.leave());
    });
  });
}
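Where adding the `semaphore` package is not desirable, the same isolation can be sketched with a small promise-based limiter. `createBulkhead` is a hypothetical helper written for illustration; it has no queue length limit or wait timeout, which a production bulkhead would likely need.

```javascript
// Hypothetical promise-based bulkhead: at most `limit` tasks run at once,
// the rest wait in a FIFO queue.
function createBulkhead(limit) {
  let active = 0;
  const waiting = [];

  const release = () => {
    active -= 1;
    const next = waiting.shift();
    if (next) next();
  };

  return function run(task) {
    return new Promise((resolve, reject) => {
      const start = () => {
        active += 1;
        // Promise.resolve().then(task) also captures synchronous throws.
        Promise.resolve().then(task).then(resolve, reject).finally(release);
      };
      if (active < limit) start();
      else waiting.push(start);
    });
  };
}

// Separate limits per dependency keep a slow database from starving API calls.
const dbBulkhead = createBulkhead(10);
const apiBulkhead = createBulkhead(20);
```

A call then looks like `dbBulkhead(() => performDbCall())`, which returns a promise for the task's result.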
Timeout
Purpose: Prevent hanging requests
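const timeout = (promise, ms) => {
  let timer;
  const expiry = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  // Clear the timer once either promise settles so it does not keep the process alive
  return Promise.race([promise, expiry]).finally(() => clearTimeout(timer));
};

const result = await timeout(fetchData(), 5000);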
Graceful Degradation
Purpose: Provide reduced functionality instead of failing
async function getProductDetails(id) {
  try {
    // Try full data
    return await getFromPrimaryService(id);
  } catch (error) {
    // Fallback to cached/reduced data
    const cached = await getFromCache(id);
    if (cached) return { ...cached, _degraded: true };
    // Final fallback
    return getBasicDetails(id);
  }
}
Multi-Region Architecture
Active-Passive
Region A (Primary)      Region B (Standby)
        ↓                       ↓
      Active                 Standby
        ↓                       ↓
    HANA Cloud          HANA Cloud (Replica)
Failover: Manual or automated switch
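An automated switch needs a rule for when the primary counts as failed; reacting to a single failed probe risks flapping between regions. The sketch below switches to the standby after a configurable number of consecutive failed health probes. All names (`createFailoverMonitor`, the `'primary'` and `'standby'` labels) are hypothetical, and the actual traffic switch (for example a DNS or load balancer update) is outside its scope.

```javascript
// Hypothetical failover decision: switch to standby after N consecutive
// failed probes of the primary region.
function createFailoverMonitor({ failureThreshold = 3 } = {}) {
  let consecutiveFailures = 0;
  let activeRegion = 'primary';

  return {
    // Feed in the result of each health probe against the primary region.
    recordProbe(primaryHealthy) {
      consecutiveFailures = primaryHealthy ? 0 : consecutiveFailures + 1;
      if (activeRegion === 'primary' && consecutiveFailures >= failureThreshold) {
        activeRegion = 'standby'; // fail over; failback is left as a manual step
      }
      return activeRegion;
    }
  };
}
```

Failback is deliberately not automated in this sketch, since returning to the primary usually depends on the state of data replication.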
Active-Active
        Global Load Balancer
                 ↓
       ┌─────────┴─────────┐
       ↓                   ↓
    Region A            Region B
       ↓                   ↓
   HANA Cloud          HANA Cloud
       ↓                   ↓
       └─── Replication ───┘
Use Case: Highest availability requirements
Monitoring for Resilience
Key Metrics
| Metric | Threshold | Action |
|---|---|---|
| Error rate | > 1% | Alert, investigate |
| Latency p99 | > 2s | Scale, optimize |
| Circuit breaker trips | Any | Review dependencies |
| Retry rate | > 5% | Check downstream services |
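The thresholds in the table can be expressed as data and checked against a metrics snapshot. The sketch below assumes rates are given as fractions (0.01 for 1%) and latency in milliseconds; the field names (`errorRate`, `latencyP99Ms`, and so on) are made up for illustration and do not correspond to a specific SAP monitoring API.

```javascript
// Thresholds taken from the table above; metric field names are assumptions.
const rules = [
  { metric: 'errorRate', exceeds: (v) => v > 0.01, action: 'Alert, investigate' },
  { metric: 'latencyP99Ms', exceeds: (v) => v > 2000, action: 'Scale, optimize' },
  { metric: 'circuitBreakerTrips', exceeds: (v) => v > 0, action: 'Review dependencies' },
  { metric: 'retryRate', exceeds: (v) => v > 0.05, action: 'Check downstream services' }
];

// Returns the actions for every metric that is present and over its threshold.
function evaluateMetrics(snapshot) {
  return rules
    .filter((rule) => rule.metric in snapshot && rule.exceeds(snapshot[rule.metric]))
    .map((rule) => rule.action);
}

console.log(evaluateMetrics({
  errorRate: 0.02,
  latencyP99Ms: 800,
  circuitBreakerTrips: 1,
  retryRate: 0.01
}));
// [ 'Alert, investigate', 'Review dependencies' ]
```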
Alerting
# SAP Alert Notification example
conditions:
  - name: high-error-rate
    propertyKey: error_rate
    predicate: GREATER_THAN
    propertyValue: "0.01"
actions:
  - name: page-oncall
    type: EMAIL
    properties:
      destination: oncall@example.com
Best Practices
Design
- Assume failure: everything can fail
- Design for graceful degradation
- Implement health checks
- Use async where possible
- Plan for data consistency
Implementation
- Set timeouts on all external calls
- Implement retries with backoff
- Use circuit breakers for dependencies
- Cache aggressively where appropriate
- Log and monitor all failures
Operations
- Run chaos engineering tests
- Practice disaster recovery
- Monitor SLIs/SLOs
- Automate failover where possible
Source Documentation
- Developing Resilient Applications: https://github.com/SAP-docs/btp-developer-guide/blob/main/docs/developing-resilient-applications-b1b929a.md
- SAP BTP Resilience Guide: https://help.sap.com/docs/btp/best-practices/developing-resilient-apps