# SAP BTP Failover and Resilience - Detailed Reference

**Source**: [https://github.com/SAP-docs/btp-best-practices-guide/tree/main/docs/deploy-and-deliver](https://github.com/SAP-docs/btp-best-practices-guide/tree/main/docs/deploy-and-deliver)

---

## High Availability Overview

SAP BTP applications can achieve high availability through multi-region deployment with intelligent traffic routing. This removes a single region as a point of failure and reduces latency for globally distributed users.

---

## Multi-Region Architecture

### Core Concept

```
              Custom Domain URL
                      │
                      ▼
              ┌────────────────┐
              │ Load Balancer  │
              │ (Health Checks)│
              └───────┬────────┘
                      │
         ┌────────────┴────────────┐
         ▼                         ▼
┌─────────────────┐       ┌─────────────────┐
│    Region 1     │       │    Region 2     │
│    (Active)     │       │ (Passive/Active)│
│                 │       │                 │
│  Subaccount A   │       │  Subaccount B   │
│  App Instance   │       │  App Instance   │
└─────────────────┘       └─────────────────┘
```

### Key Benefits

- **Geographic Redundancy**: Regional outages don't interrupt service
- **Intelligent Routing**: Health checks direct traffic to operational instances
- **Unified Access**: Custom domain remains constant during failovers
- **Load Distribution**: Traffic balances across regions
- **Latency Optimization**: Route users to nearest healthy region

---

## Failover Scope

### Supported Application Types

The basic failover guidance applies to:
- SAPUI5 applications
- HTML5 applications
- Applications without data persistence
- Applications without in-memory caching
- Applications with data stored in on-premise back-end systems

**Note**: Applications with cloud-based persistence require additional considerations for data synchronization.

---

## Four Core Failover Principles

### 1. Deploy Across Two Data Centers

**Configuration**: Active/Passive

- Primary data center receives normal traffic
- Secondary data center acts as standby
- Switch to secondary only during primary downtime

**Regional Selection Best Practices**:
- Choose regions near users and backend systems
- Example: Frankfurt and Amsterdam for European users
- Consider two data centers in the same geographic area so that performance stays comparable after a failover
- Review data processing restrictions for cross-region deployment

**Legal Consideration**: Cross-region deployment may create data processing compliance issues. Review Data Protection and Privacy documentation before proceeding.

### 2. Keep Applications Synchronized

**Synchronization Options**:

| Method | Effort | Best For |
|--------|--------|----------|
| Manual | High | Infrequent updates |
| CI/CD Pipeline | Medium | Regular deployments |
| Solution Export + Transport Management | Medium | Neo environments |

#### Manual Synchronization

- Duplicate modifications across both data centers
- Mirror Git repositories
- Allows non-identical applications (reduced functionality in backup)
- Visual differentiation between primary and backup

#### CI/CD Pipeline Synchronization

```yaml
# Pipeline deploys to both regions
stages:
  - name: Build
    steps:
      - build_mta

  - name: Deploy Primary
    steps:
      - deploy_to_region_1

  - name: Deploy Secondary
    steps:
      - deploy_to_region_2
```

- Use Project "Piper" pipelines adapted for multi-deployment
- Parallel deployment to subaccounts in different regions
- Automatic consistency

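The parallel deployment to two regional subaccounts can be sketched in JavaScript as follows. This is a minimal illustration under stated assumptions, not Project "Piper" configuration: `deployToAllRegions`, the injected `deploy` callback, and the target objects (`region`, `subaccount`) are hypothetical names introduced here. In a real pipeline the callback would wrap whatever deployment step the pipeline provides. `Promise.allSettled` is used so that a failure in one region is reported without hiding the result of the other.

```javascript
// Hypothetical helper: deploy one build artifact to several regional targets in parallel.
// `deploy` is an injected async function (artifact, target) => any.
async function deployToAllRegions(artifact, targets, deploy) {
  const results = await Promise.allSettled(
    targets.map(target => deploy(artifact, target))
  );

  const report = results.map((result, i) => ({
    region: targets[i].region,
    ok: result.status === 'fulfilled',
    error: result.status === 'rejected' ? String(result.reason) : null
  }));

  // The regions are only in sync if every deployment succeeded.
  return { inSync: report.every(r => r.ok), report };
}

// Usage with a stub deploy function (no real deployment happens here).
const targets = [
  { region: 'region-1', subaccount: 'subaccount-a' },
  { region: 'region-2', subaccount: 'subaccount-b' }
];
const stubDeploy = async (artifact, target) => `${artifact} -> ${target.subaccount}`;

deployToAllRegions('app.mtar', targets, stubDeploy)
  .then(outcome => console.log(outcome.inSync, outcome.report));
```

If `inSync` is `false`, the report shows which region needs a redeployment before the two applications can be treated as identical.
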
#### Solution Export Wizard + Cloud Transport Management

1. Export changes as MTA archive from primary (Neo)
2. Import via Transport Management Service (Cloud Foundry)
3. Deploy to secondary data center

### 3. Define Failover Detection

**Detection Mechanisms**:

- Response timeout monitoring (e.g., 25 seconds max)
- HTTP status code checking (5xx errors)
- Health endpoint monitoring

**Implementation Options**:
- Manual code implementation
- Rule-based solutions (e.g., Akamai ION)
- Load balancer health checks

**Detection Behavior**:
- Monitor the first HTTP request to the application URL
- Ignore subsequent requests, so that a single failing resource does not trigger a failover
- When a failure is detected, present an HTML down page with a failover link

**Note**: In basic scenarios, "the failover itself is therefore manually performed by the user" via the down page link.

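The detection rules above can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the `outcome` object shape (`isFirstRequest`, `status`, `elapsedMs`), and the backup URL are hypothetical. The 25-second threshold is the example value from the list above.

```javascript
// Hypothetical sketch of the detection rules listed above.
const FIRST_REQUEST_TIMEOUT_MS = 25000; // example threshold from the text

// Decide whether the outcome of a request should lead to the down page.
// `outcome` is an assumed shape: { isFirstRequest, status, elapsedMs }.
// `status` is null when no response arrived at all.
function shouldOfferFailover(outcome, timeoutMs = FIRST_REQUEST_TIMEOUT_MS) {
  // Only the first request to the application URL is monitored.
  if (!outcome.isFirstRequest) return false;

  const timedOut = outcome.status === null || outcome.elapsedMs > timeoutMs;
  const serverError =
    outcome.status !== null && outcome.status >= 500 && outcome.status <= 599;
  return timedOut || serverError;
}

// Build the down page; the user performs the failover by following the link.
function renderDownPage(secondaryUrl) {
  return '<html><body><h1>Application unavailable</h1>' +
    `<p><a href="${secondaryUrl}">Continue in the backup data center</a></p>` +
    '</body></html>';
}

console.log(shouldOfferFailover({ isFirstRequest: true, status: 503, elapsedMs: 120 }));  // true
console.log(shouldOfferFailover({ isFirstRequest: false, status: 503, elapsedMs: 120 })); // false
console.log(shouldOfferFailover({ isFirstRequest: true, status: 200, elapsedMs: 300 }));  // false
console.log(renderDownPage('https://backup.example.com'));
```

Keeping the decision in a pure function makes the threshold and the "first request only" rule easy to test independently of the component that actually intercepts the traffic.
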
### 4. Plan for Failback

**Active/Active Setup**:
- Applications identical in both data centers
- Same functionality everywhere
- Failback happens automatically with the next failover event
- Not mandatory to explicitly return to primary

**Active/Passive Setup**:
- Applications may differ between data centers
- Reduced functionality in backup acceptable
- Failback to primary is mandatory
- Must restore full functionality

**Recommended Failback Approach**:
- User-driven failback model
- Visual differentiation reminds users to switch back
- Allow transactions to complete without interruption
- Prioritize completion over automatic recovery speed

---

## Multi-Region Reference Use Cases

### Available Implementations

| Scenario | Components | Resources |
|----------|------------|-----------|
| **SAP Build Work Zone + Azure Traffic Manager** | Work Zone, Azure | Blog post, GitHub, Discovery Center mission |
| **SAP Build Work Zone + Amazon Route 53** | Work Zone, AWS | Blog post, GitHub |
| **CAP Applications + SAP HANA Cloud** | CAP, HANA Cloud multi-zone | GitHub repository |
| **CAP Applications + Amazon Aurora** | CAP, Aurora read replica | GitHub repository |
| **SAP Cloud Integration + Azure Traffic Manager** | CPI, Azure | GitHub, Discovery Center |

### Architecture: SAP Build Work Zone + Azure Traffic Manager

```
User Request
     │
     ▼
Azure Traffic Manager
(Priority-based routing)
     │
     ├──► Primary Region (Priority 1)
     │    └── SAP Build Work Zone Instance
     │
     └──► Secondary Region (Priority 2)
          └── SAP Build Work Zone Instance (standby)
```

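The priority-based routing shown in the diagram can be illustrated with a short selection function. This sketch shows only the general idea (send traffic to the healthy endpoint with the lowest priority number); it is not Azure Traffic Manager's API or implementation, and `selectEndpoint` and the endpoint shape (`name`, `priority`, `healthy`) are hypothetical.

```javascript
// Hypothetical sketch of priority-based endpoint selection.
// Each endpoint is an assumed shape: { name, priority, healthy }.
function selectEndpoint(endpoints) {
  const healthy = endpoints.filter(e => e.healthy);
  if (healthy.length === 0) return null; // nothing to route to
  // Lowest priority number wins (1 = primary, 2 = secondary, ...).
  return healthy.reduce((best, e) => (e.priority < best.priority ? e : best));
}

const endpoints = [
  { name: 'primary-region', priority: 1, healthy: true },
  { name: 'secondary-region', priority: 2, healthy: true }
];

console.log(selectEndpoint(endpoints).name); // primary-region

endpoints[0].healthy = false; // simulate a failed health check in the primary region
console.log(selectEndpoint(endpoints).name); // secondary-region
```

Because selection is re-evaluated from the current health state, traffic returns to the primary region as soon as it reports healthy again.
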
### Architecture: CAP + HANA Cloud Multi-Zone

```
   Application Load Balancer
               │
      ┌────────┴────────┐
      ▼                 ▼
CAP App (AZ1)     CAP App (AZ2)
      │                 │
      └────────┬────────┘
               ▼
        SAP HANA Cloud
   (Multi-zone replication)
```

---

## Data Backup and Resilience

### SAP-Managed Backups

| Service | Backup Type | Retention | Notes |
|---------|-------------|-----------|-------|
| **SAP HANA Cloud** | Continuous | As configured | Database recovery supported |
| **PostgreSQL (Hyperscaler)** | Point-in-time | 14 days | Restore by creating new instance |
| **Redis** | None | N/A | No persistence support |
| **Object Store** | None | N/A | Use versioning for protection |

### Object Store Protection Mechanisms

- **Object Versioning**: Recover from accidental deletion
- **Expiration Rules**: Automatic version cleanup
- **Deletion Prevention**: AWS S3 buckets, Azure containers

### Runtime-Specific Backup Strategies

| Runtime | Strategy |
|---------|----------|
| **Cloud Foundry** | Multi-AZ replication within region |
| **Kyma** | Managed K8s snapshots (excludes volumes) |
| **Neo** | Cross-region data copies |

### Customer-Managed Backups

**Critical**: SAP doesn't manage backups of service configurations. You are responsible for backing up your service-specific configurations.

**Key Responsibilities**:

| Responsibility | Details |
|----------------|---------|
| **Self-Service Backup** | You must back up service configurations yourself |
| **Service Documentation Review** | Consult each service's documentation for backup capabilities |
| **Service Limitations Awareness** | Some services don't support user-specific configuration backups |
| **Risk Mitigation** | Backup frequency varies by service; prevents accidental data loss |

**Actions Required**:
1. Identify all SAP BTP services currently in use
2. Review service-specific backup documentation
3. Understand which services lack backup capabilities
4. Implement backup strategies for supported configurations
5. Plan accordingly for services without backup features
6. Ensure business continuity plans include config recovery

**Note**: If backup information is unavailable for a service, contact SAP support channels.

---

## Kyma Cluster Sharing and Isolation

### When to Share Clusters

**Recommended**:
- Small teams with modest resource needs
- Development and testing environments
- Non-critical workloads
- Teams with established trust

**Not Recommended**:
- Multiple external customers
- Production workloads with strict isolation
- Untrusted tenants

### Control Plane Isolation Strategies

| Strategy | Description |
|----------|-------------|
| **Namespaces** | Isolate API resources within cluster |
| **RBAC** | Manage permissions within namespaces |
| **Global Resources** | Admin-managed CRDs and global objects |
| **Policy Engines** | Gatekeeper, Kyverno for compliance |
| **Resource Quotas** | Set limits per tenant/namespace |

### Data Plane Isolation Strategies

| Strategy | Description |
|----------|-------------|
| **Network Policies** | Restrict inter-namespace traffic |
| **Centralized Observability** | Cluster-wide metrics tracking |
| **Service Mesh (Istio)** | mTLS, dedicated ingress per tenant |

### Sample Network Policy

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: tenant-a
spec:
  podSelector: {}          # applies to every pod in tenant-a
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Assumes the namespace carries the label name=tenant-a
              name: tenant-a
  # Restricting egress this way also blocks DNS lookups to kube-system
  # unless a separate policy allows them.
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: tenant-a
```

---

## Disaster Recovery Planning

### Recovery Time Objective (RTO)

Depends on:
- Failover detection speed
- Application synchronization lag
- User-driven vs automatic failover

### Recovery Point Objective (RPO)

Depends on:
- Synchronization frequency
- Data persistence strategy
- Backup retention periods

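To make the factors above concrete, the following sketch adds them up in a deliberately simple model. The additive formula, the function names, and all numbers are assumptions made for illustration; they are not SAP guidance, and real estimates need measured values from failover tests.

```javascript
// Simplified, assumed model: worst-case RTO is the sum of the delays on the failover path.
function estimateWorstCaseRtoSeconds({ detectionSeconds, pendingSyncSeconds, switchSeconds }) {
  return detectionSeconds + pendingSyncSeconds + switchSeconds;
}

// Simplified, assumed model: with periodic synchronization,
// at most one synchronization interval of changes is missing in the secondary.
function estimateWorstCaseRpoSeconds({ syncIntervalSeconds }) {
  return syncIntervalSeconds;
}

const rto = estimateWorstCaseRtoSeconds({
  detectionSeconds: 25,  // e.g. a response timeout used for detection
  pendingSyncSeconds: 0, // secondary already up to date
  switchSeconds: 60      // assumed time for a user to follow the failover link
});
const rpo = estimateWorstCaseRpoSeconds({ syncIntervalSeconds: 900 });

console.log(rto); // 85
console.log(rpo); // 900
```

In a user-driven failover the `switchSeconds` term usually dominates, which is one reason to compare user-driven and automatic failover when setting the RTO target.
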
### DR Checklist

- [ ] Define RTO and RPO requirements
- [ ] Select multi-region architecture
- [ ] Implement application synchronization
- [ ] Configure failover detection
- [ ] Test failover procedures regularly
- [ ] Document failback process
- [ ] Train operations team
- [ ] Establish communication protocols

---

## Resilient Application Development

### Design Principles

1. **Statelessness**: Minimize in-memory state
2. **Idempotency**: Safe to retry operations
3. **Circuit Breakers**: Graceful degradation
4. **Health Endpoints**: Enable monitoring
5. **Graceful Shutdown**: Complete in-flight requests

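The circuit breaker principle from the list above can be sketched as a small wrapper around an asynchronous call. This is a minimal sketch, not a library API: the `CircuitBreaker` class, its option names (`failureThreshold`, `resetAfterMs`, `now`), and the default values are assumptions chosen for illustration.

```javascript
// Minimal circuit breaker sketch (hypothetical class, not a library API).
class CircuitBreaker {
  constructor(action, { failureThreshold = 3, resetAfterMs = 10000, now = () => Date.now() } = {}) {
    this.action = action;               // async function to protect
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;                     // injectable clock, useful in tests
    this.failures = 0;
    this.openedAt = null;               // null means the circuit is closed
  }

  async call(...args) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetAfterMs) {
      throw new Error('circuit open');  // fail fast instead of calling the backend
    }
    // Closed, or the reset window has elapsed: allow a (trial) call.
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.openedAt = null;             // success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}

// Usage: a backend that always fails opens the circuit after two failures.
const breaker = new CircuitBreaker(
  async () => { throw new Error('backend down'); },
  { failureThreshold: 2 }
);

(async () => {
  for (let i = 0; i < 3; i++) {
    try { await breaker.call(); } catch (err) { console.log(err.message); }
  }
})();
// prints "backend down" twice, then "circuit open"
```

Callers can catch the `circuit open` error and return a degraded response, which is the graceful degradation the principle refers to. Retrying through the breaker is only safe for idempotent operations.
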
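### Health Endpoint Example

```javascript
const express = require('express');
const app = express();

// Placeholder checks: replace with real probes that return { status: 'UP' | 'DOWN' }.
const checkDatabase = () => ({ status: 'UP' });
const checkCache = () => ({ status: 'UP' });
const checkExternalApi = () => ({ status: 'UP' });

app.get('/health', (req, res) => {
  const checks = {
    database: checkDatabase(),
    cache: checkCache(),
    externalApi: checkExternalApi()
  };

  // Healthy only if every dependency reports UP.
  const allHealthy = Object.values(checks)
    .every(check => check.status === 'UP');

  // A 503 lets load balancer health checks take this instance out of rotation.
  res.status(allHealthy ? 200 : 503)
    .json({ status: allHealthy ? 'UP' : 'DOWN', checks });
});

app.listen(process.env.PORT || 3000);
```

---
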
**Source Documentation**:
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/planning-failover-on-sap-btp-8c46464.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/planning-failover-on-sap-btp-8c46464.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/implementing-failover-df972c5.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/implementing-failover-df972c5.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/deploy-your-application-in-two-data-centers-61d08d8.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/deploy-your-application-in-two-data-centers-61d08d8.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/keep-the-two-applications-in-sync-e6d2bdb.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/keep-the-two-applications-in-sync-e6d2bdb.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/define-how-a-failover-is-detected-88b86db.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/define-how-a-failover-is-detected-88b86db.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/decide-on-the-failback-963f962.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/decide-on-the-failback-963f962.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/multi-region-usecases.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/multi-region-usecases.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/data-backups-managed-by-sap-6c1e071.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/data-backups-managed-by-sap-6c1e071.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/sharing-clusters-in-kyma-57ec1ea.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/sharing-clusters-in-kyma-57ec1ea.md)