# SAP BTP Failover and Resilience - Detailed Reference

**Source**: [https://github.com/SAP-docs/btp-best-practices-guide/tree/main/docs/deploy-and-deliver](https://github.com/SAP-docs/btp-best-practices-guide/tree/main/docs/deploy-and-deliver)

---

## High Availability Overview

SAP BTP applications can achieve high availability by deploying to multiple regions and routing traffic according to health checks. This removes a single region as a single point of failure and can reduce latency for globally distributed users.

---

## Multi-Region Architecture

### Core Concept

```
                    Custom Domain URL
                           │
                           ▼
                  ┌─────────────────┐
                  │  Load Balancer  │
                  │ (Health Checks) │
                  └────────┬────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
    ┌─────────────────┐       ┌─────────────────┐
    │  Region 1       │       │  Region 2       │
    │  (Active)       │       │ (Passive/Active)│
    │                 │       │                 │
    │  Subaccount A   │       │  Subaccount B   │
    │  App Instance   │       │  App Instance   │
    └─────────────────┘       └─────────────────┘
```

### Key Benefits

- **Geographic Redundancy**: Regional outages don't interrupt service
- **Intelligent Routing**: Health checks direct traffic to operational instances
- **Unified Access**: Custom domain remains constant during failovers
- **Load Distribution**: Traffic balances across regions
- **Latency Optimization**: Route users to nearest healthy region

---

## Failover Scope

### Supported Application Types

The basic failover guidance applies to:
- SAPUI5 applications
- HTML5 applications
- Applications without data persistence
- Applications without in-memory caching
- Applications with data stored in on-premise back-end systems

**Note**: Applications with cloud-based persistence require additional considerations for data synchronization.

---

## Four Core Failover Principles

### 1. Deploy Across Two Data Centers

**Configuration**: Active/Passive

- Primary data center receives normal traffic
- Secondary data center acts as standby
- Switch to secondary only during primary downtime

**Regional Selection Best Practices**:
- Choose regions near users and backend systems
- Example: Frankfurt and Amsterdam for European users
- For performance, consider two data centers in the same geographic area (for example, both in Europe)
- Review data processing restrictions for cross-region deployment

**Legal Consideration**: Cross-region deployment may create data processing compliance issues. Review Data Protection and Privacy documentation before proceeding.

### 2. Keep Applications Synchronized

**Synchronization Options**:

| Method | Effort | Best For |
|--------|--------|----------|
| Manual | High | Infrequent updates |
| CI/CD Pipeline | Medium | Regular deployments |
| Solution Export + Transport Management | Medium | Neo environments |

#### Manual Synchronization

- Apply every modification in both data centers by hand
- Mirror the Git repositories
- Permits deliberately non-identical applications, for example a backup with reduced functionality
- Permits a visually distinct backup, so that users can tell which application they are using

#### CI/CD Pipeline Synchronization

```yaml
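# Illustrative stage layout only, not the syntax of a specific CI/CD tool.
# The step names are placeholders for the tool's own build and deploy steps.
stages:
  - name: Build
    steps:
      - build_mta            # build the MTA archive once

  - name: Deploy Primary
    steps:
      - deploy_to_region_1   # deploy the archive to the primary subaccount

  - name: Deploy Secondary
    steps:
      - deploy_to_region_2   # deploy the same archive to the secondary subaccount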

- Use Project "Piper" pipelines adapted for multi-deployment
- Parallel deployment to subaccounts in different regions
- Both regions receive the same build artifact, so they stay consistent without manual steps
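
The multi-region deployment step can be sketched as follows. This is a minimal illustration, assuming the Cloud Foundry CLI with the MultiApps plugin (which provides `cf deploy`) is available on the build agent. The target list, API endpoints, org and space names, archive path, environment variable names, and the `buildDeployPlan` helper are all assumptions made for the example. The function only builds the command list, so it can be reviewed or handed to a shell step in whatever pipeline tool is in use.

```javascript
// Hypothetical targets: replace endpoints, orgs, and spaces with your own subaccounts.
const targets = [
  { name: 'primary', api: 'https://api.cf.eu10.hana.ondemand.com', org: 'my-org-eu10', space: 'prod' },
  { name: 'secondary', api: 'https://api.cf.eu20.hana.ondemand.com', org: 'my-org-eu20', space: 'prod' },
];

// Build the CLI commands that deploy one MTA archive to every target.
// Deploying the same archive everywhere is what keeps the regions identical.
function buildDeployPlan(archivePath, regions) {
  return regions.map(({ name, api, org, space }) => ({
    name,
    commands: [
      `cf api ${api}`,
      // Assumption: the pipeline injects credentials as environment variables.
      'cf auth "$CF_USER" "$CF_PASSWORD"',
      `cf target -o ${org} -s ${space}`,
      `cf deploy ${archivePath} -f`,
    ],
  }));
}

const plan = buildDeployPlan('mta_archives/app_1.0.0.mtar', targets);
for (const step of plan) {
  console.log(`# ${step.name}\n${step.commands.join('\n')}`);
}
```

Building the archive once and reusing it for every target avoids the drift that can occur when each region is built separately.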

#### Solution Export Wizard + Cloud Transport Management

1. Export changes as MTA archive from primary (Neo)
2. Import via Transport Management Service (Cloud Foundry)
3. Deploy to secondary data center

### 3. Define Failover Detection

**Detection Mechanisms**:

- Response timeout monitoring (e.g., 25 seconds max)
- HTTP status code checking (5xx errors)
- Health endpoint monitoring

**Implementation Options**:
- Manual code implementation
- Rule-based solutions (e.g., Akamai ION)
- Load balancer health checks

**Detection Behavior**:
- Evaluate only the first HTTP request to the application URL
- Ignore failures of subsequent requests, so that a single unavailable resource does not trigger a failover
- When a failure is detected, present an HTML down page that links to the secondary application

**Note**: In basic scenarios, "the failover itself is therefore manually performed by the user" via the down page link.
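
The detection rules above can be sketched in a few functions. This is a minimal illustration, assuming Node.js 18 or later (for the global `fetch` and `AbortSignal.timeout`). The function names, the result shape, and the down page markup are assumptions made for the example. The 25-second timeout and the 5xx rule come from the list above.

```javascript
// Threshold from the text: treat the primary as down after 25 s without a response.
const TIMEOUT_MS = 25_000;

// Pure decision function: does this probe result mean the primary is down?
function isPrimaryDown({ status, unreachable }) {
  if (unreachable) return true;
  return status >= 500 && status <= 599;
}

// Probe the application's entry URL once. Only this first request is evaluated,
// so a single failing sub-resource (image, script) cannot trigger a failover.
async function probePrimary(url, timeoutMs = TIMEOUT_MS) {
  try {
    const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { status: response.status, unreachable: false };
  } catch (err) {
    // Timeouts and network errors both mean the primary did not answer in time.
    return { status: 0, unreachable: true };
  }
}

// Minimal down page offering the manual switch to the secondary application.
function renderDownPage(secondaryUrl) {
  return `<!doctype html><html><body>
<h1>Service temporarily unavailable</h1>
<p><a href="${secondaryUrl}">Continue on the backup site</a></p>
</body></html>`;
}
```

Keeping the decision in a pure function (`isPrimaryDown`) makes the failover rule easy to unit test without network access. A 4xx response is deliberately not treated as an outage here.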

### 4. Plan for Failback

**Active/Active Setup**:
- Applications identical in both data centers
- Same functionality everywhere
- Failback happens implicitly with the next failover event
- Not mandatory to explicitly return to primary

**Active/Passive Setup**:
- Applications may differ between data centers
- Reduced functionality in backup acceptable
- Failback to primary is mandatory
- Must restore full functionality

**Recommended Failback Approach**:
- User-driven failback model
- Visual differentiation reminds users to switch back
- Allow transactions to complete without interruption
- Prioritize completion over automatic recovery speed
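
The user-driven failback model can be sketched as a small decision function. This is an illustration only. The function name, the mode and site identifiers, and the banner texts are assumptions made for the example. It encodes the rules above: no failback is needed in an active/active setup, and in an active/passive setup the user is reminded to switch back once the primary is healthy, without being forced to.

```javascript
// Decide what to tell a user who is currently served by `servingSite`.
// mode: 'active-active' | 'active-passive'; servingSite: 'primary' | 'secondary'.
function failbackAdvice({ mode, servingSite, primaryHealthy }) {
  if (servingSite === 'primary') return { action: 'none' };

  // Active/active: both sites are equivalent, so nobody has to switch back.
  if (mode === 'active-active') return { action: 'none' };

  // Active/passive: the backup may offer reduced functionality, so returning
  // is mandatory, but only once the primary is healthy again.
  if (!primaryHealthy) {
    return { action: 'stay', banner: 'You are using the backup site.' };
  }
  return {
    action: 'offer-failback',
    banner: 'The primary site is available again. Finish your current task, then switch back.',
  };
}

console.log(failbackAdvice({ mode: 'active-passive', servingSite: 'secondary', primaryHealthy: true }).action);
```

The function never redirects on its own. It only produces a reminder, which leaves running transactions untouched.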

---

## Multi-Region Reference Use Cases

### Available Implementations

| Scenario | Components | Resources |
|----------|------------|-----------|
| **SAP Build Work Zone + Azure Traffic Manager** | Work Zone, Azure | Blog post, GitHub, Discovery Center mission |
| **SAP Build Work Zone + Amazon Route 53** | Work Zone, AWS | Blog post, GitHub |
| **CAP Applications + SAP HANA Cloud** | CAP, HANA Cloud multi-zone | GitHub repository |
| **CAP Applications + Amazon Aurora** | CAP, Aurora read replica | GitHub repository |
| **SAP Cloud Integration + Azure Traffic Manager** | CPI, Azure | GitHub, Discovery Center |

### Architecture: SAP Build Work Zone + Azure Traffic Manager

```
User Request
     │
     ▼
Azure Traffic Manager
(Priority-based routing)
     │
     ├──► Primary Region (Priority 1)
     │    └── SAP Build Work Zone Instance
     │
     └──► Secondary Region (Priority 2)
          └── SAP Build Work Zone Instance (standby)
```
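
Priority-based routing, as shown in the diagram, can be modelled in a few lines. This sketch does not use the Azure API. It assumes, for illustration, a list of endpoints with a numeric priority and a health flag, and picks the healthy endpoint with the lowest priority value. The host names and the `selectEndpoint` helper are hypothetical.

```javascript
// Hypothetical endpoint list; `healthy` would come from the traffic manager's probes.
const endpoints = [
  { name: 'primary', host: 'workzone-primary.example.com', priority: 1, healthy: true },
  { name: 'secondary', host: 'workzone-secondary.example.com', priority: 2, healthy: true },
];

// Return the healthy endpoint with the lowest priority value, or null if none is healthy.
function selectEndpoint(candidates) {
  const healthy = candidates.filter((endpoint) => endpoint.healthy);
  if (healthy.length === 0) return null;
  return healthy.reduce((best, endpoint) => (endpoint.priority < best.priority ? endpoint : best));
}

console.log(selectEndpoint(endpoints).name); // primary

// Simulate an outage of the primary region.
endpoints[0].healthy = false;
console.log(selectEndpoint(endpoints).name); // secondary
```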

### Architecture: CAP + HANA Cloud Multi-Zone

```
     Application Load Balancer
              │
     ┌────────┴────────┐
     ▼                 ▼
CAP App (AZ1)     CAP App (AZ2)
     │                 │
     └────────┬────────┘
              ▼
    SAP HANA Cloud
    (Multi-zone replication)
```

---

## Data Backup and Resilience

### SAP-Managed Backups

| Service | Backup Type | Retention | Notes |
|---------|-------------|-----------|-------|
| **SAP HANA Cloud** | Continuous | As configured | Database recovery supported |
| **PostgreSQL (Hyperscaler)** | Point-in-time | 14 days | Restore by creating new instance |
| **Redis** | None | N/A | No persistence support |
| **Object Store** | None | N/A | Use versioning for protection |

### Object Store Protection Mechanisms

- **Object Versioning**: Recover from accidental deletion
- **Expiration Rules**: Automatic version cleanup
- **Deletion Prevention**: AWS S3 buckets, Azure containers

### Runtime-Specific Backup Strategies

| Runtime | Strategy |
|---------|----------|
| **Cloud Foundry** | Multi-AZ replication within region |
| **Kyma** | Managed K8s snapshots (excludes volumes) |
| **Neo** | Cross-region data copies |

### Customer-Managed Backups

**Critical**: SAP doesn't manage backups of service configurations. You are responsible for backing up your service-specific configurations.

**Key Responsibilities**:

| Responsibility | Details |
|----------------|---------|
| **Self-Service Backup** | You must back up service configurations yourself |
| **Service Documentation Review** | Consult each service's documentation for backup capabilities |
| **Service Limitations Awareness** | Some services don't support user-specific configuration backups |
| **Risk Mitigation** | Regular configuration backups guard against accidental data loss; the appropriate frequency varies by service |

**Actions Required**:
1. Identify all SAP BTP services currently in use
2. Review service-specific backup documentation
3. Understand which services lack backup capabilities
4. Implement backup strategies for supported configurations
5. Plan accordingly for services without backup features
6. Ensure business continuity plans include config recovery

**Note**: If backup information is unavailable for a service, contact SAP support channels.

---

## Kyma Cluster Sharing and Isolation

### When to Share Clusters

**Recommended**:
- Small teams with modest resource needs
- Development and testing environments
- Non-critical workloads
- Teams with established trust

**Not Recommended**:
- Multiple external customers
- Production workloads with strict isolation
- Untrusted tenants

### Control Plane Isolation Strategies

| Strategy | Description |
|----------|-------------|
| **Namespaces** | Isolate API resources within cluster |
| **RBAC** | Manage permissions within namespaces |
| **Global Resources** | Admin-managed CRDs and global objects |
| **Policy Engines** | Gatekeeper, Kyverno for compliance |
| **Resource Quotas** | Set limits per tenant/namespace |

### Data Plane Isolation Strategies

| Strategy | Description |
|----------|-------------|
| **Network Policies** | Restrict inter-namespace traffic |
| **Centralized Observability** | Cluster-wide metrics tracking |
| **Service Mesh (Istio)** | mTLS, dedicated ingress per tenant |

### Sample Network Policy

```yaml
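apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: tenant-a
spec:
  podSelector: {}   # applies to every pod in tenant-a
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace (Kubernetes 1.21+)
              kubernetes.io/metadata.name: tenant-a
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: tenant-a
    # Assumption: cluster DNS runs in kube-system. Without this rule,
    # pods in tenant-a cannot resolve service names.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53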
```

---

## Disaster Recovery Planning

### Recovery Time Objective (RTO)

Depends on:
- Failover detection speed
- Application synchronization lag
- User-driven vs automatic failover

### Recovery Point Objective (RPO)

Depends on:
- Synchronization frequency
- Data persistence strategy
- Backup retention periods
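
A rough worked example shows how these factors combine. This is a simplified model, not an SAP formula. It assumes that detection requires a number of consecutive failed probes and that the worst-case data loss equals the synchronization interval. All parameter names and values are hypothetical, except the 25-second probe timeout, which is taken from the failover detection section.

```javascript
// Worst-case RTO: time to notice the outage plus time until users reach the secondary.
function estimateRtoSeconds({ probeIntervalSec, probeTimeoutSec, failuresBeforeFailover, switchSec }) {
  const detectionSec = failuresBeforeFailover * (probeIntervalSec + probeTimeoutSec);
  return detectionSec + switchSec;
}

// Worst-case RPO: everything written since the last successful sync or backup is lost.
function estimateRpoSeconds({ syncIntervalSec }) {
  return syncIntervalSec;
}

const rto = estimateRtoSeconds({
  probeIntervalSec: 30,
  probeTimeoutSec: 25,
  failuresBeforeFailover: 3,
  switchSec: 60, // e.g. DNS TTL, or the time a user needs to follow the down page link
});
const rpo = estimateRpoSeconds({ syncIntervalSec: 900 });

console.log(`RTO about ${rto} s, RPO about ${rpo} s`); // RTO about 225 s, RPO about 900 s
```

With user-driven failover, `switchSec` depends on how quickly users follow the down page link, so it is usually the least predictable term.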

### DR Checklist

- [ ] Define RTO and RPO requirements
- [ ] Select multi-region architecture
- [ ] Implement application synchronization
- [ ] Configure failover detection
- [ ] Test failover procedures regularly
- [ ] Document failback process
- [ ] Train operations team
- [ ] Establish communication protocols

---

## Resilient Application Development

### Design Principles

1. **Statelessness**: Minimize in-memory state
2. **Idempotency**: Safe to retry operations
3. **Circuit Breakers**: Graceful degradation
4. **Health Endpoints**: Enable monitoring
5. **Graceful Shutdown**: Complete in-flight requests
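
The circuit breaker principle can be illustrated with a minimal sketch. It is a simplified version of the pattern, not a production library. The class name, option names, and default values are assumptions made for the example. After a configurable number of consecutive failures the breaker opens and returns the fallback immediately. Once the reset timeout has elapsed, it lets a trial call through.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;        // injectable clock, which simplifies testing
    this.failures = 0;
    this.openedAt = null;  // null means the circuit is closed
  }

  // Run `operation`; on failure, or while the circuit is open, return `fallback()`.
  async call(operation, fallback) {
    const isOpen = this.openedAt !== null;
    if (isOpen && this.now() - this.openedAt < this.resetTimeoutMs) {
      return fallback(); // open: fail fast without calling the dependency
    }
    // Closed, or half-open after the reset timeout: try the real operation.
    try {
      const result = await operation();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      return fallback();
    }
  }
}
```

A caller would wrap each remote dependency in its own breaker, for example `breaker.call(() => fetchOrders(), () => cachedOrders)`, where both functions are hypothetical. This sketch does not limit concurrent trial calls in the half-open state.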

### Health Endpoint Example

```javascript
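const express = require('express');
const app = express();

// Placeholder checks: replace with real probes that resolve to { status: 'UP' | 'DOWN' }.
const checkDatabase = async () => ({ status: 'UP' });
const checkCache = async () => ({ status: 'UP' });
const checkExternalApi = async () => ({ status: 'UP' });

app.get('/health', async (req, res) => {
  // Run the dependency checks in parallel; a rejected check counts as DOWN.
  const probes = { database: checkDatabase, cache: checkCache, externalApi: checkExternalApi };
  const results = await Promise.all(
    Object.values(probes).map((probe) => probe().catch(() => ({ status: 'DOWN' })))
  );
  const checks = Object.fromEntries(Object.keys(probes).map((name, i) => [name, results[i]]));

  const allHealthy = results.every((check) => check.status === 'UP');

  // Report the aggregate status so the body agrees with the HTTP status code.
  res.status(allHealthy ? 200 : 503).json({ status: allHealthy ? 'UP' : 'DOWN', checks });
});

app.listen(process.env.PORT || 3000);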
```

---

**Source Documentation**:
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/planning-failover-on-sap-btp-8c46464.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/planning-failover-on-sap-btp-8c46464.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/implementing-failover-df972c5.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/implementing-failover-df972c5.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/deploy-your-application-in-two-data-centers-61d08d8.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/deploy-your-application-in-two-data-centers-61d08d8.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/keep-the-two-applications-in-sync-e6d2bdb.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/keep-the-two-applications-in-sync-e6d2bdb.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/define-how-a-failover-is-detected-88b86db.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/define-how-a-failover-is-detected-88b86db.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/decide-on-the-failback-963f962.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/decide-on-the-failback-963f962.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/multi-region-usecases.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/multi-region-usecases.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/data-backups-managed-by-sap-6c1e071.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/deploy-and-deliver/data-backups-managed-by-sap-6c1e071.md)
- [https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/sharing-clusters-in-kyma-57ec1ea.md](https://github.com/SAP-docs/btp-best-practices-guide/blob/main/docs/set-up-and-plan/sharing-clusters-in-kyma-57ec1ea.md)