Application Layer Stability Guarantee
JitAi supports industry-standard and advanced stability assurance measures at the application layer.
Application layer updates are characterized by relatively localized impact, high update frequency, and sensitivity to user experience. This requires balancing stability with the ability to iterate rapidly.
- Localized Impact: Updates affect a single application or a few applications, so risk is controllable
- Rapid Iteration: Supports frequent updates in response to business needs
- User Control: Users choose when to upgrade, reducing the risk of forced upgrades
- Independent Deployment: Updates don't affect other applications, enabling fault isolation
Progressive validation process
Multiple runtime environments
Create multiple runtime environments in the JitAi operations platform, adopting a progressive validation workflow: Test Environment → Beta Environment → Production Environment.
Environment configuration strategy
Test Environment
Purpose: Functional validation and basic performance testing
Environment Characteristics:
- Data Source: Simulated or desensitized data
- Traffic Source: Testing team and developers
- Resource Allocation: Medium-scale resources for functional testing
Validation Focus:
- Business logic correctness
- User interface and interaction experience
- Basic performance and response time
- Integration with other systems
Beta Environment
Purpose: Real-world validation with production data
Environment Characteristics:
- Data Source: Production data (read-only mode or replica)
- Traffic Source: Internal users and limited external users
- Resource Allocation: Near-production scale
Validation Focus:
- Real data compatibility
- Production-grade performance and stability
- End-to-end business process validation
- Data security and consistency
Production Environment
Purpose: Live service for actual users
Environment Characteristics:
- Data Source: Production data
- Traffic Source: All user traffic
- Resource Allocation: Production-grade configuration
Validation Focus:
- Overall system stability
- User experience and satisfaction metrics
- Critical business metrics
- 24/7 availability
Version management and release strategy
| Release Stage | Version Status | Validation Cycle | Pass Criteria | Failure Handling |
|---|---|---|---|---|
| App Repository | Development completed | Code review | Code standards + functional completeness | Return to development for fixes |
| Test Environment | Functional testing | 1-2 days | Functional correctness + basic performance | Return to development stage |
| Beta Environment | Pre-production | 3-5 days | Real data compatibility + production performance | Analyze data issues |
| Production Environment | Production | Continuous monitoring | Stability metrics + user experience | Canary rollback |
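The promotion rules in the table above can be read as a small state machine. The following is a minimal sketch of that reading, not JitAi's implementation: the `Stage` class, the `PIPELINE` list, and the `next_step` function are hypothetical names introduced for illustration, while the stage names and failure-handling text are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    on_failure: str


# Stage order and failure handling copied from the release table.
# The class and variable names are hypothetical, not a JitAi API.
PIPELINE = [
    Stage("App Repository", "Return to development for fixes"),
    Stage("Test Environment", "Return to development stage"),
    Stage("Beta Environment", "Analyze data issues"),
    Stage("Production Environment", "Canary rollback"),
]


def next_step(current: str, passed: bool) -> str:
    """Return the action to take after validating the given stage."""
    names = [stage.name for stage in PIPELINE]
    idx = names.index(current)  # raises ValueError for an unknown stage
    if not passed:
        return PIPELINE[idx].on_failure
    if idx + 1 < len(PIPELINE):
        return f"Promote to {PIPELINE[idx + 1].name}"
    return "Continue monitoring"
```

For example, `next_step("Test Environment", True)` yields a promotion to the Beta Environment, while a failed Beta validation yields the "Analyze data issues" action from the table.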
Canary release mechanism
Node-level canary release
In the JitAi cluster architecture, one JitNode serves as the load balancer, controlling traffic distribution. The runtime environment entry address resolves to this node.
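To make weight-based traffic distribution concrete, here is a minimal sketch of weighted node selection. It assumes the load-balancing node picks a target in proportion to per-node weights; the source does not describe the actual algorithm, and the node names and the `pick_node` function are hypothetical.

```python
import random
from collections import Counter


def pick_node(weights: dict[str, float], rng: random.Random) -> str:
    """Pick a node with probability proportional to its traffic weight."""
    # Nodes with weight 0 are never candidates.
    candidates = [(name, w) for name, w in weights.items() if w > 0]
    if not candidates:
        raise ValueError("no node has a positive traffic weight")
    names, node_weights = zip(*candidates)
    return rng.choices(names, weights=node_weights, k=1)[0]


# Hypothetical three-node cluster with one canary receiving 5% of traffic.
weights = {"node-stable-1": 47.5, "node-stable-2": 47.5, "node-canary": 5.0}
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_node(weights, rng) for _ in range(10_000))
```

Over many simulated requests, the canary's share in `counts` should land close to 5%, which is the behavior the canary stages below rely on.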
Controlling canary release process
Assessing stability and availability
Canary releases require simultaneous assessment of two dimensions: stability and availability.
- Stability: Technical metrics such as error rates and response times
- Availability: Business function uptime and user experience metrics
| Canary Stage | Canary Nodes | Traffic Ratio | Observation Period | Stability Standard | Availability Standard | Exception Handling |
|---|---|---|---|---|---|---|
| Initial canary | 1 node | 5% | 2 hours | Error rate < 0.01% | Business availability > 99.9% | Set traffic weight to 0% |
| Small-scale expansion | 2 nodes | 20% | 4 hours | Error rate < 0.005% | Business availability > 99.95% | Set traffic weight to 0% |
| Medium scale | 50% of nodes | 50% | 8 hours | Error rate < 0.001% | Business availability > 99.98% | Immediate rollback or set traffic to 0% |
| Full release | All nodes | 100% | Continuous monitoring | System stable | Business functioning normally | Emergency rollback |
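The dual assessment can be expressed as a check of both thresholds for a stage. The sketch below encodes the first three rows of the table above; the `CanaryStage` class and the `passes` function are hypothetical names, and how the error rate and availability are actually measured is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStage:
    name: str
    traffic_pct: int
    observe_hours: int
    max_error_rate_pct: float    # stability standard (exclusive upper bound)
    min_availability_pct: float  # availability standard (exclusive lower bound)


# Thresholds copied from the canary stage table; the structure is illustrative.
STAGES = [
    CanaryStage("Initial canary", 5, 2, 0.01, 99.9),
    CanaryStage("Small-scale expansion", 20, 4, 0.005, 99.95),
    CanaryStage("Medium scale", 50, 8, 0.001, 99.98),
]


def passes(stage: CanaryStage, error_rate_pct: float, availability_pct: float) -> bool:
    """A stage passes only if BOTH stability and availability standards are met."""
    stable = error_rate_pct < stage.max_error_rate_pct
    available = availability_pct > stage.min_availability_pct
    return stable and available
```

Note that the same measurements can pass an early stage and fail a later one: an error rate of 0.004% satisfies the initial canary standard but not the medium-scale standard.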
When canary nodes exhibit abnormal behavior, immediately set their traffic weight to 0% for instant fault isolation:
- Instant Response: Cut off traffic to abnormal nodes without waiting for rollback deployment
- User Protection: Ensures user requests aren't routed to problematic nodes
- Quick Recovery: Traffic can be rapidly restored once issues are resolved
- Data Retention: Nodes remain running for analysis and debugging
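As a rough illustration of why zeroing a weight isolates a node, the sketch below sets one node's weight to 0 and recomputes each node's share of traffic. It assumes shares are proportional to weights, which the source does not state explicitly; `isolate`, `traffic_shares`, and the node names are hypothetical.

```python
def isolate(weights: dict[str, float], node: str) -> dict[str, float]:
    """Return a copy of the weights with the given node's weight set to 0."""
    if node not in weights:
        raise KeyError(node)
    updated = dict(weights)  # leave the caller's mapping untouched
    updated[node] = 0.0
    return updated


def traffic_shares(weights: dict[str, float]) -> dict[str, float]:
    """Convert weights into the fraction of traffic each node receives."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node must keep a positive weight")
    return {name: w / total for name, w in weights.items()}


# Hypothetical cluster: the canary is isolated, the stable nodes absorb its traffic.
weights = {"node-stable-1": 47.5, "node-stable-2": 47.5, "node-canary": 5.0}
after = traffic_shares(isolate(weights, "node-canary"))
```

The isolated node keeps running with a share of 0, so it stays available for debugging while the remaining nodes split all user traffic.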
Operating canary release process
Standard release workflow:
- Select canary node: Choose one node as the initial canary
- Adjust traffic weight: Set the node's traffic weight to 5%
- Deploy new version: Deploy the new application version on the canary node
- Start monitoring: Enable comprehensive monitoring and alerting
- Dual assessment: Simultaneously assess stability and availability metrics
- Execute decision: Determine next steps based on assessment results
- Gradual expansion: Progressively increase canary nodes and traffic ratio after stabilization
- Complete release: Upgrade all nodes and restore normal traffic distribution
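The workflow above can be sketched as a loop that raises traffic stage by stage and stops at the first failed assessment. This is an illustration under stated assumptions: deployment and monitoring are abstracted into two caller-supplied callbacks, `set_traffic` and `healthy`, which are hypothetical and do not correspond to a documented JitAi interface. The traffic percentages come from the canary stage table.

```python
from typing import Callable

# Traffic percentages per canary stage, taken from the canary stage table.
ROLLOUT_PCTS = [5, 20, 50, 100]


def run_rollout(
    set_traffic: Callable[[int], None],
    healthy: Callable[[int], bool],
) -> str:
    """Walk through the canary stages, zeroing traffic on the first failure."""
    for pct in ROLLOUT_PCTS:
        set_traffic(pct)
        # `healthy` stands in for the dual stability/availability assessment
        # made over the stage's observation period.
        if not healthy(pct):
            set_traffic(0)  # instant isolation instead of a rollback deployment
            return f"halted at {pct}%"
    return "released"


# Usage: record the traffic changes made during a rollout that fails at 50%.
history: list[int] = []
result = run_rollout(history.append, lambda pct: pct < 50)
```

Here `history` ends up as the sequence 5, 20, 50, 0, showing that a failed stage sets the canary traffic to zero rather than continuing to the next stage.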
Exception handling workflow:
Traffic zeroing steps:
- Anomaly detection: Monitoring system detects stability or availability metric anomalies
- Instant isolation: Set canary node traffic weight to 0% (takes < 10 seconds)
- Status confirmation: Verify user traffic has completely switched to stable nodes
- Problem diagnosis: Analyze and debug issues in the isolated state
- Fix verification: Validate functionality after resolving problems
- Traffic recovery: Gradually restore traffic allocation to the node after verification
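The final step, gradual traffic recovery, can be sketched as a ramp of weights from 0 back to the target. The source does not specify step sizes or counts, so the evenly spaced ramp and the `recovery_schedule` function below are assumptions made for illustration.

```python
def recovery_schedule(target_pct: float, steps: int, fix_verified: bool) -> list[float]:
    """Return the traffic weights to apply, in order, when restoring a node.

    Traffic is only restored after the fix has been verified; otherwise the
    node stays isolated and the schedule is empty.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if not fix_verified:
        return []
    # Evenly spaced increments ending exactly at the target weight.
    return [round(target_pct * i / steps, 2) for i in range(1, steps + 1)]


# Usage: restore a canary to its 5% weight in five increments.
schedule = recovery_schedule(target_pct=5.0, steps=5, fix_verified=True)
```

Each weight in the schedule would be applied in turn, with the stability and availability checks repeated before moving to the next increment.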
Observability
Observability features are currently under development and will be available soon.
Integrating with OpenTelemetry and APM ecosystem
The JitAi Application Runtime Platform supports OpenTelemetry, the industry-standard framework for observability. OpenTelemetry plays an essential role in technology evolution, ecosystem integration, and industry best practices.