PhD Research Proposal: Designing Survivable Hybrid Cloud Architectures for Critical Infrastructure Operational Resilience
Executive Summary
This research addresses a critical gap in cloud computing resilience: how organisations can maintain operational continuity during catastrophic third-party service failures. Analysis of six major incidents in 2025 (AWS, JLR, M&S, Renault, Cloudflare, Gainsight) reveals a pattern: modern disruptions stem from configuration errors, supply-chain vulnerabilities, and identity-layer attacks rather than infrastructure failures. This proposal presents a design-science approach to develop, implement, and validate a "Survivable Hybrid Cloud" architecture that combines cloud benefits with local failover capabilities and is applicable across critical sectors including finance, healthcare, manufacturing, and government services.
Research Evolution & Scope Expansion
From Financial Outages to Systemic Third-Party Risk
The initial research focus (October 2025) centred on financial-sector resilience following AWS outages. Subsequent weekly breach analysis revealed that this is not a financial-sector problem but a cross-industry systemic vulnerability:
- Manufacturing: JLR £1.3B loss, Renault £380M loss (OT/IT convergence risks)
- Retail: M&S £300M loss (SaaS integration vulnerabilities)
- Infrastructure: Cloudflare global disruption (centralised control-plane fragility)
- SaaS Ecosystem: Gainsight breach (OAuth token supply-chain risk)
- Cloud Providers: AWS cascade failures (maintenance-window exploitation)
Revised Scope: The research now addresses critical infrastructure resilience broadly, with the financial sector as the primary case study and automotive, retail, and government as validation domains.
1. Background and Rationale
Catalyst Event: 20 October 2025
The AWS Global/London Region outage on 20 October 2025 demonstrated how a planned maintenance window, when combined with a coordinated 6 Tbps DDoS attack, could cascade across multiple geographic regions, halting critical financial services for six hours. This incident prompted systematic investigation into third-party cloud dependencies.
Emerging Pattern: Configuration Over Infrastructure
Analysis of 2025 incidents reveals a fundamental shift in failure modes:
| Historical Failures (Pre-2020) | Modern Failures (2020-2025) |
|---|---|
| Hardware failures | Configuration errors |
| Network outages | Automation logic defects |
| Power failures | Supply-chain compromises |
| Capacity exhaustion | Identity-layer attacks |
| Physical attacks | Control-plane vulnerabilities |
Professional Context
Three decades of financial-sector systems consulting, specialising in large-scale software integration, operational resilience, and continuity planning, combined with electronic engineering and software engineering qualifications, provides both analytical capabilities and domain expertise to pursue this research.
Systemic Risk Identified
Current business continuity frameworks assume independent failures. Evidence shows:
- Tier-2 suppliers serve multiple competitors (Renault/JLR shared vulnerability)
- OAuth integrations create transitive trust chains (Gainsight→Salesforce cascade)
- CDN concentration creates shared-fate domains (Cloudflare fronts 20%+ of top sites)
- Credential reuse enables cross-border propagation (Renault France→UK→supplier network)
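The shared-fate observation above can be made concrete with a small graph traversal. The following is a minimal sketch, assuming a hand-written dependency map; every service and provider name in it is a hypothetical placeholder rather than data from the incident analysis. It walks transitive dependencies (the "trust chain" idea) and reports providers that more than one service ultimately relies on.

```python
from collections import defaultdict

# Hypothetical dependency graph: each key depends directly on the listed providers.
# All names are illustrative placeholders, not data from the incident analysis.
DEPENDS_ON = {
    "bank_portal": ["cdn_a", "crm_saas"],
    "retail_checkout": ["cdn_a", "payments_api"],
    "crm_saas": ["oauth_plugin"],
    "payments_api": ["cloud_x"],
    "oauth_plugin": ["cloud_x"],
    "cdn_a": [],
    "cloud_x": [],
}


def transitive_dependencies(service, graph):
    """Return every provider reachable from `service`, directly or indirectly."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def shared_fate(services, graph):
    """Map each provider to the services that transitively rely on it,
    keeping only providers shared by two or more services."""
    dependants = defaultdict(set)
    for service in services:
        for provider in transitive_dependencies(service, graph):
            dependants[provider].add(service)
    return {p: s for p, s in dependants.items() if len(s) >= 2}


shared = shared_fate(["bank_portal", "retail_checkout"], DEPENDS_ON)
for provider, services in sorted(shared.items()):
    print(provider, "is shared by", sorted(services))
```

In this toy graph neither service lists `cloud_x` as a direct dependency, yet both reach it through an intermediary, which is the kind of hidden coupling that an independence assumption would miss.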
2. Research Gap
Existing literature addresses cloud security, availability metrics, and provider reliability in isolation. Critical gaps remain:
Gap 1: Holistic Resilience
No framework combines cloud benefits, local infrastructure, and governance in a unified resilience model.
Gap 2: Third-Party Risk
Supply-chain vulnerabilities in SaaS/PaaS ecosystems remain under-researched despite the increasing number of incidents.
Gap 3: Empirical Validation
Limited evidence exists showing how architectural patterns improve recovery in regulated environments.
Gap 4: War Scenarios
No research addresses planning for catastrophic infrastructure failure (nation-state attacks, physical infrastructure loss).
Novel Contribution
This research bridges these gaps through a design-science approach: building and evaluating artefacts that demonstrate measurable improvements in recovery time, data integrity, and auditability under realistic failure scenarios, including:
- Total cloud provider unavailability
- Supply-chain compromise cascades
- Internet infrastructure attacks
- Physical data centre seizure (war scenario)
3. Research Questions
| Question | Scope & Rationale |
|---|---|
| RQ1: Service Continuity | How can a critical infrastructure organisation maintain service continuity during total third-party web-service loss? Focus: Technical architecture enabling zero-downtime failover from cloud to local infrastructure |
| RQ2: Architectural Patterns | What architectural and governance patterns characterise "survivable" hybrid cloud systems? Focus: Design patterns, reference architectures, and maturity models for resilience |
| RQ3: Validation & Compliance | How can these patterns be modelled, implemented, and empirically validated against RTO/RPO targets and regulatory frameworks (Basel III, NIS2, DORA)? Focus: Measurable outcomes aligned with regulatory requirements |
| RQ4: Risk Assessment | How can organisations quantify their third-party dependency risk exposure? Focus: Practical tools for dependency scanning, risk scoring, and decision support |
| RQ5: War Scenarios | What resilience strategies apply during catastrophic infrastructure failures (nation-state attacks, physical seizure, internet partitioning)? Focus: Extreme but realistic scenarios requiring complete autonomy from external services |
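To illustrate what validation against RTO/RPO targets (RQ3) could look like in practice, the sketch below derives the achieved recovery time and recovery point from three timestamps recorded during a single failure-injection run. The `FailureTrial` schema, the timestamps, and the target values are assumptions made for illustration only; they are not results or requirements from this proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailureTrial:
    """Timestamps recorded during one failure-injection run (hypothetical schema)."""
    last_replicated_write: datetime  # last write confirmed on the local replica
    failure_injected: datetime       # moment the cloud service was cut off
    service_restored: datetime       # first successful request served locally


def achieved_rto(trial):
    """Recovery time: how long the service was unavailable."""
    return trial.service_restored - trial.failure_injected


def achieved_rpo(trial):
    """Recovery point: window of writes that may be lost at failover."""
    return max(trial.failure_injected - trial.last_replicated_write, timedelta(0))


def meets_targets(trial, rto_target, rpo_target):
    """True when both measured values are within their targets."""
    return achieved_rto(trial) <= rto_target and achieved_rpo(trial) <= rpo_target


# Illustrative values only: replication lagged 2 s, recovery took 45 s.
cutoff = datetime(2025, 1, 1, 12, 0, 0)
trial = FailureTrial(
    last_replicated_write=cutoff - timedelta(seconds=2),
    failure_injected=cutoff,
    service_restored=cutoff + timedelta(seconds=45),
)
print(achieved_rto(trial), achieved_rpo(trial))
print(meets_targets(trial, timedelta(seconds=60), timedelta(seconds=5)))
```

Repeating such trials across failure scenarios would give a distribution of recovery times rather than a single figure, which is likely to be more useful when comparing against traditional DR approaches.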
4. Research Objectives
- Design Reference Architecture: Create validated hybrid-cloud resilience architecture integrating cloud databases, local infrastructure, DNS autonomy, and identity management
- Develop Resilience Metrics: Establish quantitative measures for availability, integrity, latency, and recovery capability across failure scenarios
- Implement Working Prototype: Build functional system demonstrating real-time cloud→local replication with automatic failover
- Empirical Validation: Evaluate through controlled failure injection, transactional workloads, and comparison against traditional DR approaches
- Create Practical Tools: Develop dependency scanner, risk calculator, and vendor assessment matrix for industry use
- Publish Guidelines: Produce design patterns and implementation guidance for financial sector, manufacturing, and critical infrastructure adoption
- Regulatory Alignment: Map architecture to Basel III, NIS2, DORA, and other frameworks; provide compliance guidance
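The automatic failover named in the prototype objective could take many forms. One common approach, shown here as a minimal sketch and not as the prototype's actual design, is a controller that switches traffic to local infrastructure only after several consecutive failed health checks and switches back only after several consecutive successes. The class name, method name, and threshold values are all assumptions.

```python
class FailoverController:
    """Routes traffic to 'cloud' until several consecutive health checks fail,
    then to 'local' until several consecutive checks succeed again."""

    def __init__(self, failure_threshold=3, recovery_threshold=5):
        self.failure_threshold = failure_threshold    # failed checks before failing over
        self.recovery_threshold = recovery_threshold  # good checks before failing back
        self.active = "cloud"
        self._failures = 0
        self._successes = 0

    def record_health_check(self, cloud_healthy):
        """Update the counters with one probe result and return the active target."""
        if cloud_healthy:
            self._failures = 0
            self._successes += 1
            if self.active == "local" and self._successes >= self.recovery_threshold:
                self.active = "cloud"
        else:
            self._successes = 0
            self._failures += 1
            if self.active == "cloud" and self._failures >= self.failure_threshold:
                self.active = "local"
        return self.active


controller = FailoverController(failure_threshold=3, recovery_threshold=2)
history = [controller.record_health_check(ok)
           for ok in [True, False, False, False, True, True]]
print(history)
```

Requiring consecutive results in both directions avoids flapping on a single dropped probe. Under these assumptions, detection time is roughly the probe interval multiplied by the failure threshold, so a 10-second interval with a threshold of 3 would detect an outage in about 30 seconds.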
5. Methodology
Design Science Research (DSR) Approach
Following Hevner et al. (2004), this research employs iterative design-build-evaluate cycles to create and validate artefacts addressing identified problems.
| Phase | Activities | Artefacts | Evaluation Methods |
|---|---|---|---|
| Phase 1: Problem Identification | Systematic incident analysis; Literature review; Practitioner interviews; Regulatory requirement analysis | Incident database (ongoing); Problem taxonomy; Requirements specification | Pattern recognition; Stakeholder validation |
| Phase 2: Solution Design | Reference architecture design; Pattern catalogue development; Metric framework creation; Risk assessment model | Hybrid resilience architecture; Design patterns; Maturity model; Assessment tools | Expert review; Theoretical validation; Feasibility analysis |
| Phase 3: Implementation | Prototype development; Testbed construction; Tool implementation; Documentation creation | Working prototype system; Dependency scanner; Risk calculator; Deployment guides | Unit testing; Integration testing; Security assessment |
| Phase 4: Evaluation | Controlled failure injection; Performance benchmarking; Case study implementation; Practitioner feedback | Benchmark results; Case study reports; Validation data; Lessons learned | Quantitative metrics; Qualitative interviews; Comparative analysis; Statistical validation |
| Phase 5: Dissemination | Journal publications; Conference presentations; Industry workshops; Regulatory engagement | Academic papers; Practitioner guidelines; Policy recommendations; Open-source tools | Peer review; Industry adoption; Regulatory recognition; Citation impact |
Data Collection Strategy
- Incident Data: Weekly analysis of reported outages/breaches (ongoing collection)
- Laboratory Experiments: Controlled testbed simulations with failure injection
- Case Studies: Anonymised data from financial institutions (subject to ethical approval)
- Expert Interviews: IT resilience practitioners, regulators, cloud architects
- Surveys: Industry adoption barriers, risk perceptions, current practices
Ethical Considerations
All data collection will follow institutional ethical guidelines. Industry data will be anonymised; participants will provide informed consent; no live production systems will be disrupted during testing.
6. Expected Contributions
Theoretical Contributions
- Survivable Hybrid Cloud Framework: First comprehensive architectural framework combining cloud economics with local resilience for critical infrastructure
- Third-Party Risk Model: Taxonomy and measurement model for supply-chain vulnerabilities in interconnected SaaS ecosystems
- Resilience Maturity Model: Assessment framework enabling organisations to evaluate and improve their defensive IT posture
- War Scenario Planning Framework: Methodology for catastrophic failure preparation applicable to nation-state threats
Practical Contributions
- Reference Implementation: Working prototype demonstrating cloud→local replication with sub-minute failover
- Assessment Tools: Dependency scanner, risk calculator, vendor assessment matrix (open-source)
- Design Patterns: Catalogue of proven patterns for hybrid resilience implementation
- Regulatory Guidance: Mapping of architecture to Basel III, NIS2, DORA compliance requirements
- Industry Guidelines: Practical deployment guides for financial, manufacturing, and government sectors
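As one illustration of what a risk calculator might compute, the sketch below applies the Herfindahl-Hirschman index, a standard concentration measure, to the spread of an organisation's services across providers. Its use here is an assumption for illustration and not the scoring model this research will develop; the service and provider names are hypothetical.

```python
from collections import Counter


def concentration_index(provider_by_service):
    """Herfindahl-Hirschman index over provider shares.

    Ranges from 1/n (services spread evenly over n providers)
    up to 1.0 (every service depends on a single provider).
    """
    counts = Counter(provider_by_service.values())
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical mapping of critical services to the provider each relies on.
services = {
    "payments": "cloud_x",
    "crm": "cloud_x",
    "email": "cloud_x",
    "dns": "cdn_a",
}
index = concentration_index(services)
print(round(index, 3))
```

A single number like this hides transitive dependencies and differences in service criticality, so it would at most be one input to a broader risk score.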
Publications Plan
| Year | Target Venue | Paper Focus |
|---|---|---|
| 1 | IEEE Cloud Computing Conference | Incident analysis & pattern recognition |
| 2 | ACM Transactions on Internet Technology | Reference architecture & design patterns |
| 3 | Journal of Information Security | Risk assessment framework & validation |
| 3 | IEEE Transactions on Dependable Systems | Empirical evaluation & case studies |
| 4 | Practitioner venues (IEEE Security & Privacy) | Implementation guidelines & tools |
7. Research Timeline (4 Years)
Year 1
- Comprehensive literature review
- Complete incident database (50+ cases)
- Design reference architecture
- Develop assessment tools
- Output: Conference paper on incident patterns
Year 2
- Build prototype system
- Construct testbed environment
- Conduct controlled experiments
- Initial case study implementations
- Output: Journal paper on architecture
Year 3
- Refine based on evaluation results
- Conduct industry case studies
- Validate with practitioners
- Regulatory framework mapping
- Output: Journal paper on validation & tools
Year 4
- Thesis writing
- Final journal submissions
- Practitioner guidelines publication
- Open-source tool release
- Viva preparation
8. Candidate Profile
Academic Qualifications
- B.Sc. Electronic Engineering (1973) - Hardware/systems foundation
- M.Sc. Software Engineering - Software architecture & development expertise
Professional Experience
- 30+ years financial-sector systems consulting
- Specialisations: Large-scale software integration, operational resilience, continuity planning
- Experience with: Global banks, regulatory compliance, mission-critical systems
- Technical expertise: Distributed systems, database architecture, security engineering
Current Research Activity
- Ongoing incident analysis: Weekly breach/outage documentation (Nov 2025 onwards)
- Prototype development: Survivable hybrid cloud testbed in progress
- Tool development: Dependency scanner and risk calculator prototypes
- Case study collection: Anonymised data from consulting engagements
Why PhD Now
Professional experience reveals a critical gap between academic theory and industry practice in cloud resilience. Recent incidents demonstrate an urgent need for rigorous research that produces actionable solutions. A PhD provides:
- Framework to systematically investigate complex socio-technical resilience problems
- Credibility to influence regulatory policy and industry standards
- Access to research resources and academic networks
- Opportunity to disseminate findings through peer-reviewed publications
9. Supervisory Requirements & Collaboration
Ideal Supervisory Team
Primary Supervisor
An academic in Computer Science or Information Systems with expertise in distributed systems, cloud computing, resilience engineering, or fault tolerance
Co-Supervisor
An academic in Cybersecurity or Information Assurance with a focus on critical infrastructure protection, risk management, or security architecture
Industry Advisor (Optional)
Financial sector CTO/CISO or regulatory body representative for validation and industry engagement
Potential Collaboration Areas
- Cloud Computing Research Groups: Fault tolerance, availability, performance
- Cybersecurity Centres: Critical infrastructure protection, incident response
- Software Engineering: Design patterns, architecture evaluation
- Information Systems: Business continuity, governance, regulatory compliance
- Applied Mathematics: Risk modelling, reliability theory
Industry Partnerships
Existing professional network provides access to:
- Global financial institutions: Case study sites, validation environments
- Cloud service providers: Technical insights, architectural reviews
- Regulatory bodies: Policy alignment, compliance guidance
- Industry associations: Dissemination channels, practitioner feedback
10. Foundational Literature & Key References
Regulatory & Policy Frameworks
- Basel Committee on Banking Supervision (2023). Principles for Operational Resilience. Bank for International Settlements
- European Banking Authority (2024). Guidelines on ICT and Security Risk Management (EBA/GL/2024/XX)
- European Union Agency for Cybersecurity - ENISA (2024). Guidelines on Cloud Outage Resilience for Critical Infrastructure
- European Commission (2022). Digital Operational Resilience Act (DORA) - Regulation (EU) 2022/2554
- European Commission (2022). Network and Information Security Directive 2 (NIS2) - Directive (EU) 2022/2555
Research Methodology
- Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design Science in Information Systems Research. MIS Quarterly, 28(1), 75-105
- Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems, 24(3), 45-77
- Gregor, S., & Hevner, A. R. (2013). Positioning and Presenting Design Science Research for Maximum Impact. MIS Quarterly, 37(2), 337-355
Resilience Engineering
- Laprie, J. C. (2008). From Dependability to Resilience. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
- Woods, D. D. (2015). Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety, 141, 5-9
- Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience Engineering: Concepts and Precepts. Ashgate Publishing
Cloud Computing & Distributed Systems
- Armbrust, M., et al. (2010). A View of Cloud Computing. Communications of the ACM, 53(4), 50-58
- Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599-616
- Gill, S. S., et al. (2022). Quantum and blockchain computing for cloud, IoT, and 5G networks: Challenges and opportunities. IEEE Internet of Things Journal, 9(15), 12827-12847
Supply Chain & Third-Party Risk
- Boyens, J., Paulsen, C., Moorthy, R., & Bartol, N. (2015). Supply Chain Risk Management Practices for Federal Information Systems and Organizations. NIST Special Publication 800-161
- Craigen, D., Diakun-Thibault, N., & Purse, R. (2014). Defining Cybersecurity. Technology Innovation Management Review, 4(10), 13-21
- Gordon, L. A., Loeb, M. P., & Zhou, L. (2020). Investing in Cybersecurity: Insights from the Gordon-Loeb Model. Journal of Information Security, 11(2), 49-59
Case Study & Incident Analysis
- National Cyber Security Centre - NCSC UK (2025). Annual Threat Assessment: Critical National Infrastructure
- Cloudflare Inc. (2025). Post-Incident Reports Archive [Online]. Available: https://www.cloudflarestatus.com
- Amazon Web Services (2025). AWS Service Health Dashboard - Historical Incidents
- Mandiant (2024). M-Trends 2024: A View from the Front Lines. FireEye/Mandiant Threat Intelligence
Emerging & Related Research
- Opara-Martins, J., Sahandi, R., & Tian, F. (2016). Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective. Journal of Cloud Computing, 5(1), 1-18
- Zhao, Q., et al. (2023). Understanding the Security Risks of Cloud Provider Supply Chains. In Proceedings of USENIX Security Symposium
- Recent publications on OAuth security, SaaS integration risks, and OT/IT convergence vulnerabilities (2023-2025)
11. Research Risks & Mitigation Strategies
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Industry access limitations | Medium | High | Leverage existing professional network; anonymisation protocols; laboratory simulations as alternative |
| Rapidly evolving technology landscape | High | Medium | Focus on architectural principles over specific technologies; iterative design methodology |
| Difficulty reproducing cloud failures | Medium | Medium | Use chaos engineering techniques; partner with cloud providers for controlled experiments |
| Regulatory framework changes | Medium | Low | Monitor regulatory developments; maintain flexible architecture adaptable to new requirements |
| Limited prior art in war scenarios | Low | Medium | Collaborate with defence researchers; use historical conflict analysis; scenario planning methodology |
12. Expected Impact & Beneficiaries
Academic Impact
- Novel theoretical framework bridging cloud computing, resilience engineering, and risk management
- Empirical evidence base for hybrid architecture effectiveness
- Methodological contribution to design science research in critical infrastructure
- Foundation for future research in defensive IT and strategic retreat concepts
Industry Impact
- Financial Sector: Practical architecture reducing systemic risk, improving regulatory compliance
- Manufacturing: OT/IT resilience patterns preventing production halts
- Retail/E-commerce: Customer data protection during SaaS provider incidents
- Government: Critical infrastructure protection against nation-state threats
- Cloud Providers: Understanding customer resilience requirements
Policy Impact
- Evidence-based recommendations for regulators (FCA, PRA, ECB)
- Input to evolving standards (ISO 27001, NIST frameworks)
- Contribution to NIS2 and DORA implementation guidance
- National cyber security strategy development
Societal Impact
- Reduced economic impact of future cloud outages
- Enhanced critical infrastructure resilience
- Improved public confidence in digital services
- Better preparation for catastrophic scenarios
13. Conclusion
This research addresses a critical and timely problem: the systemic vulnerability created by concentration of critical infrastructure operations on a small number of cloud service providers. Through rigorous design science methodology, it will develop, implement, and validate practical solutions enabling organisations to maintain operational resilience during catastrophic third-party failures.
The combination of academic rigour, professional expertise, and ongoing incident analysis positions this research to make significant theoretical and practical contributions. The "Survivable Hybrid Cloud" framework addresses not only current operational risks but also emerging threats, including nation-state attacks on critical infrastructure.
Most importantly, this research produces immediately actionable outputs: reference architectures, assessment tools, design patterns, and implementation guides enabling organisations to begin improving their resilience posture during the research period rather than waiting for completion.
Research Readiness
Research has already commenced with weekly incident analysis and prototype development. A functional demonstration website implementing defensive coding principles has been built, embodying the core philosophy: zero external dependencies for critical infrastructure.
The candidate brings both academic credentials and three decades of relevant professional experience, and is therefore well placed to bridge the gap between theoretical research and practical implementation in regulated environments.