PhD Research Proposal: Designing Survivable Hybrid Cloud Architectures for Critical Infrastructure Operational Resilience

Date: 7 November 2025 (Revised: December 2025)
Author: R. Alan Axford
Status: Pre-Application Research Proposal

Executive Summary

This research addresses a critical gap in cloud computing resilience: how organisations can maintain operational continuity during catastrophic third-party service failures. Analysis of six major incidents in 2025 (AWS, JLR, M&S, Renault, Cloudflare, Gainsight) reveals a pattern: modern disruptions stem from configuration errors, supply-chain vulnerabilities, and identity-layer attacks rather than infrastructure failures. This proposal presents a design-science approach to developing, implementing, and validating a "Survivable Hybrid Cloud" architecture that combines cloud benefits with local failover capabilities and is applicable across critical sectors including finance, healthcare, manufacturing, and government services.

Research Evolution & Scope Expansion

From Financial Outages to Systemic Third-Party Risk

Initial research focus (October 2025) centred on financial sector resilience following AWS outages. Subsequent weekly breach analysis revealed this is not a financial-sector problem but a cross-industry systemic vulnerability:

  • Manufacturing: JLR £1.3B loss, Renault £380M loss (OT/IT convergence risks)
  • Retail: M&S £300M loss (SaaS integration vulnerabilities)
  • Infrastructure: Cloudflare global disruption (centralised control-plane fragility)
  • SaaS Ecosystem: Gainsight breach (OAuth token supply-chain risk)
  • Cloud Providers: AWS cascade failures (maintenance-window exploitation)

Revised Scope: The research now addresses critical infrastructure resilience broadly, with the financial sector as the primary case study and automotive, retail, and government as validation domains.

1. Background and Rationale

Catalyst Event: 20 October 2025

The AWS Global/London Region outage on 20 October 2025 demonstrated how a planned maintenance window, when combined with a coordinated 6 Tbps DDoS attack, could cascade across multiple geographic regions, halting critical financial services for six hours. This incident prompted systematic investigation into third-party cloud dependencies.

Emerging Pattern: Configuration Over Infrastructure

Analysis of 2025 incidents reveals a fundamental shift in failure modes:

Historical Failures (Pre-2020)    Modern Failures (2020-2025)
Hardware failures                 Configuration errors
Network outages                   Automation logic defects
Power failures                    Supply-chain compromises
Capacity exhaustion               Identity-layer attacks
Physical attacks                  Control-plane vulnerabilities

Professional Context

Three decades of financial-sector systems consulting, specialising in large-scale software integration, operational resilience, and continuity planning, combined with qualifications in electronic engineering and software engineering, provide both the analytical capability and the domain expertise needed to pursue this research.

Systemic Risk Identified

Current business continuity frameworks assume that failures are independent. The 2025 incidents suggest otherwise:

  • Tier-2 suppliers serve multiple competitors (Renault/JLR shared vulnerability)
  • OAuth integrations create transitive trust chains (Gainsight→Salesforce cascade)
  • CDN concentration creates shared-fate domains (Cloudflare fronts 20%+ of top sites)
  • Credential reuse enables cross-border propagation (Renault France→UK→supplier network)
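
The transitive trust chains and shared-fate domains listed above can be treated as a graph problem. The following is a minimal sketch under stated assumptions, not part of the proposal's tooling: the service names and edges in DEPENDS_ON are hypothetical placeholders, and transitive_dependencies and shared_fate are illustrative helper names.

```python
from collections import deque

# Hypothetical dependency graph: each key depends directly on the listed services.
# All names are placeholders for illustration, not data from the incident analysis.
DEPENDS_ON = {
    "org_a": ["crm_saas", "cdn_x"],
    "org_b": ["analytics_saas", "cdn_x"],
    "crm_saas": ["idp_y"],
    "analytics_saas": ["crm_saas"],
    "cdn_x": [],
    "idp_y": [],
}


def transitive_dependencies(graph, start):
    """Return every service reachable from `start`, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def shared_fate(graph, org_1, org_2):
    """Services whose failure could affect both organisations at once."""
    return transitive_dependencies(graph, org_1) & transitive_dependencies(graph, org_2)


if __name__ == "__main__":
    print(sorted(shared_fate(DEPENDS_ON, "org_a", "org_b")))
```

In this toy graph the two organisations share an identity provider that neither contracts with directly, which is the kind of hidden coupling that an independence assumption overlooks.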

2. Research Gap

Existing literature addresses cloud security, availability metrics, and provider reliability in isolation. Critical gaps remain:

Gap 1: Holistic Resilience

No framework combines cloud benefits, local infrastructure, and governance in a unified resilience model

Gap 2: Third-Party Risk

Supply-chain vulnerabilities in SaaS/PaaS ecosystems remain under-researched despite increasing incidents

Gap 3: Empirical Validation

Limited evidence showing how architectural patterns improve recovery in regulated environments

Gap 4: War Scenarios

No research on catastrophic infrastructure failure planning (nation-state attacks, physical infrastructure loss)

Novel Contribution

This research bridges these gaps through a design-science approach: building and evaluating artefacts that demonstrate measurable improvements in recovery time, data integrity, and auditability under realistic failure scenarios, including:

  • Total cloud provider unavailability
  • Supply-chain compromise cascades
  • Internet infrastructure attacks
  • Physical data centre seizure (war scenario)

3. Research Questions

RQ1: Service Continuity
  Question: How can a critical infrastructure organisation maintain service continuity during total third-party web-service loss?
  Focus: Technical architecture enabling zero-downtime failover from cloud to local infrastructure

RQ2: Architectural Patterns
  Question: What architectural and governance patterns characterise "survivable" hybrid cloud systems?
  Focus: Design patterns, reference architectures, and maturity models for resilience

RQ3: Validation & Compliance
  Question: How can these patterns be modelled, implemented, and empirically validated against RTO/RPO targets and regulatory frameworks (Basel III, NIS2, DORA)?
  Focus: Measurable outcomes aligned with regulatory requirements

RQ4: Risk Assessment
  Question: How can organisations quantify their third-party dependency risk exposure?
  Focus: Practical tools for dependency scanning, risk scoring, and decision support

RQ5: War Scenarios
  Question: What resilience strategies apply during catastrophic infrastructure failures (nation-state attacks, physical seizure, internet partitioning)?
  Focus: Extreme but realistic scenarios requiring complete autonomy from external services
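
To illustrate what the risk scoring asked about in RQ4 might look like, here is a minimal sketch. It is not the proposal's risk model: the Dependency fields, the 1 to 5 criticality scale, and the weights 0.25 and 0.1 are all assumptions chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    criticality: int          # assumed scale: 1 (minor) to 5 (service stops without it)
    has_local_fallback: bool  # True if a local substitute can take over
    transitive_depth: int     # 1 = direct supplier, 2 = supplier's supplier, ...


def exposure(dep: Dependency) -> float:
    """Illustrative exposure score for one dependency (weights are assumptions)."""
    score = float(dep.criticality)
    if dep.has_local_fallback:
        score *= 0.25  # assumed mitigation factor for a working local fallback
    # Deeper dependencies are harder to observe, so weight them up slightly.
    score *= 1 + 0.1 * (dep.transitive_depth - 1)
    return score


def rank_by_exposure(deps):
    """Return dependencies ordered from highest to lowest exposure."""
    return sorted(deps, key=exposure, reverse=True)


if __name__ == "__main__":
    inventory = [
        Dependency("cdn", 5, False, 1),
        Dependency("identity_provider", 4, True, 2),
        Dependency("analytics", 2, False, 1),
    ]
    for dep in rank_by_exposure(inventory):
        print(f"{dep.name}: {exposure(dep):.2f}")
```

Any real scoring scheme would need calibration against incident data; the point here is only that exposure can be expressed as a reproducible function of an inventory rather than a qualitative judgement.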

4. Research Objectives

Primary Objectives

  • Design Reference Architecture: Create validated hybrid-cloud resilience architecture integrating cloud databases, local infrastructure, DNS autonomy, and identity management
  • Develop Resilience Metrics: Establish quantitative measures for availability, integrity, latency, and recovery capability across failure scenarios
  • Implement Working Prototype: Build functional system demonstrating real-time cloud→local replication with automatic failover
  • Empirical Validation: Evaluate through controlled failure injection, transactional workloads, and comparison against traditional DR approaches

Secondary Objectives

  • Create Practical Tools: Develop dependency scanner, risk calculator, and vendor assessment matrix for industry use
  • Publish Guidelines: Produce design patterns and implementation guidance for financial sector, manufacturing, and critical infrastructure adoption
  • Regulatory Alignment: Map architecture to Basel III, NIS2, DORA, and other frameworks; provide compliance guidance
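
The automatic failover named in the prototype objective could be driven by a small state machine over health-probe results. The sketch below is one possible shape, not the prototype's design: FailoverController, its thresholds, and the "cloud"/"local" labels are hypothetical, and the probes themselves (network checks, replication lag) are left out.

```python
class FailoverController:
    """Switches the active backend based on consecutive health-probe results."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 5):
        self.failure_threshold = failure_threshold    # failed probes before failing over
        self.recovery_threshold = recovery_threshold  # healthy probes before failing back
        self.active = "cloud"
        self._failures = 0
        self._successes = 0

    def record_probe(self, cloud_healthy: bool) -> str:
        """Record one probe of the cloud backend and return the active backend."""
        if cloud_healthy:
            self._failures = 0
            self._successes += 1
            if self.active == "local" and self._successes >= self.recovery_threshold:
                self.active = "cloud"
        else:
            self._successes = 0
            self._failures += 1
            if self.active == "cloud" and self._failures >= self.failure_threshold:
                self.active = "local"
        return self.active


if __name__ == "__main__":
    controller = FailoverController(failure_threshold=2, recovery_threshold=2)
    for healthy in [True, False, False, True, True]:
        print(healthy, "->", controller.record_probe(healthy))
```

Requiring several consecutive results in each direction is a common way to avoid flapping between backends on a single dropped probe; the appropriate thresholds would be an empirical question for the evaluation phase.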

5. Methodology

Design Science Research (DSR) Approach

Following Hevner et al. (2004), this research employs iterative design-build-evaluate cycles to create and validate artefacts addressing identified problems.

Phase 1: Problem Identification
  • Activities: Systematic incident analysis; literature review; practitioner interviews; regulatory requirement analysis
  • Artefacts: Incident database (ongoing); problem taxonomy; requirements specification
  • Evaluation Methods: Pattern recognition; stakeholder validation
Phase 2: Solution Design
  • Activities: Reference architecture design; pattern catalogue development; metric framework creation; risk assessment model
  • Artefacts: Hybrid resilience architecture; design patterns; maturity model; assessment tools
  • Evaluation Methods: Expert review; theoretical validation; feasibility analysis
Phase 3: Implementation
  • Activities: Prototype development; testbed construction; tool implementation; documentation creation
  • Artefacts: Working prototype system; dependency scanner; risk calculator; deployment guides
  • Evaluation Methods: Unit testing; integration testing; security assessment
Phase 4: Evaluation
  • Activities: Controlled failure injection; performance benchmarking; case study implementation; practitioner feedback
  • Artefacts: Benchmark results; case study reports; validation data; lessons learned
  • Evaluation Methods: Quantitative metrics; qualitative interviews; comparative analysis; statistical validation
Phase 5: Dissemination
  • Activities: Journal publications; conference presentations; industry workshops; regulatory engagement
  • Artefacts: Academic papers; practitioner guidelines; policy recommendations; open-source tools
  • Evaluation Methods: Peer review; industry adoption; regulatory recognition; citation impact

Data Collection Strategy

  • Incident Data: Weekly analysis of reported outages/breaches (ongoing collection)
  • Laboratory Experiments: Controlled testbed simulations with failure injection
  • Case Studies: Anonymised data from financial institutions (subject to ethical approval)
  • Expert Interviews: IT resilience practitioners, regulators, cloud architects
  • Surveys: Industry adoption barriers, risk perceptions, current practices
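
For the laboratory experiments, each failure-injection run can be reduced to the achieved recovery time and data-loss window and compared with RTO/RPO targets. The following sketch shows that arithmetic under stated assumptions: FailureRun and its field names are hypothetical, and the timestamps are invented for illustration rather than measured results.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailureRun:
    failure_injected_at: datetime
    service_restored_at: datetime
    last_replicated_commit_at: datetime  # newest transaction present on the local replica


def achieved_rto(run: FailureRun) -> timedelta:
    """Time from failure injection until service is available again."""
    return run.service_restored_at - run.failure_injected_at


def achieved_rpo(run: FailureRun) -> timedelta:
    """Window of transactions not yet replicated when the failure occurred."""
    gap = run.failure_injected_at - run.last_replicated_commit_at
    return max(gap, timedelta(0))


def meets_targets(run: FailureRun, rto_target: timedelta, rpo_target: timedelta) -> bool:
    return achieved_rto(run) <= rto_target and achieved_rpo(run) <= rpo_target


if __name__ == "__main__":
    # Invented timestamps for a single illustrative run.
    run = FailureRun(
        failure_injected_at=datetime(2025, 1, 1, 12, 0, 0),
        service_restored_at=datetime(2025, 1, 1, 12, 0, 45),
        last_replicated_commit_at=datetime(2025, 1, 1, 11, 59, 58),
    )
    print("RTO:", achieved_rto(run))
    print("RPO:", achieved_rpo(run))
    print("Within targets:", meets_targets(run, timedelta(minutes=1), timedelta(seconds=5)))
```

Recording runs in this form would allow the same targets to be applied to the hybrid prototype and to a traditional DR baseline.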

Ethical Considerations

All data collection will follow institutional ethical guidelines. Industry data will be anonymised; participants will provide informed consent; no live production systems will be disrupted during testing.

6. Expected Contributions

Theoretical Contributions

  1. Survivable Hybrid Cloud Framework: First comprehensive architectural framework combining cloud economics with local resilience for critical infrastructure
  2. Third-Party Risk Model: Taxonomy and measurement model for supply-chain vulnerabilities in interconnected SaaS ecosystems
  3. Resilience Maturity Model: Assessment framework enabling organisations to evaluate and improve their defensive IT posture
  4. War Scenario Planning Framework: Methodology for catastrophic failure preparation applicable to nation-state threats

Practical Contributions

  1. Reference Implementation: Working prototype demonstrating cloud→local replication with sub-minute failover
  2. Assessment Tools: Dependency scanner, risk calculator, vendor assessment matrix (open-source)
  3. Design Patterns: Catalogue of proven patterns for hybrid resilience implementation
  4. Regulatory Guidance: Mapping of architecture to Basel III, NIS2, DORA compliance requirements
  5. Industry Guidelines: Practical deployment guides for financial, manufacturing, and government sectors
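
As an indication of what a dependency scanner might do at its simplest, the sketch below lists third-party hosts that a web page loads resources from, using only the Python standard library. It is an illustrative assumption about the tool's behaviour, not the proposal's scanner: ExternalHostScanner, scan_external_hosts, and the example.* hostnames are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# (tag, attribute) pairs treated as resource loads in this sketch.
# Ordinary <a href> links are navigation, not load-time dependencies, so they are skipped.
RESOURCE_ATTRS = {("script", "src"), ("img", "src"), ("iframe", "src"), ("link", "href")}


class ExternalHostScanner(HTMLParser):
    """Collects hosts other than `own_host` that the page loads resources from."""

    def __init__(self, own_host: str):
        super().__init__()
        self.own_host = own_host
        self.external_hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in RESOURCE_ATTRS and value:
                host = urlparse(value).netloc  # empty for relative URLs
                if host and host != self.own_host:
                    self.external_hosts.add(host)


def scan_external_hosts(html: str, own_host: str) -> set:
    scanner = ExternalHostScanner(own_host)
    scanner.feed(html)
    scanner.close()
    return scanner.external_hosts


if __name__ == "__main__":
    page = (
        '<link href="https://fonts.example.org/f.css">'
        '<script src="//cdn.example.net/lib.js"></script>'
        '<img src="/logo.png">'
    )
    print(sorted(scan_external_hosts(page, "www.example.com")))
```

A page that returns an empty set under a check like this has no load-time third-party hosts in its markup, although resources fetched by scripts at run time would need separate analysis.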

Publications Plan

Year  Target Venue                                     Paper Focus
1     IEEE Cloud Computing Conference                  Incident analysis & pattern recognition
2     ACM Transactions on Internet Technology          Reference architecture & design patterns
3     Journal of Information Security                  Risk assessment framework & validation
3     IEEE Transactions on Dependable Systems          Empirical evaluation & case studies
4     Practitioner venues (IEEE Security & Privacy)    Implementation guidelines & tools

7. Research Timeline (4 Years)

Year 1: Foundation & Design
  • Comprehensive literature review
  • Complete incident database (50+ cases)
  • Design reference architecture
  • Develop assessment tools
  • Output: Conference paper on incident patterns
Year 2: Implementation & Initial Validation
  • Build prototype system
  • Construct testbed environment
  • Conduct controlled experiments
  • Initial case study implementations
  • Output: Journal paper on architecture
Year 3: Refinement & Industry Validation
  • Refine based on evaluation results
  • Conduct industry case studies
  • Validate with practitioners
  • Regulatory framework mapping
  • Output: Journal paper on validation & tools
Year 4: Completion & Dissemination
  • Thesis writing
  • Final journal submissions
  • Practitioner guidelines publication
  • Open-source tool release
  • Viva preparation

8. Candidate Profile

Academic Qualifications

  • B.Sc. Electronic Engineering (1973) - Hardware/systems foundation
  • M.Sc. Software Engineering - Software architecture & development expertise

Professional Experience

  • 30+ years financial-sector systems consulting
  • Specialisations: Large-scale software integration, operational resilience, continuity planning
  • Experience with: Global banks, regulatory compliance, mission-critical systems
  • Technical expertise: Distributed systems, database architecture, security engineering

Current Research Activity

  • Ongoing incident analysis: Weekly breach/outage documentation (Nov 2025 onwards)
  • Prototype development: Survivable hybrid cloud testbed in progress
  • Tool development: Dependency scanner and risk calculator prototypes
  • Case study collection: Anonymised data from consulting engagements

Why PhD Now

Professional experience reveals a critical gap between academic theory and industry practice in cloud resilience. Recent incidents demonstrate an urgent need for rigorous research that produces actionable solutions. A PhD provides:

  1. Framework to systematically investigate complex socio-technical resilience problems
  2. Credibility to influence regulatory policy and industry standards
  3. Access to research resources and academic networks
  4. Opportunity to disseminate findings through peer-reviewed publications

9. Supervisory Requirements & Collaboration

Ideal Supervisory Team

Primary Supervisor

Computer Science/Information Systems with expertise in: distributed systems, cloud computing, resilience engineering, or fault tolerance

Co-Supervisor

Cybersecurity/Information Assurance with focus on: critical infrastructure protection, risk management, or security architecture

Industry Advisor (Optional)

Financial sector CTO/CISO or regulatory body representative for validation and industry engagement

Potential Collaboration Areas

  • Cloud Computing Research Groups: Fault tolerance, availability, performance
  • Cybersecurity Centres: Critical infrastructure protection, incident response
  • Software Engineering: Design patterns, architecture evaluation
  • Information Systems: Business continuity, governance, regulatory compliance
  • Applied Mathematics: Risk modelling, reliability theory

Industry Partnerships

Existing professional network provides access to:

  • Global financial institutions: Case study sites, validation environments
  • Cloud service providers: Technical insights, architectural reviews
  • Regulatory bodies: Policy alignment, compliance guidance
  • Industry associations: Dissemination channels, practitioner feedback

10. Foundational Literature & Key References

Regulatory & Policy Frameworks

  1. Basel Committee on Banking Supervision (2023). Principles for Operational Resilience. Bank for International Settlements
  2. European Banking Authority (2024). Guidelines on ICT and Security Risk Management (EBA/GL/2024/XX)
  3. European Union Agency for Cybersecurity - ENISA (2024). Guidelines on Cloud Outage Resilience for Critical Infrastructure
  4. European Commission (2022). Digital Operational Resilience Act (DORA) - Regulation (EU) 2022/2554
  5. European Commission (2022). Network and Information Security Directive 2 (NIS2) - Directive (EU) 2022/2555

Research Methodology

  1. Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design Science in Information Systems Research. MIS Quarterly, 28(1), 75-105
  2. Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems, 24(3), 45-77
  3. Gregor, S., & Hevner, A. R. (2013). Positioning and Presenting Design Science Research for Maximum Impact. MIS Quarterly, 37(2), 337-355

Resilience Engineering

  1. Laprie, J. C. (2008). From Dependability to Resilience. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
  2. Woods, D. D. (2015). Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety, 141, 5-9
  3. Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience Engineering: Concepts and Precepts. Ashgate Publishing

Cloud Computing & Distributed Systems

  1. Armbrust, M., et al. (2010). A View of Cloud Computing. Communications of the ACM, 53(4), 50-58
  2. Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599-616
  3. Gill, S. S., et al. (2022). Quantum and blockchain computing for cloud, IoT, and 5G networks: Challenges and opportunities. IEEE Internet of Things Journal, 9(15), 12827-12847

Supply Chain & Third-Party Risk

  1. Boyens, J., Paulsen, C., Moorthy, R., & Bartol, N. (2015). Supply Chain Risk Management Practices for Federal Information Systems and Organizations. NIST Special Publication 800-161
  2. Craigen, D., Diakun-Thibault, N., & Purse, R. (2014). Defining Cybersecurity. Technology Innovation Management Review, 4(10), 13-21
  3. Gordon, L. A., Loeb, M. P., & Zhou, L. (2016). Investing in Cybersecurity: Insights from the Gordon-Loeb Model. Journal of Information Security, 7(2), 49-59

Case Study & Incident Analysis

  1. National Cyber Security Centre - NCSC UK (2025). Annual Threat Assessment: Critical National Infrastructure
  2. Cloudflare Inc. (2025). Post-Incident Reports Archive [Online]. Available: https://www.cloudflarestatus.com
  3. Amazon Web Services (2025). AWS Service Health Dashboard - Historical Incidents
  4. Mandiant (2024). M-Trends 2024: A View from the Front Lines. FireEye/Mandiant Threat Intelligence

Emerging & Related Research

  1. Opara-Martins, J., Sahandi, R., & Tian, F. (2016). Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective. Journal of Cloud Computing, 5(1), 1-18
  2. Zhao, Q., et al. (2023). Understanding the Security Risks of Cloud Provider Supply Chains. In Proceedings of USENIX Security Symposium
  3. Recent publications on OAuth security, SaaS integration risks, and OT/IT convergence vulnerabilities (2023-2025)

11. Research Risks & Mitigation Strategies

Industry access limitations
  • Likelihood: Medium; Impact: High
  • Mitigation: Leverage existing professional network; anonymisation protocols; laboratory simulations as alternative
Rapidly evolving technology landscape
  • Likelihood: High; Impact: Medium
  • Mitigation: Focus on architectural principles over specific technologies; iterative design methodology
Difficulty reproducing cloud failures
  • Likelihood: Medium; Impact: Medium
  • Mitigation: Use chaos engineering techniques; partner with cloud providers for controlled experiments
Regulatory framework changes
  • Likelihood: Medium; Impact: Low
  • Mitigation: Monitor regulatory developments; maintain flexible architecture adaptable to new requirements
Limited prior art in war scenarios
  • Likelihood: Low; Impact: Medium
  • Mitigation: Collaborate with defence researchers; use historical conflict analysis; scenario planning methodology

12. Expected Impact & Beneficiaries

Academic Impact

  • Novel theoretical framework bridging cloud computing, resilience engineering, and risk management
  • Empirical evidence base for hybrid architecture effectiveness
  • Methodological contribution to design science research in critical infrastructure
  • Foundation for future research in defensive IT and strategic retreat concepts

Industry Impact

  • Financial Sector: Practical architecture reducing systemic risk, improving regulatory compliance
  • Manufacturing: OT/IT resilience patterns preventing production halts
  • Retail/E-commerce: Customer data protection during SaaS provider incidents
  • Government: Critical infrastructure protection against nation-state threats
  • Cloud Providers: Understanding customer resilience requirements

Policy Impact

  • Evidence-based recommendations for regulators (FCA, PRA, ECB)
  • Input to evolving standards (ISO 27001, NIST frameworks)
  • Contribution to NIS2 and DORA implementation guidance
  • National cyber security strategy development

Societal Impact

  • Reduced economic impact of future cloud outages
  • Enhanced critical infrastructure resilience
  • Improved public confidence in digital services
  • Better preparation for catastrophic scenarios

13. Conclusion

This research addresses a critical and timely problem: the systemic vulnerability created by concentration of critical infrastructure operations on a small number of cloud service providers. Through rigorous design science methodology, it will develop, implement, and validate practical solutions enabling organisations to maintain operational resilience during catastrophic third-party failures.

The combination of academic rigour, professional expertise, and ongoing incident analysis positions this research to make significant theoretical and practical contributions. The "Survivable Hybrid Cloud" framework addresses not only current operational risks but also emerging threats, including nation-state attacks on critical infrastructure.

Most importantly, this research produces immediately actionable outputs: reference architectures, assessment tools, design patterns, and implementation guides enabling organisations to begin improving their resilience posture during the research period rather than waiting for completion.

Research Readiness

Research has already commenced with weekly incident analysis and prototype development. A functional demonstration website implementing defensive coding principles has been built, embodying the core philosophy: zero external dependencies for critical infrastructure.

The candidate brings both academic credentials and three decades of relevant professional experience, providing unique positioning to bridge the gap between theoretical research and practical implementation in regulated environments.