A well-built business continuity plan (BCP) is critical for financial institutions. In this Q&A, we answer questions received during our Ultimate Business Continuity Q&A webinar. Read on as we address common questions on business continuity management, including regulatory expectations, best practices for testing, vendor resilience, and more to help you strengthen your BCP and ensure operational resilience.
Q: What are regulatory expectations and best practices on the scope of testing for business continuity?
Steve Fochler: Broadly speaking, regulatory expectations (as outlined in the FFIEC IT examination handbook) emphasize:
Remember, the key to a successful business continuity plan (BCP) is regularly testing, updating, and improving your plan based on new insights and changing business environments. Those who practice are better prepared!
Q: Where do I start with building a BCP from ground zero?
Steve: Starting from scratch can be an advantage. We recommend following a defined process for successful development of Business Continuity Management (BCM) programs. This typically includes:
It's crucial to automate most of the process to keep the program current and stay ahead of ever-changing risks. Consider using specialized software like Ncontinuity to streamline this process. Ncontinuity has features which shorten development time.
Q: What are best practices for individual BIA functions?
Steve: You want to strike a balance between BIA detail at the business process level and at the department level. Ensure your BIA is complete at the department level. At the business process level, capture recovery requirements such as Maximum Tolerable Downtime (MTD), Recovery Time Objective (RTO), and Recovery Point Objective (RPO), as well as impact of loss and workarounds.
Related: Business Continuity Planning vs. Disaster Recovery: the Difference
Q: How can you test business operations resiliency?
Steve: It depends on whether your critical systems supporting operations are on premises, in the cloud, or a hybrid.
For example, if your core is cloud-based and your branch network is independent of HQ, then a branch may be a good alternate location if corporate headquarters is unavailable, since it can still reach the cloud-based core and all other critical applications. To test this, key personnel might travel to the most remote branch location, log in to its systems, and capture screenshots as proof. It’s a simple example, but it would test operational resiliency in the event of loss at corporate facilities.
Just as important, test alternative processing methods as though the primary method were down or unavailable. Make sure manual workarounds actually work. During a ransomware attack, manual workarounds will be the only way your organization can continue to operate.
Q: Can you share some examples of business operations resiliency testing?
Steve: There are two primary ways plans are tested: tabletop exercises and functional exercises.
Desktop or tabletop exercises are the easiest to conduct because they are primarily a discussion. Functional exercises include testing the systems and resources required for recovery, such as alternate processing systems, platforms, and sites.
Conducting a tabletop exercise involves discussions of manual workarounds for critical business processes supported by systems which may be impacted by the disaster scenario selected.
Related: 9 Steps to an Effective Tabletop BCP Test
Start out by producing a list of your critical business processes (those with an MTD of 24 hours or less). Walk through this list during the tabletop, and have function owners discuss the impact of loss and their manual workarounds.
For the processes of highest concern, conduct a functional exercise testing the manual workaround during the tabletop. For instance, as a transaction (deposit, wire transfer, etc.) is conducted, have staff write the transaction down on paper. Or write the transaction on paper and enter it into the system later. After using this process for one day, review the results to understand the volume, backlogs, delays, and other issues.
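As a minimal sketch of building that list, assume each BIA record carries a process name, an owner, and an MTD in hours. The field names and sample values below are hypothetical and not taken from any particular BIA tool.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    owner: str
    mtd_hours: float  # Maximum Tolerable Downtime, in hours


# Threshold from the guidance above: an MTD of 24 hours or less is critical.
CRITICAL_MTD_HOURS = 24


def critical_processes(processes, threshold=CRITICAL_MTD_HOURS):
    """Return processes at or under the MTD threshold, most urgent first."""
    critical = [p for p in processes if p.mtd_hours <= threshold]
    return sorted(critical, key=lambda p: p.mtd_hours)


# Hypothetical sample data, for illustration only.
sample = [
    BusinessProcess("Wire transfers", "Operations", 4),
    BusinessProcess("Marketing campaigns", "Marketing", 168),
    BusinessProcess("Deposit processing", "Branch Ops", 24),
    BusinessProcess("Vendor invoicing", "Finance", 72),
]

agenda = critical_processes(sample)
for p in agenda:
    print(f"{p.name} (owner: {p.owner}, MTD: {p.mtd_hours}h)")
```

Sorting by MTD puts the least tolerant processes at the top of the tabletop agenda, so they get discussed even if the session runs short.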
Another functional exercise example is where systems that support critical business processes are tested by accessing the application by alternative means. Examples include logging in from another branch location, accessing a system restored to another server or recovery site, or participating in a third-party system recovery test to confirm their disaster recovery plan.
Don't forget to include evacuation or shelter-in-place drills.
Finally, make sure your crisis or executive management team is practicing emergency meetings, even if they just include these conversations during their executive meetings. Like anything in life – learning how to ride a bike, drive a car, golf, football, etc. – the more you practice something like disaster recovery, the better you will be.
Q: How can I better prepare for testing BCP and building tasks into the events without telling the people involved what the "test scenario" will be before the test?
Steve: It starts with the workarounds associated with the critical business function or system being tested. The workarounds should be manual in nature and must work irrespective of what caused the disruption to the business process or system.
You can tell the people involved that you will be asking about their manual workarounds and how long those workarounds can be sustained before the backlog becomes too large to catch up on.
Q: What are the advantages and disadvantages of desktop testing versus other types of testing?
Steve: Desktop or tabletop exercises are primarily discussion-based and are the easiest to conduct. Tabletops are easier to organize and require fewer resources compared to functional exercises that involve testing actual systems. However, functional exercises provide a more realistic test of your recovery capabilities.
Q: How do you test a plan when all of your staff works remotely?
Steve: Test just like you work now. For instance, use online meetings to conduct tabletop or functional exercises.
You’ll also need a plan if your online meeting service (MS Teams, Zoom, GoToMeeting, etc.) is down. How will you communicate? Text? Email?
Then practice that scenario. Conduct your tabletop exercise virtually but have that conversation where your online meeting service provider is down or there’s a regional internet outage.
Free download: Work-From-Home Risk Assessment
Q: Can unexpected incidents (like internet outages or software program outages) count as testing?
Steve: While these incidents can provide valuable insights, they shouldn't replace planned testing. Use these unexpected events to supplement your regular testing program and identify areas for improvement.
Q: What types of pandemic tests should we run? They say pandemics vary, but in the end won’t they all end up with similar outcomes?
Steve: Two key pandemic-related impacts to test are remote work (as we experienced with the COVID lockdowns) and – maybe more important – cross-training personnel for critical business functions. If, for instance, your wire team is sick, you need to make sure someone else can step in to run wire operations until these people are well.
Related: Business Continuity Management vs Pandemic Planning
Q: What are some scenarios other financial institutions are using?
Steve: The most popular scenarios I see clients using right now are long-term outages of critical systems and associated infrastructure, forcing critical business function owners to think of manual workarounds to be implemented for a lengthy period. Examples include ransomware events with core systems, regional internet outages affecting all systems, and ice storms knocking out power beyond the capacity of generator fuel.
Q: Do you see business continuity needs changing for financial institutions due to all of the security breaches?
Steve: Absolutely. Recent security breaches are having a major impact on business continuity. This is mostly associated with manual workaround planning when systems supporting critical business processes are down.
Not only should you develop manual workarounds, but these manual workarounds will need to be able to sustain the function for longer periods of time than typically expected (such as weeks versus hours or a few days).
The bottom line is that business continuity needs to work with your cyber resiliency program, which encompasses all aspects of your cybersecurity program. See the NIST Cybersecurity Framework (CSF) 2.0 for more information.
Related: What Bankers Need to Know About NIST 2.0
Q: What are best practices to correct and alleviate maximum tolerable downtime (MTD) gaps found throughout the BCP process?
Steve: Identifying the gaps is the first step. Then you have two options: You can mitigate them or temper expectations.
For mitigation, you can use the gaps you identify as ammunition when asking executive management and the board of directors for more funds for better system resiliency solutions. The goal would be to reduce the actual MTD to align with business function owners' expectations.
If this isn’t possible, then work with the business process owner to identify manual workarounds that will work for the actual MTD period of time, which is longer than their expectation. This is what I mean by tempering their expectations.
You may not be investing in better redundancies, but you're still responsible for recovering your business process, so think through and develop manual workarounds.
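To make the gap-finding step concrete, here is a minimal sketch that compares each function owner's expected MTD with the actual MTD shown by testing. The record structure and sample figures are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRequirement:
    process: str
    expected_mtd_hours: float  # what the business function owner expects
    actual_mtd_hours: float    # what testing shows recovery really takes


def mtd_gaps(requirements):
    """Return (process, gap_hours) where actual exceeds expected, largest gap first."""
    gaps = [
        (r.process, r.actual_mtd_hours - r.expected_mtd_hours)
        for r in requirements
        if r.actual_mtd_hours > r.expected_mtd_hours
    ]
    return sorted(gaps, key=lambda g: g[1], reverse=True)


# Hypothetical sample data, for illustration only.
sample = [
    RecoveryRequirement("Wire transfers", 4, 24),
    RecoveryRequirement("Online banking", 8, 8),
    RecoveryRequirement("Loan origination", 24, 72),
]

gaps = mtd_gaps(sample)
for process, gap in gaps:
    print(f"{process}: recovery exceeds expectation by {gap} hours")
```

Each gap in the output is a candidate either for mitigation funding or for a manual workaround that can cover the extra hours.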
Related: Business Resiliency: Your Guide to Business Continuity Management
Q: What are the three most important factors to consider when performing third-party risk assessments?
Steve: The three most important things to consider are:
Focus on number 1. After that, risk factors 2 and 3 will take care of themselves, in my opinion.
Q: How are companies addressing the risk of unavailability with large third-party service providers supplying critical services such as email, multi-factor authentication (MFA), and customer-facing applications through the cloud?
Steve: Most are not addressing it, and it can be a major problem.
The recent CrowdStrike outage disrupted many businesses that run Microsoft Windows, which most of us use on a regular basis. Unfortunately, we don't address these types of large third-party service provider outages until they happen. By then it's a crisis and potentially too late.
The best thing you can do is to push the limits of your testing to include widespread, long-term outages of your critical systems. Determine what the alternative processing mode will be, including potentially manual workarounds, then actually test these alternatives and workarounds to prove they will work.
Q: What is the relationship between BCP and Banking as a Service (BaaS)?
Steve: When providing services through a fintech, you still have an obligation to protect these services the way you would for traditional customers and members. Apply your business continuity principles to BaaS, including BIA, risk assessment, planning, and, of course, testing.
Q: Many service level agreements (SLAs) are lax and light in fiscal or scope accommodations if/when a breach happens. What can you do during contract negotiations to protect customers and the institution?
Steve: Service level agreements are lax if you allow the vendor to dictate the contract terms. You need to negotiate contract terms and conditions which protect your customers’ or members’ data and ensure SLAs clearly define recovery requirements such as MTD, RTO and RPO.
Demand the ability to receive the third party’s test results so you can identify their actual recovery times, RTO and RPO. Compare these against the department's expectations for recovery of these third-party systems. If it isn't in the contract, lean on the vendor to get this information. If they still push back, leverage your examiner and audit firms to give you the ammunition to force receipt of these testing results. Most critical vendors will comply before you get to this point.
Q: With increased reliance on third-party and cloud service providers, what recommendation do you have for establishing, monitoring, and reporting key risk indicators (KRIs) and SLAs aligned to resiliency and business continuity planning?
Steve: Our reliance on third parties continues to grow, and the associated risks grow with it as you relinquish control over your data and systems.
It's a trust-but-verify relationship, with the verify portion being a vital process. From a key risk indicator standpoint, you'll need to develop measurements based on percentages. What percentage of your critical business processes are supported by systems that are outsourced?
Then you’ll weight these. For example, if your core processing is outsourced, this will have a heavier weight than other systems.
Finally, you can create a KRI around recovery requirement compliance rates. An example is the percentage of third-party recovery time objectives (RTOs) that meet or exceed your expectations for recovery, based on the test results the third parties have documented and provided to you as verification. It’s the same with RPO.
If you can participate in their tests, do it. This increases your comfort level relative to verification.
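The two KRIs described above can be sketched as follows. This is one possible formulation, not a prescribed method: the weights, field names, and sample systems are hypothetical, and systems without documented vendor test results are left out of the compliance rate, which is itself an assumption you may want to change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CriticalSystem:
    name: str
    outsourced: bool
    weight: float  # relative importance; core processing gets a heavier weight
    expected_rto_hours: Optional[float] = None  # your recovery expectation
    tested_rto_hours: Optional[float] = None    # from vendor-provided test results


def weighted_outsourced_pct(systems):
    """Weighted percentage of critical systems that are outsourced."""
    total = sum(s.weight for s in systems)
    if total == 0:
        return 0.0
    outsourced = sum(s.weight for s in systems if s.outsourced)
    return 100.0 * outsourced / total


def rto_compliance_pct(systems):
    """Percentage of outsourced systems whose tested RTO meets the expectation.

    Assumption: systems with no documented test result are excluded.
    """
    tested = [
        s for s in systems
        if s.outsourced
        and s.expected_rto_hours is not None
        and s.tested_rto_hours is not None
    ]
    if not tested:
        return 0.0
    compliant = sum(1 for s in tested if s.tested_rto_hours <= s.expected_rto_hours)
    return 100.0 * compliant / len(tested)


# Hypothetical sample data, for illustration only.
systems = [
    CriticalSystem("Core processing", True, 5, expected_rto_hours=4, tested_rto_hours=3),
    CriticalSystem("Email", True, 2, expected_rto_hours=8, tested_rto_hours=12),
    CriticalSystem("Online banking", True, 2, expected_rto_hours=4, tested_rto_hours=4),
    CriticalSystem("Teller platform", False, 3),
]

print(f"Weighted outsourced share: {weighted_outsourced_pct(systems):.1f}%")
print(f"Third-party RTO compliance: {rto_compliance_pct(systems):.1f}%")
```

The same compliance calculation applies to RPO by swapping in expected and tested RPO values.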
Q: How do you determine resilience for a vendor if they are not deemed critical from a vendor management standpoint? What if there is little to no due diligence completed since no protected information is shared but the vendor is critical to a business line for ongoing operations?
Steve: Criticality determines the level of ongoing due diligence you will need for your vendors. Due diligence requirements shouldn't be limited to those vendors handling nonpublic personal information (NPI). If a vendor is critical to a business line (such as an internet provider or email), it could still be a critical vendor requiring significant due diligence.
Q: How do you make sure all key processes and vendors are identified?
Steve: Meet with department subject matter experts and conduct tabletop exercises discussing major outages (power, internet, etc.). Incidents that expose issues can also help identify overlooked processes and vendors.
Q: What should we think about when it comes to exit strategies in BCP?
Steve: Assuming you're talking about exit strategies for executive management or key personnel, at a minimum you need multiple personnel cross-trained in critical business functions.
For key positions such as chief information officer (CIO), chief information security officer (CISO), and chief financial officer (CFO), it’s important to identify a replacement candidate with proper skill sets in advance. Some organizations are considering outsourcing to a visiting CISO instead of trying to hire a replacement.