Tag Archives: Disaster Recovery

What Is a Disaster Recovery Plan and Why Does Your Organization Need One?

Disaster recovery plans are an important part of any organization’s risk management process. A disaster recovery plan helps ensure that your organization will be able to continue functioning in the event of a disaster, reducing downtime and the resulting loss.

Disaster recovery planning is a process that requires careful consideration of the risks your organization faces and the likelihood of those risks materializing. It also requires that you think about the consequences of those risks coming to fruition.

Disaster recovery plans are flexible, and they can be put into place in a variety of ways. No two organizations have the same needs or the same capabilities, so a disaster recovery plan is something that you’ll need to think long and hard about before you can come up with a solution that’s right for your organization.

***

The first step is to identify the risks that your organization faces and then estimate the likelihood of each of those risks coming to fruition. This will help you to determine what your disaster recovery plan needs to cover. Some of the risks that most organizations face are:

  • Natural disasters (such as floods, hurricanes, earthquakes)
  • Man-made disasters (such as building fires or explosions)
  • Human error (such as accidental deletion of data or misconfiguration)
  • Internal and external security breaches
  • Technical problems (such as power outages or electrical surges)

Once you’ve identified the risks and potential consequences, you’ll be able to determine what your disaster recovery plan needs to cover. This is a two-step process, since you’ll need to figure out how to prevent the disaster and then what to do if the disaster does occur.

Step one is to prevent the disaster from occurring, which is best done by conducting a risk assessment. This will help you to determine what you can do to prevent the disaster from happening.

Step two is to develop a plan of action in case the disaster does occur. This means figuring out what you need to do to minimize the damage and get back to normal business operations as quickly as possible.

A disaster recovery plan helps an organization to recover quickly from a disaster. It also helps an organization to minimize the impact of the disaster, which reduces the resulting downtime and loss.

The key to creating a successful disaster recovery plan is to identify the risks your organization faces and to conduct a thorough risk assessment. This will help you to create a plan that will minimize the impact of any future disasters.
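One common way to make the risk assessment described above concrete is to rate each risk for likelihood and impact and rank the results. The sketch below assumes a simple 1 to 5 scale and a likelihood-times-impact score; the `Risk` class, the scale, and the example ratings are illustrative assumptions, not figures from any real assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Simple likelihood-times-impact rating.
        return self.likelihood * self.impact


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical ratings, for illustration only.
risks = [
    Risk("Natural disaster", likelihood=1, impact=5),
    Risk("Security breach", likelihood=3, impact=4),
    Risk("Human error", likelihood=4, impact=2),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.score}")
```

The risks at the top of the ranking are the ones the plan should cover first. Your organization may well prefer a different scale or weighting.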

***

Disaster recovery in the real world

Recently, many organizations around the world were impacted by a Salesforce outage that lasted close to 4 hours. During this time, users were unable to access the tool and many of Salesforce’s own websites (including its status page) were offline or inaccessible.

If your organization depends on Salesforce to communicate with its clients, like mine does, this outage was nothing short of catastrophic. While we were able to “get by” through alternative methods and could continue working, our efficiency was significantly impacted.

Fortunately, we were able to inform our clients of the disruption, so there was some understanding, but in the early stages of the outage no one really knew what to do. Some key takeaways for us were:

  • Explore backup communication channels to inform clients. This includes our own internal status pages as well as distribution lists and other in-app communication tools.
  • Launch internal triage channels faster with specific teams/individuals to ensure that mitigation steps are being taken faster. Consider creating a cross-company “emergency task force”.
  • Provide alternative access to systems for key teams and individuals (we found during the outage that Salesforce Classic was still accessible while Lightning was not). We were also able to continue communicating directly via email, but this wasn’t ideal because multiple agents could access the same message.
  • Have an easily accessible list of names and numbers for key members of the team. This should be stored in a shared location in multiple systems (Salesforce Knowledge, Google Drive, Confluence, etc.) and also printed as a hard copy in case network access is unavailable too.

With the recent Canada-wide Rogers network outage in April still fresh, this new Salesforce outage makes it clear how dependent we’ve all become on technology. Perhaps more importantly, it also highlights the importance of not ignoring disaster recovery planning, as the next outage could come out of nowhere.

The Impact of “Unplanned Work” on Your Business

You can’t feel it, you can’t see it, you can’t touch it, but unplanned work is silently killing your business. How many times have you finished your day at work exhausted, yet unable to cross anything off your list of high-priority to-do items? This can make you feel robbed, because it is impossible to tell where your time went. It is a sign that you are falling victim to unplanned work.
 
Unplanned work does not show up in our performance metrics, so it is difficult to analyze. Its impact is great, however. It can mask dependencies and block or stall important priorities. Risk accumulates across your system as you receive more work and start it late. That risk then breeds further risks in the form of unknown dependencies, neglected work, too much work in progress (WIP), and conflicting priorities. This can jeopardize your organization’s ability to deliver predictably.
 
 

Effects of Unplanned Work

You can encounter unplanned work in two different flavors:
  • The requirements that you have to change in the middle of a project
  • The defects you find during testing

You can deal with both of them if you have an agile team that is run well. It is important to understand the cause of unplanned work if you want to make your project cheap and predictable.

 
It is also important to understand that unplanned work steals valuable time from planned work. Most people see unplanned work as a norm instead of seeing it as a significant problem. 

When solving a problem, you should make sure that it will not occur again. If you treat only the symptoms rather than the underlying cause, the problem is likely to recur. A root cause that is left unattended leads to more issues, which consume more resources, and this can make things worse.
 
Every organization, department, team, and individual should measure the amount of unplanned work being performed. Even before you try to solve the underlying problems, measuring gives you a sense of how much time goes to work that adds no value to the business and is instead spent just maintaining the status quo.
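As a minimal sketch of such a measurement, assume each piece of work is logged with the hours spent and a flag saying whether it was planned. The `unplanned_share` function and the sample log are hypothetical; a real team would pull these figures from its own time-tracking or ticketing tool.

```python
def unplanned_share(entries: list[tuple[float, bool]]) -> float:
    """Return the fraction of logged hours spent on unplanned work.

    Each entry is (hours, planned), where planned is True for planned work.
    """
    total = sum(hours for hours, _ in entries)
    if total == 0:
        return 0.0  # nothing logged yet
    unplanned = sum(hours for hours, planned in entries if not planned)
    return unplanned / total


# Hypothetical one-day log, for illustration only.
log = [
    (3.0, True),   # planned feature work
    (2.0, False),  # diagnosing a failed release
    (1.0, True),   # planning meeting
    (2.0, False),  # urgent customer escalation
]

print(f"Unplanned work: {unplanned_share(log):.0%}")
```

Tracking this number week over week shows whether efforts to reduce unplanned work are having any effect.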
 
Let us consider release management as a concrete example of unplanned work. You take a piece of software, release it to the market, and then spend a lot of time trying to diagnose why the release didn’t work as expected. The time you spend on that diagnosis is unplanned work. Many of these issues can be addressed through automated deployment, continuous integration, and automated testing. While you diagnose and debug why things didn’t work as expected, you can also be proactive and head off a lot of future unplanned work.
 
It is difficult to get rid of unplanned work, but we can learn how to plan for it. Before we discuss in depth how you can handle and mitigate unplanned work, it is important to discuss incident management, problem management, and post-mortems, because they can help you plan for the unplanned.
 

Incident Management and Problem Management

Problem management and incident management are key components of the Information Technology Infrastructure Library (ITIL) service model, and they were created to provide a more streamlined service to consumers.
 

Incident Management

Incidents are events related to customer contacts. They can be account updates, information requests, or issue reports. Incidents can be reported through different channels, including email, phone, and chat, and they can also be generated by automatic monitoring tools. Whatever their source, incidents get routed to your incident management tool. This could be something simple for smaller teams, but larger teams may need enterprise-level tools or custom-built in-house applications.
 
Your business is responsible for reviewing the information about each incident and checking whether a solution is available for the customer. An example of an incident is a customer who wants to change their account password. The helpdesk receives the incident, gathers the necessary information, makes sure the customer has passed all the security checks, facilitates the password change, informs the customer that the password has been changed, and then closes the incident. Some incidents can be managed with automatic tools, while others have to be managed manually.
 

Problem Management

Incident management is repetitive in nature and can get tedious. This can wear out the more skilled employees in your organization, so consider moving them into problem management.
 
Problem management is deeper than incident management. This is where a single problem causes multiple incidents from multiple clients or customers. Problem management needs the best people. The role of these people is to find out why a certain problem happened and find the best ways to fix it and prevent the problem from happening again.
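To illustrate the idea that one problem surfaces as many incidents, the sketch below groups incidents by reported symptom and flags any symptom reported by several distinct customers as a candidate problem. The `problem_candidates` function, the threshold, and the sample data are assumptions made for illustration; real incident tools record far richer data than a symptom label.

```python
from collections import defaultdict


def problem_candidates(
    incidents: list[tuple[str, str]], min_customers: int = 2
) -> dict[str, int]:
    """Map each symptom reported by at least min_customers distinct
    customers to the number of customers affected.

    Each incident is (customer, symptom).
    """
    customers_by_symptom: dict[str, set[str]] = defaultdict(set)
    for customer, symptom in incidents:
        customers_by_symptom[symptom].add(customer)
    return {
        symptom: len(customers)
        for symptom, customers in customers_by_symptom.items()
        if len(customers) >= min_customers
    }


# Hypothetical incident log, for illustration only.
incidents = [
    ("acme", "login failure"),
    ("globex", "login failure"),
    ("acme", "password reset"),
    ("initech", "login failure"),
]

print(problem_candidates(incidents))
```

A symptom that appears here is only a starting point. The problem management team still has to find out why it happened and how to prevent it from happening again.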
 

Post-mortem

A post-mortem is usually performed after a project has been concluded. The process helps to determine and analyze which elements of the concluded project were successful and which were not. The main purpose of a project post-mortem is to inform process improvements that mitigate future risks. In this way it promotes best practices and helps to manage risk in an organization.
 

Mitigating unplanned work

As we have already seen, unplanned work is time-consuming and expensive, and it can negatively impact other projects in your organization.
 
It can also drain your skilled people, who are often the rarest and most important resource in your organization. This is why unplanned work hurts so much. Two main methods you can use to mitigate unplanned work are:
  • Widening the bottleneck, so that configurations can move downstream and upstream without tying down your constrained skill base.
  • Increasing the flow of communication between production and development about relevant changes.

 

Widening the bottleneck turns hours of work into a few minutes. If something goes wrong, you can always roll back, which mitigates the impact of unplanned work. You can widen the bottleneck by creating builds of configurations, automating the migration of configurations, and creating other automated jobs.
 
When the flow of communication is increased, your developers will know about any changes to production that directly impact their activities. They will be able to run comparisons instantly against critical systems and see whether they have to take specific changes into account as they perform their work. This helps to reduce the likelihood of failed deployments.
 

Customer relationship management

This is somewhat self-explanatory. Customer relationship management (CRM) is a strategy a business uses to pursue its objectives while meeting the requirements of its customers.
 
A service strategy can be used to improve customer relationship management within a business and to ensure that the business creates value for both its shareholders and its customers. The strategy ensures that the business organizes its operations appropriately to deliver services that enable its customers to succeed.
 

Document Objectives

 
Business objectives are the results you want to achieve and maintain as you run your business. Every business should have clear and attainable goals.
 
The goals may cover productivity, profitability, customer service, core values, employee retention, growth, change management, marketing, financing, competitive analysis, and more.
 

Document Requirements

If you want to arrive at a solution that will ensure the continuity of your business, you need to map out every application, server, data set, and software solution in your environment. Assign each of them a business-driven downtime tolerance, and map out the application interdependencies as you do this.
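The mapping described above can be sketched as two small tables: a downtime tolerance for each system and the systems each one depends on. The example below uses the standard library's `graphlib.TopologicalSorter` to derive an order in which dependencies are recovered first. It also tightens each dependency's tolerance to that of its most demanding dependent, since a system cannot come back before the things it relies on. The system names, hours, and helper functions are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: business-driven downtime tolerance in hours.
tolerance_hours = {"crm": 4, "billing": 24, "database": 48, "auth": 12}

# Hypothetical interdependencies: each system maps to the systems it needs.
depends_on = {
    "crm": {"database", "auth"},
    "billing": {"database"},
    "database": set(),
    "auth": set(),
}


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return the systems ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def effective_tolerance(
    tolerance_hours: dict[str, int], depends_on: dict[str, set[str]]
) -> dict[str, int]:
    """Tighten each dependency to the strictest tolerance among its dependents."""
    effective = dict(tolerance_hours)
    # Visit dependents before their dependencies so limits propagate downward.
    for system in reversed(recovery_order(depends_on)):
        for dependency in depends_on.get(system, ()):
            effective[dependency] = min(effective[dependency], effective[system])
    return effective


print(recovery_order(depends_on))
print(effective_tolerance(tolerance_hours, depends_on))
```

In this made-up inventory the database nominally tolerates 48 hours of downtime, but because the CRM depends on it and tolerates only 4, the database effectively has to be back within 4 hours as well.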
 
 

Test Your DR Planning

 
Disaster recovery is never a one-time event. It is a constant process, and it must keep up with changes in the environment and with evolving service-level agreements. The data center changes so rapidly that it is almost impossible for change control processes and operations to keep up. It is also very hard to conduct disaster recovery tests frequently enough to be meaningful. This leads most companies to consider disaster recovery monitoring tools that allow near-real-time analysis of their disaster recovery processes and setup.
 

 

Every business should conduct disaster recovery tests to determine which people need more training and which disaster recovery processes need to be refined. Some of the main things a business should have in place for disaster recovery are monitoring, environment awareness, hardware and software independence, and a knowledge base to work from.