July 2022 Rogers outage blamed on human error and management. 24

July 2022 Rogers outage blamed on human error and management.

Random Image

Nearly two years after the infamous July 8th Rogers outage in 2022, an independent review conducted for the Canadian Radio-television and Telecommunications Commission (CRTC) blames human error. The review says the outage was worsened by management and system “deficiencies.”

The CRTC commissioned Xona Partners in September 2023 to review the outage, a summary of which is now available on the CRTC website. Xona also looked at whether Rogers took sufficient steps to prevent another outage. The CRTC plans to release the full report at an unspecified date once it has redacted sensitive information.

Xona’s report summary details how Rogers was working on a seven-phase process to upgrade its IP core network — the outage occurred during the sixth phase of the upgrade process. According to the summary, an error in configuring distribution routers, which involved Rogers staff removing the Access Control List policy filter, caused the core network to be flooded with IP routing information that ultimately crashed the core network.

Staff removed the policy filter in an effort to “clean up” files in the distribution routers, and the “change management process, which includes audits of change parameters, failed to flag the erroneous configuration change.”

Also of note, the review summary says that Rogers’ core network hadn’t been configured with an overload limit, which meant there was no protection mechanism for the core network beyond the policy filter.

Moreover, the review summary notes that Rogers assessed the risk of the network upgrade as ‘high,’ but the company’s risk assessment algorithm downgraded the risk level following the successful completion of the first five phases. The ‘low’ risk rating meant staff weren’t required to conduct additional scrutiny for the upgrades, like conducting lab testing for the change or seeking higher levels of approval.

“The July 2022 outage was not the result of a design flaw in the Rogers core network architecture. However, with both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services,” said the review summary. Additionally, it pointed out that it’s common for service providers to share an IP core network between wireless and wireline services to “balance cost with performance.”

Elsewhere, the review summary looked at factors that impacted network restoration. These include that Rogers’ management network relied on the IP core network, resulting in remote employees being unable to access the management network. “Rogers did not provision its network operation centre and other critical remote infrastructure sites with redundant connectivity from alternative service providers for network management.”

In a similar vein, the review pointed out that Rogers staff relied on the company’s network for their mobile and internet services, so when both networks failed, staff weren’t able to communicate effectively.

Finally, the review said Rogers staff “did not initially have access to the error logs from the failed routers and could not pinpoint the root cause for about 14 hours from the start of the outage.”

All of these factors prolonged the duration of the outage.

The review summary highlighted various measures Rogers took following the July 2022 outage to address deficiencies and improve network resiliency. The company added safeguards to routers in the core network to prevent another overload and also implemented a separate physical and logical management network. Rogers also deployed backup connectivity from third-party service providers to its network operation centre and other critical remote sites.

Other changes Rogers announced include separating its IP core network for wireless and wireline. However, this change remains a “work in progress.”

Moreover, Rogers switched to a new risk assessment algorithm, implemented organizational changes to improve collaboration between network operations and engineering teams, added an enhanced process for introducing new equipment and technology, and more.

Those interested can find the full review summary on the CRTC website. Source: CRTC Via: CBC News.