NTA - C30 Rebooting and Cloud Phonebook – Incident details

No Service Impact Expected experiencing major outage

C30 Rebooting and Cloud Phonebook

Resolved
Major outage
Started 4 days ago
Lasted 2 days

Affected

Secondary Services

Operational from 10:00 AM to 11:16 AM, Major outage from 11:16 AM to 3:14 PM, Partial outage from 3:14 PM to 10:28 AM

Handset Provisioning

Operational from 10:00 AM to 11:16 AM, Major outage from 11:16 AM to 3:14 PM, Partial outage from 3:14 PM to 10:28 AM

Updates
  • Resolved

    We are aware that a few extensions still have problems, and we will resolve those shortly. Please see the RFO (reason for outage) below.

    In the new implementation of our Cloud Phonebook, we decided to use proxies to improve its availability and to add a few new features. We opted for HAProxy, a well-known open-source project that has been in use for many years (it is responsible for handling provisioning, wallboards, MobeX and many internal projects). We also developed an LDAP Proxy that lets us add a cache to improve responses and to provide sorting (something that models like the C30 and below cannot do).
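
As an illustration of the caching and sorting role described above, here is a minimal Python sketch. It is not the actual LDAP Proxy: the class name `SortedDirectoryCache`, the `backend` callable, the `cn` attribute used as the sort key and the 60-second TTL are all assumptions made for the example, and a real proxy would also have to speak the LDAP wire protocol.

```python
import time
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, str]  # hypothetical entry shape, e.g. {"cn": "Alice"}


class SortedDirectoryCache:
    """Caches directory search results and returns them sorted by name."""

    def __init__(
        self,
        backend: Callable[[str], List[Entry]],
        ttl_seconds: float = 60.0,  # assumed TTL, not a value from the incident
    ) -> None:
        self._backend = backend
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, List[Entry]]] = {}

    def search(self, query: str) -> List[Entry]:
        now = time.monotonic()
        cached = self._store.get(query)
        # Serve from the cache while the stored result is still fresh.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise ask the directory once and sort on behalf of the handset.
        entries = sorted(
            self._backend(query),
            key=lambda entry: entry.get("cn", "").casefold(),
        )
        self._store[query] = (now, entries)
        return entries


# Usage with a stand-in backend instead of a real LDAP server.
calls: List[str] = []


def fake_backend(query: str) -> List[Entry]:
    calls.append(query)
    return [{"cn": "zoe"}, {"cn": "Adam"}, {"cn": "bob"}]


cache = SortedDirectoryCache(fake_backend)
first = cache.search("(cn=*)")
second = cache.search("(cn=*)")
names = [entry["cn"] for entry in first]
```

The second `search` call is answered from the cache, so the stand-in backend is queried only once, and the handset receives entries that are already sorted.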

    The solution was tested internally and worked fine. Unfortunately, while moving it to production, we started to receive reports that the Cloud Phonebook was failing and that some devices (A10, B20 and C30) were rebooting.

    After the initial investigation, we found that some behaviours of the new setup were causing connections to be denied, so we improved it to handle an increase in traffic that we couldn't reproduce during the internal tests. However, the rebooting of the devices was unexpected, and the only real hypothesis was that something sitting between the LDAP servers and the devices was creating packets that the devices couldn't handle.

    Initially, we decided to stop using the LDAP Proxy, as it could have been sending the wrong packets, and we started to reprovision the devices. This is a process that can take several hours, because it would not be good to have all the devices coming to the system at the same time.
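
The reason a reprovision takes hours can be shown with a small Python sketch that splits devices into batches and spaces the batches out in time. The function name `staggered_batches`, the batch size of 3, the 10-minute gap and the `ext-N` device names are hypothetical values chosen for the example, not the figures used during the incident.

```python
from typing import Iterator, List, Sequence, Tuple


def staggered_batches(
    devices: Sequence[str], batch_size: int, gap_seconds: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start offset in seconds, batch of devices) pairs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for index, start in enumerate(range(0, len(devices), batch_size)):
        # Each batch begins gap_seconds after the previous one, so the
        # provisioning system never sees every device at the same time.
        yield index * gap_seconds, list(devices[start:start + batch_size])


# Usage: seven hypothetical extensions, three per batch, ten minutes apart.
devices = [f"ext-{n}" for n in range(1, 8)]
plan = list(staggered_batches(devices, batch_size=3, gap_seconds=600))
```

With these example values the last batch starts 20 minutes after the first. The same arithmetic applied to a full estate of handsets is why the process runs for several hours.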

    After the C30 devices were reprovisioned, we kept monitoring the systems and took note of reports. There was also a possibility that the devices still rebooting were the ones we could not reach for reprovisioning, because they were in use at the time we sent the requests.

    After more internal checks we noticed that this was not the case, so the only other piece of software that could be causing the reboots was HAProxy itself. The fact that it was affecting only the old Hosted models probably meant that they couldn't handle some of the packets being sent to them.

    We started to reprovision all devices to connect directly to the LDAP servers, avoiding any proxies in the middle. After this was done, no more reboots were detected. This means that we can't use HAProxy with the LDAP servers.

    However, this doesn't mean that we will not implement our LDAP Proxy again in the near future. Since the LDAP Proxy was behind HAProxy, it is possible that the LDAP Proxy was not failing at all, but was simply affected by the same HAProxy issues.

    This implementation will be done as a slow-paced rollout to avoid causing more issues with Cloud Phonebook access in the future. The new LDAP Proxy is a really important part of our new Cloud Phonebook strategy, because it will make the Dir requests much quicker and will add sorting support for models A, B and C30.

    We really do apologize for this unexpected issue and will keep you updated on our plans in the future.

  • Update

    Yealink handsets will be reprovisioned after 18:00 tonight, which should resolve the ringtone issues.

  • Monitoring

    We have implemented a fix and are currently monitoring the result. We are now sending the connections directly to the LDAP server, bypassing HAProxy and removing any obstruction in the chain.

  • Update

    We will be reprovisioning the C30 devices first. After that it will be A, B, E, F, and then C33 and Ds, which will take one hour per group.

    After this, the number of entries in the Cloud Phonebook will be worked on, followed by the ringtones.

  • Identified
    We are aware of the C30 handsets rebooting/cutting off. This has been identified as an issue with LDAP. To clear it, please reset the handset by pressing the OK button for 10 seconds, which clears the configuration out of the handset. This is also causing Cloud Phonebook issues on other handsets, and we are continuing to work on a fix for this incident.