Thursday, August 27, 2020

The x509 certificate errors mystery finally resolved

This is probably one of those weird bugs that lingered for so long because nobody on my team had any idea why it was happening...

Background

We migrated a UI portal to Microsoft Azure as a (Web) App Service. The new site uses Azure Active Directory B2C as the front-end authentication mechanism.

Shortly after it went live, we received various support calls reporting that authentication did not work.

After restarting the site, the system was back to normal and working as expected.

Further investigation of the error log suggested that some x509 certificate trust problem was happening. What's interesting is that the trust was broken on the AAD B2C authentication front.

A few weeks later, the same problem happened again, and again a restart resolved it. The same issue recurred every 3-4 weeks, and each time a restart of the App Service cleared it.

We were in contact with Microsoft Azure support, and after many rounds of meetings it seemed that no one had any clue why this was happening. It was as if the Azure App Service didn't trust the SSL/TLS certificate from Azure AD B2C.

We tried various things, including enforcing the SSL/TLS version with SecurityProtocolType.Tls12, adding a custom System.Net.ServicePointManager.ServerCertificateValidationCallback to log additional info, and even trusting the cert blindly (Note: Do Not Try this on your PROD environment...). None of these changes resolved the issue. Without fail, every 3-4 weeks the same issue happened again, and it usually went away after a restart.

The Azure Support team suggested that we add an alert monitor so that if the same issue happened again, it would trigger the restart automatically. This we did, and boy, it worked fine... Yet the mystery remained...

That was 3 years ago...

Finally, the mystery is solved

Out of the blue, while investigating a bug in one of the reporting modules, something caught our eye: that reporting module also sets a System.Net.ServicePointManager.ServerCertificateValidationCallback function. A closer look at the function revealed that it checks for a specific certificate thumbprint in the x509 certificate chain. If the thumbprint is not found, it returns a certificate error.
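To make the pattern concrete, here is a rough reconstruction of what such a callback might look like. This is only a sketch, not the module's actual code: the class name, method names and the all-zero thumbprint are hypothetical placeholders, and it assumes .NET Framework, where ServicePointManager governs HttpWebRequest-based calls.

```csharp
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

static class ReportingModuleTls
{
    // Hypothetical placeholder; the real thumbprint is not given in this post.
    const string PinnedThumbprint = "0000000000000000000000000000000000000000";

    public static void Install()
    {
        // Plain assignment to a static property: this replaces whatever
        // callback (or default validation) was in effect before.
        ServicePointManager.ServerCertificateValidationCallback = ValidateByThumbprint;
    }

    static bool ValidateByThumbprint(
        object sender, X509Certificate certificate,
        X509Chain chain, SslPolicyErrors errors)
    {
        if (chain == null)
        {
            return false;
        }

        // Accept only when some certificate in the chain carries the pinned thumbprint.
        foreach (X509ChainElement element in chain.ChainElements)
        {
            if (string.Equals(element.Certificate.Thumbprint, PinnedThumbprint,
                              StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }

        // Every other server is rejected, however valid its certificate is.
        return false;
    }
}
```

Note that nothing in this sketch looks at which host is being called, so the pin applies to every connection that consults the callback.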

After a few calls and discussions with the other dev team, it turned out that the dev/QA environment has a self-signed certificate that this module uses, and that the dev who wrote this function copied it from another project. In that project, the code was explicitly written to trust only a specific root CA certificate thumbprint (for security/privacy checking, to ensure the service host is legitimate, etc.).

Once this module is loaded into memory, its ServerCertificateValidationCallback assignment "overrides" the default validation and basically renders all other certificates untrusted, including those from Azure AD B2C. The callback is a static, process-wide setting, so it stays in effect until the App Service restarts, which is why a restart always cleared the problem. This specific reporting module is mostly used for monthly report generation, so the problem only showed up every 3-4 weeks, when a user wanted to generate the monthly reports.
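In hindsight, a small diagnostic like the one below might have pointed us at the culprit much earlier. It is only a sketch (the class and method names are made up for illustration, and it again assumes .NET Framework): it reports which type and assembly own the callback that is currently installed.

```csharp
using System;
using System.Linq;
using System.Net;

static class CallbackDiagnostics
{
    // Describes whoever currently owns the process-wide validation callback.
    public static string DescribeInstalledCallback()
    {
        Delegate callback = ServicePointManager.ServerCertificateValidationCallback;
        if (callback == null)
        {
            return "none (default validation)";
        }

        // A delegate can hold several targets; list each one with its
        // declaring type and the assembly it lives in.
        var owners = callback.GetInvocationList().Select(d =>
            string.Format("{0}.{1} [{2}]",
                d.Method.DeclaringType != null ? d.Method.DeclaringType.FullName : "?",
                d.Method.Name,
                d.Method.Module.Assembly.GetName().Name));

        return string.Join(", ", owners);
    }
}
```

Writing this string to the error log whenever an authentication call fails with a certificate error would show whether some unexpected assembly has taken over validation.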

Once we had identified the suspected root cause, we had the QA team repeat the same steps on the QA environment. Bingo, we could reproduce the same problem... this was indeed the root cause of the x509 cert error!

We applied a hotfix to the module. Since then the site has been relatively stable.

Afterthoughts

It seems to me that System.Net.ServicePointManager.ServerCertificateValidationCallback can be very dangerous. It does provide a "work-around" for certificate trust issues, yet because it is a single static setting shared by the whole process, you may have no idea that another module or component has changed the behavior of the callback function.
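One way to limit the blast radius is to attach the custom check to the individual request instead of the global setting. The sketch below is not the hotfix we shipped; it is just an illustration, assuming .NET Framework 4.5 or later, where HttpWebRequest has its own ServerCertificateValidationCallback property. The class name, method name and thumbprint are hypothetical.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

static class ScopedPinning
{
    // Hypothetical placeholder for the dev/QA self-signed certificate.
    const string PinnedThumbprint = "0000000000000000000000000000000000000000";

    public static HttpWebRequest CreateReportRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Per-request callback: only this request is affected, so other
        // calls (e.g. to Azure AD B2C) keep the default validation.
        request.ServerCertificateValidationCallback = (sender, certificate, chain, errors) =>
        {
            // Certificates that already pass normal validation are accepted.
            if (errors == SslPolicyErrors.None)
            {
                return true;
            }

            // Otherwise accept only a chain containing the pinned thumbprint.
            return chain != null && chain.ChainElements
                .Cast<X509ChainElement>()
                .Any(e => string.Equals(e.Certificate.Thumbprint, PinnedThumbprint,
                                        StringComparison.OrdinalIgnoreCase));
        };

        return request;
    }
}
```

The trade-off is that every caller has to create its requests through a helper like this, but a mistake in the callback can no longer break unrelated connections elsewhere in the process.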