Sunday, March 26, 2017

ASP.NET WebForm - Mac Safari problems after upgraded to Windows Server 2016

In the past few weeks my team had been struggling to resolve a very specific issue involve an old ASP.NET 3.5 Web Form Application and Safari after upgrading to Windows Server 2016.

Preface: Migration to Windows Server 2016

Earlier this year, one of our client got the budget approval to upgrade their web server from a legacy Windows 2003 Server to the latest Windows Server 2016. The migration ran relatively smooth, and the site was up and running.

The smooth migration went downhill from there. An hour later, our client called us. Apparently some users were having problem with the web site after the migration. Notably majority of those users were using a Mac computer and Safari browser. The symptom was that the user could login to the application, but then most of the navigation links did not work. The browser kept loading and loading and nothing was happening.

We quickly turned to our QA Mac machine and fired up Safari. And no problem. The site loaded up quickly, and everything worked fine.

Let the Investigation Begin

The next day, our client reported that they have more complain from users who had similar problems. All of them using Mac Safari. A good news was that our client managed to replicate the issue on one of their Mac machine. Hence, at least we had something we could reproduce the problem. A start.

Tracking our ASP.NET Web Application Exception Log, there were nothing other than some login failure attempts. No fatal crash on the web app. At least nothing within the ASP.NET application. Good, no blame for our developers, but the problem remained. Furthermore, we had not made any code change before or after the migration, so if our code worked before they should work after.

A web session with our client and screen cast suggested that their Mac was running the latest Mac OS Sierra (10.12), whereas our test Mac machine running earlier version (Yosemite). This kind of explained why our mac machine work while others did not. So is it something wrong with the latest Mac/Safari? More investigation needed.

We changed our User-Agent String on our Mac to match that of the troubled Mac. Everything still work fine. Indeed, we went through every setting to line up our Mac with the client's troubled Mac, and still nothing.

A few Google search later, it seemed like there were quite a few issues across many web sites with the latest Mac Safari with Sierra. Many of them seemed to be involving SSL certificate not being trusted. We quickly used some of the online SSL certificate checker tools to verify the SSL certificate on the server. Everything seemed to be fine. Nothing particularly wrong with it. Furthermore, the troubled SSL trusted chain being reported was associated with a different CA vendor than ours. So it seemed to be unrelated to our problem.

Also we had tested with other browsers on the troubled Mac. It seemed like other browsers work fine. All navigation links worked.

Some old forum discussion suggested that older Safari had some problems with long viewstate on postback. However, according to those discussions most of those issues were resolved by Apple or Microsoft. Furthermore, if it had anything to do with the ViewState, then wouldn't the old server had the same behavior? Yet the problem only appeared on the new 2016 Server with IIS 10 instead of the old 2003 Server with IIS 6.5.

Another web session with our client, and it seemed like the troubled Mac Safari could access the site on our QA server (also a Windows Server 2016) without any problem. It was the new production server that caused problem. Furthermore, it seemed like none of those users had experienced any problems before the migration. Our client, who still had access to the old server, was able to connect to the old server with the Mac and everything seemed to work fine there.

That's Great. Did it mean the problem was between new Safari and Windows Server 2016? Or worst, was it only between new Safari and this particular server?

Another web session with our client, this time we tried to enable SSL on our QA server (which is the only things that different than the production server).

And Bingo, the troubled Safari experienced the same problem as before on our QA server.

Unfortunately, our client needed to have SSL enabled for the whole application. We cannot disable SSL.

Well, at least we could confirm that there was something between the new Safari and Windows Server 2016 w/ SSL, instead of on a specific web server. Moreover, we could test different configuration changes on the QA server without affecting end users.

Investigation Continued, but the cloud getting darker

We enabled IIS Log on both the production server and our QA server in an attempt to collect more information regarding the problem. After looking through the IIS logs for hours, and almost burned away our retina, we managed to find some more information, but not particularly useful nor pin-pointing the problem.

Based on the IIS Logs, it seemed like most of those problems occurred on HTTP GET action, and from Safari browser only. They all got 500 Internal Server Error, which suggested that something was wrong on the server. Furthermore, the logs suggested that the some of the errors had a User-Agent from Safari on iPad.

One interesting thing was that the problem seemed to be occurred more on HTTP GET request than POST request. In ASP.NET WebForm application, basically only the first call to an aspx page would be a GET request. Afterward, all back-and-forth request for the page usually using APS.NET postback, which exclusively used HTTP POST method.

This also meant that this problem was unrelated to those old forum discussion on Safari problem with long viewstate.

At this point, let's gather all the information we know:

  • Problem with latest Safari running on Mac Sierra or iPad.
  • Problem with HTTP GET request over SSL/secure-HTTP connection only.
  • No exception caught in ASP.NET Exception handling events.
  • IIS Log returned 500 Internal Server Error.

Also, both our client and the users were getting impatient with this problem. All we could say was that to use a different browser until we could identify and fix the problem. And I could tell you that they were not liking that answer!

More Google Search, but nothing in particular related to our problem-on-hand. It seemed like unless we could uncover the cause, we were either switching back the site to the old server or to rewrite the whole web app. (well, it was an old application that migrated from .Net 1.1 to 2.0 to 3.5, with a lot of carry-over legacy code. It would be very time consuming if we were to completely rewrite it in .net 4.5 or .net core 1.0)

Light at the end of the tunnel via Failed Request Tracing

One of the Google Search result suggested to install the Failed Request Tracing Module for IIS, which was an IIS feature used to trace errors that might fall outside of the web application domain. We installed the module on our QA Server. We tested it with Firefox and Internet Explorer, and no failed request log was generated. Then we got the troubled safari to make a few request on our QA server. A few clicks on the troubled Safari browser generated over 100 failed error log files on the Failed Request Logs!

The failed request log file was an XML file, and with the provided XSLT we wrote a quick app to convert the XML to a more readable html report. After reviewed a few of those reports, they all pointed to the exact same error:

ModuleNameDynamicCompressionModule
NotificationSEND_RESPONSE
HttpStatus500
HttpReasonInternal Server Error
HttpSubStatus19
ErrorCodeAn operation was attempted on a nonexistent network connection. (0x800704cd)

Good News. We were able to log the failed requests that our users were experiencing.

Overall, the error suggested that the connection between client and server had been dropped before the server sent the response back to the client.

A few Google Search later, it suggested that this specific error had something to do with that ASP.NET took more time than the execution timeout specified in IIS, so IIS had already returned a Server Busy response before ASP.NET returned and send to the next IIS pipeline module.

We checked the httpRuntime executionTimeout, which is set to 300 seconds (5 minutes). This was not exactly a short time for web server timeout setting, which was default to 110 sec (1 minute 50 seconds). The failed log also had a execution time of approximately 1500 milliseconds (1.5 seconds). The 300 seconds should be sufficient. Nevertheless, we increased the timeout to 3000 seconds (50 minutes), and cleared the logs folder to restart the test.

5 minutes later, another 100+ of failed logs. It's not the executionTimeout.

We tried a few of tweaking suggested from Google Search Result. After all there were like 75 results in total for this specific error. Most of those results were directly related to other applications/platforms/systems. None of those tweak made any difference.

We were ready to admit defeat and would like to suggested to switch back to the old 2003 server. We scheduled a call with our client on the next morning. Hence we had about a few hours for our final attempt to solve the problem.

We looked at the server, and decided to disable the dynamic content compression option on the IIS. This is basically the module that reported the error in all those failed request log. We did not think that disabling this module would make any difference. It just that by disabling any non-essential modules might lead us to the real culprit that caused the problem.

Disable the dynamic content compression module, clear the failed logs folder. Restart the test.

2 minutes later, you could hear our team roaring out loud. YESSS!!!

The troubled Safari worked normally on our QA server, after we disable the dynamic content compression in IIS. The failed request logs folder remained empty. No errors.

We did the same thing on the production server to disable the dynamic content compression. The client did a quick test again, and finally the problem was gone.

Conclusion

Based on what we have tested, we have this hypothesis for the problem: The combination of ASP.NET 3.5 ViewState + Dynamic Content Compression + SSL causes the latest Safari to drop the connection prematurely, and has the browser stuck in the loading screen. It may be that the combination created some specific character sequence that cause unexpected behavior on Safari (and only affect the latest Safari).

Well, that may be another investigation for another day. Now everyone in our team can sleep well at night once again!

Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth. -- Arthur Conan Doyle