Today a colleague approached me with an interesting problem. The SharePoint farm consists of two servers, both of which serve as web front-end servers. One of the two also acts as the application server. When he opens a specific site collection, the first SharePoint node returns the page as expected; the other node, however, just returns a 404.
A first look into the problem
As usual, I fire up my ULS log viewer and start collecting information from the farm. Because I know how to reproduce the error, I simply need to filter the message column for the site collection URI that returns the 404 error. With the timestamp of the failing request and the filtered ULS log, finding the correlation ID that recorded the event is a piece of cake.
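The same filtering can be done without a viewer. The following is only a rough sketch, assuming the log file uses the tab-delimited layout with a header row containing Timestamp, Message and Correlation columns; the function name, the sample entries, the host name and the correlation IDs are all made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout of a ULS entry, e.g. "01/15/2024 10:30:02.10".
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S.%f"


def find_correlation_ids(lines, url_fragment, requested_at, window=timedelta(seconds=5)):
    """Return correlation IDs of entries that mention url_fragment close to requested_at."""
    rows = iter(lines)
    # Column names are padded with spaces in the file, so strip them.
    header = [name.strip() for name in next(rows, "").split("\t")]
    if not {"Timestamp", "Message", "Correlation"} <= set(header):
        return []
    found = []
    for line in rows:
        fields = [value.strip() for value in line.rstrip("\r\n").split("\t")]
        if len(fields) != len(header):
            continue  # skip lines that do not match the header layout
        entry = dict(zip(header, fields))
        try:
            # Continuation entries carry a trailing "*" on the timestamp.
            stamp = datetime.strptime(entry["Timestamp"].rstrip("*"), TIMESTAMP_FORMAT)
        except ValueError:
            continue
        if abs(stamp - requested_at) > window:
            continue
        if url_fragment.lower() in entry["Message"].lower():
            correlation = entry["Correlation"]
            if correlation and correlation not in found:
                found.append(correlation)
    return found


# Made-up sample data, only to show the expected shape of the input.
SAMPLE_LOG = "\n".join([
    "\t".join(["Timestamp", "Process", "TID", "Area", "Category",
               "EventID", "Level", "Message", "Correlation"]),
    "\t".join(["01/15/2024 10:30:02.10", "w3wp.exe", "0x0F10", "SharePoint Foundation",
               "General", "0000", "High",
               "Request failed for https://sharepoint.example/sites/5",
               "aaaaaaaa-0000-0000-0000-000000000001"]),
    "\t".join(["01/15/2024 10:30:02.35", "w3wp.exe", "0x0F10", "SharePoint Foundation",
               "General", "0000", "Medium",
               "Unrelated entry",
               "bbbbbbbb-0000-0000-0000-000000000002"]),
    "\t".join(["01/15/2024 11:45:00.00", "w3wp.exe", "0x0F10", "SharePoint Foundation",
               "General", "0000", "High",
               "Request failed for https://sharepoint.example/sites/5",
               "cccccccc-0000-0000-0000-000000000003"]),
])

if __name__ == "__main__":
    request_time = datetime(2024, 1, 15, 10, 30, 2)
    print(find_correlation_ids(SAMPLE_LOG.splitlines(), "/sites/5", request_time))
```

The time window keeps entries for the same URL from earlier, unrelated requests out of the result, which mirrors filtering by timestamp in the viewer.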
The findings in the ULS log didn’t give away a lot of information:
Could not create the site from the context
An unexpected error occurred while toggling web parts for https:///sites/5 site. Error: System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.SharePoint.SPSite.get_SqlSession()
   at Microsoft.SharePoint.SPFeatureCollection.EnsureFeaturesData()
   at Microsoft.SharePoint.SPFeatureCollection.get_Item(Guid featureId)
   at custom.Services.customSettingsService.EnsureBackupListFeature(SPSite site)
   at custom.Services.customSettingsService.ToggleWebParts(SPSite site, List`1 selectedWebParts, Boolean featureReceiverCall)
--> custom.Commands.Exceptions.IPIException: An unexpected error occurred while toggling web parts for https:// /sites/5 site. Error: System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.SharePoint.SPSite.get_SqlSession()
   at Microsoft.SharePoint.SPFeatureCollection.EnsureFeaturesData()
   at Microsoft.SharePoint.SPFeatureCollection.get_Item(Guid featureId)
   at custom.Services.customSettingsService.EnsureBackupListFeature(SPSite site)
   at custom.Services.customSettingsService.ToggleWebParts(SPSite site, List`1 selectedWebParts, Boolean featureReceiverCall)
Luckily, Windows also has the event log, which might hold some useful information. It’s always a good idea to check the event logs on all servers in the SharePoint farm. And there were three entries with the same timestamp as my ULS event.
An exception occurred when trying to issue security token: The server was unable to process the request due to an internal error. Event Id: 8306
Now that’s something to work with. It seems the web front-end server cannot render the .aspx page because the application server had trouble issuing the security token. So I open the SecurityTokenServiceApplication web service in a browser over TLS via https:// on both servers.
One of the requests presents a valid certificate; the other server's does not. This isn’t right, because the service should be signed by the SharePoint farm certificate.
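The browser check can also be scripted. The sketch below is an assumption-laden illustration rather than the method used here: it performs a verified TLS handshake against a given host and port and reports whether the local trust store accepts the certificate. The function name is made up, and the host and port are whatever the service is bound to on your farm.

```python
import socket
import ssl


def check_certificate(host, port, timeout=5.0):
    """Try a verified TLS handshake and describe the outcome as a short string."""
    # The default context verifies the chain against the trust store of the
    # machine running the script (on Windows, the system certificate stores).
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "trusted"
    except ssl.SSLCertVerificationError as error:
        # Raised when the chain or the host name cannot be verified.
        return f"untrusted: {error.verify_message}"
    except OSError as error:
        # Connection refused, timeout, DNS failure and similar problems.
        return f"unreachable: {error}"
```

Because verification uses the local trust store, the script would have to run on each farm member in turn, just like opening the browser on each server. A node that is missing the root certificate should then report "untrusted" while the healthy node reports "trusted".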
Squishing the pesky little bug
Opening an MMC console and connecting to the machine certificate store, I can’t find the SharePoint Root Authority certificate in the Trusted Root Certification Authorities store. To fix the issue, I simply export the certificate from the other farm member and import it on the server where it is missing.
The SharePoint services didn’t pick up the change right away, so I restarted IIS on the farm members with iisreset.exe. After the restart, the service was authenticated and the other servers could use it again.