As web applications become more complex, or as individuals demand better performance or reliability, the addition of more web nodes is very common. Those of us with a technical background can easily see the complexity that is introduced as we load balance our applications. Simple tasks, such as writing a file to the local file system, become far more complicated. How are the servers getting their content? A shared file system? File replication (DFS, RoboCopy, etc.)? When things work beautifully, it is great, but when they don't, it can be a nightmare. In this post, I want to share a few "lessons learned" relating to load balancing diagnostics. The information in this post relates specifically to working within a DNN installation; however, the same principles apply regardless of the platform.
Setting the Scenario
It is very easy to talk in general terms when outlining the importance of a particular configuration item. However, I've found that it is often easier to have an illustration of WHY a particular configuration is a good idea, rather than simply "You should do this!" In the situation that I will outline, we started with an Evoq-based solution, using somewhere between 40 and 50 parent portals, load balanced across 4 web nodes within the Amazon AWS environment. Within this environment, DFS replication was used to ensure that the file systems were kept in sync and that each node could stand on its own, avoiding the need for a beefy file server. A separate cluster configuration was used for the back-end database; however, it isn't relevant to this discussion.
From a DNN configuration perspective, all best practices were followed with regard to the configuration of Scheduler Jobs, caching, and the avoidance of session. All web traffic to each of the 40+ portals is routed through the load balancer at all times. The initial configuration additionally exposed internal network communication to Portal 0 using the internal AWS IP addresses. This configuration is needed in the Evoq solution to ensure that each portal can notify the other portals of any expiration of items in the cache.
The above-listed environment had been working well for the better part of 2 years, handling periods of extreme traffic without incident. Then a module update from a trusted vendor was installed to help improve the end-user functionality. The installed module, which I will not directly identify, is used on all portals of the site and provides critical functionality. After installation, the module stopped working on most portals due to data corruption issues. The module itself was also acting very strangely, showing inconsistent data every 2-3 seconds as it provided system status information.
The module had internal methods designed to resolve the data corruption; however, none of these actions helped, and overall the module was impacting the sites in a negative manner. As we started to dig into the problems, we noticed that, although the module was configured to run its "scheduled" items on a specific server, its internal corruption-repair processes executed differently: they manually triggered a background thread. As a result, when the module polled for status, only the occasional request saw proper status information; the remaining requests landed on nodes that knew nothing about the pending job and tried to start it again. This resulted in multiple nodes trying to write content to the same folder... a truly non-fixable situation.
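To make the failure mode concrete, here is a minimal simulation of it. This is a simplified model, not the vendor's code: the `Node` class, the `poll_status` method, and the round-robin balancer are all hypothetical names invented for illustration. The only assumption taken from the scenario is that each node tracks the repair job in its own memory.

```python
from itertools import cycle


class Node:
    """One web node that keeps job status only in its own memory (assumed)."""

    def __init__(self, name):
        self.name = name
        self.job_running = False  # local state, never shared with other nodes

    def poll_status(self):
        """Handle a status poll; start the repair job if this node sees none."""
        if self.job_running:
            return f"{self.name}: job already running"
        self.job_running = True  # background thread started on THIS node only
        return f"{self.name}: no job found, starting one"


def simulate(node_count, polls):
    """Send `polls` status requests round-robin across `node_count` nodes."""
    nodes = [Node(f"node{i + 1}") for i in range(node_count)]
    balancer = cycle(nodes)  # stand-in for a simple round-robin load balancer
    log = [next(balancer).poll_status() for _ in range(polls)]
    running = [node.name for node in nodes if node.job_running]
    return log, running


if __name__ == "__main__":
    log, running = simulate(node_count=4, polls=6)
    for line in log:
        print(line)
    print("nodes running the job at once:", running)
```

With 4 nodes, the first four polls each start a job on a different node, so all four end up running it at the same time. With `node_count=1` the job is started exactly once, which is why the problem never shows up on a single server.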
How We Fixed It
The above problem was easy to identify, but it was far from easy to resolve. We had a module that did not truly support load-balanced environments, corrupt data, and a site suffering performance issues because of it. This issue wouldn't have existed in a single-server environment, as it was related to synchronization and statuses. What we needed in our case was a way to ensure that we were running on a single server while we fixed the corruption. With 50 portals, we didn't have an easy way to do this, as we needed to trigger a manual fix on EACH portal to properly resolve the issues.
For the pending "emergency" we simply removed 3 of the 4 nodes from the load balancer and updated all of the sites. In our case, we were fortunate that the day this happened was a VERY slow day and we could run on a single server. But taking three nodes out of rotation isn't a good answer in general.
What SHOULD your Environment Support?
Looking back at the issue, it would have been really simple to resolve if we had something as simple as "admin.mysite.com" that would direct us to a single node. We could have used that to bypass the load balancer and ensure that we worked only on a single server. When I looked at the environment configuration of all of my load-balanced customers, only around 25% of them had an established way to reach a specific server. Given the things that CAN go wrong in a load-balanced environment, it is important to be able to get to a single node without having to impact the overall configuration. We don't need to be able to touch every node in the farm individually, but we do need the ability to bypass the load balancer if we encounter a component, or write a component, that doesn't play well in a load-balanced environment.
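The idea of a dedicated hostname can be sketched as a routing decision. The following is only an illustration of the concept: the `Router` class, the IP addresses, and the hostnames other than "admin.mysite.com" are hypothetical. In a real environment this rule would more likely live in DNS or in the load balancer's own configuration than in application code.

```python
class Router:
    """Toy request router: pinned hostnames bypass the round-robin pool."""

    def __init__(self, nodes, pinned_hosts):
        self.nodes = list(nodes)
        # Hostnames are case-insensitive, so normalize the keys once.
        self.pinned_hosts = {h.lower(): n for h, n in pinned_hosts.items()}
        self._next = 0  # index of the next node in the rotation

    def choose_node(self, host_header):
        """Return the node that should serve a request for `host_header`."""
        host = host_header.split(":", 1)[0].strip().lower()  # drop any port
        if host in self.pinned_hosts:
            return self.pinned_hosts[host]  # always the same single node
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node


if __name__ == "__main__":
    # Hypothetical internal addresses, for illustration only.
    farm = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]
    router = Router(farm, {"admin.mysite.com": "10.0.0.11"})
    for host in ["www.mysite.com", "admin.mysite.com", "www.mysite.com"]:
        print(host, "->", router.choose_node(host))
```

Normal traffic keeps rotating through all four nodes, while every request for the pinned hostname reaches the same node, so a manual fix can be run there without pulling anything out of the farm.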
I hope that this was helpful. If you are working in a load-balanced environment, I strongly encourage you to review your configuration to see if you have the ability to target a specific node. If you don't, you might try to add one. I should note that this might not be as big of a deal for those running "Sticky Sessions" in their environments. However, I recommend against the use of Sticky Sessions whenever possible, as load is generally distributed far more evenly when Sticky Sessions are not used.
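As a rough illustration of the Sticky Sessions point, the sketch below compares the two approaches on a made-up workload. The `distribute` function and the session request counts are hypothetical, and the model is deliberately simple: it assumes plain round-robin assignment and ignores timing entirely.

```python
def distribute(sessions, node_count, sticky):
    """Return the number of requests each node handles.

    `sessions` maps a session id to the number of requests it sends.
    """
    load = [0] * node_count
    next_node = 0  # round-robin pointer
    for requests in sessions.values():
        if sticky:
            # The whole session stays on the node it first landed on.
            load[next_node % node_count] += requests
            next_node += 1
        else:
            # Every request goes to whichever node is next in the rotation.
            for _ in range(requests):
                load[next_node % node_count] += 1
                next_node += 1
    return load


if __name__ == "__main__":
    # Hypothetical workload: two heavy sessions and four light ones.
    sessions = {"a": 40, "b": 2, "c": 3, "d": 3, "e": 40, "f": 4}
    print("sticky:    ", distribute(sessions, 4, sticky=True))
    print("non-sticky:", distribute(sessions, 4, sticky=False))
```

For this workload the sticky run gives per-node loads of `[80, 6, 3, 3]`, while the non-sticky run gives `[23, 23, 23, 23]`. Real traffic will be less extreme, but a couple of heavy sessions landing on the same node is exactly the kind of imbalance that Sticky Sessions can produce.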