
Architecting for Scalability/Redundancy

Posted on Sep 11, 2020

Posted in category: Development

With the continued development of cloud-based resources, including seemingly easy ways to provide scale or geo-redundancy for applications, I often encounter development teams that don't fully understand the intricacies of architecting solutions that will work in these new environments. It is a common fallacy that you can "lift & shift" any application to the cloud and gain the value of scalability and geo-redundancy.

Defining Terms

It is important to set the stage for a common understanding when discussing these topics, so let us define a few key terms that come into play when discussing scalability and redundancy within applications.

Scale-Up (Vertical Scaling)

Scaling-up an environment is nothing more than a fancy way to say "throw hardware at it." When you scale-up an environment, you give it more resources, such as additional CPU, memory, or both. Almost all applications should support the concept of scaling-up without any real changes.

Employing this concept, you can handle temporary spikes, poor coding practices, and many related common problems. However, it will not provide robust failover options or redundancy during events such as software updates or system patching. Additionally, you will eventually encounter practical resource limitations, such as thread limits, maximum CPU or memory capacity, or even I/O constraints.

Scale-Out (Horizontal Scaling)

Scaling-out an environment involves adding new resources, such as additional web servers or database servers, to handle additional load. Azure, AWS, and other cloud platforms support automatically scaling-out an application using various metrics or triggers. One example would be a scaling rule that adds a web server each time average CPU utilization exceeds 65% for 5 minutes.

This option can provide redundancy, support substantially higher traffic levels, and more. However, your application needs to be developed in a manner that supports this scaling scenario. Depending on the particular cloud environment being used, you will need to consider items such as caching, file system storage, and internal task/job processes. Moving an existing application into an auto-scale environment can be a complicated process depending on its specific needs.

Geo-Redundant Deployments

The most complex environments to manage are geographically redundant ones, meaning a deployment to more than one geographic region. You can achieve geographic redundancy using various methods; however, it is best to break things into two categories: environments where traffic is actively served from multiple locations, and those where traffic is only served from an alternate region after a failure of the primary. These environments might also be referred to as Hot-Hot and Hot-Warm, or similar terms.

Most applications, without proper architecture up-front, will need substantial work to support a geographically redundant deployment. Caching, file storage, traffic routing, and more become much more complicated to manage in these environments. Initial testing can also yield false positives when improper processes are used for testing and validation.

Code Architecture Considerations

Now that we have a common understanding of the types of environments used for better scale or redundancy, we can start to discuss architecture limitations and areas of focus. By reviewing how your application is built, you can determine what level of change is needed to support each environment. We will investigate Code Deployment, File Storage, Databases, Caching, Tasks, and Routing. Your application might not have all of these components, or it may have additional considerations.

For each of these sub-sections, we will review the possible challenges using two example applications: a traditional .NET Core application and a popular web-based Content Management System. I am using the CMS as an example here because it provides a nice sample that can be verified quickly for the sake of discussion.

Application Code Deployment

For many applications, this will be the easiest part, but certain frameworks can introduce additional challenges. You have to look at your application and determine a few things with regard to how your code is deployed.

  1. Can my code be deployed repeatedly without issue?
  2. Is there any configuration that needs to change based on deployment location? (Environment variables, for example; see the sketch after this list.)
  3. Does my application allow any changes, made by users after a deployment, that would impact my code?
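
On the configuration point, a minimal sketch of location-aware configuration in ASP.NET Core might look like the following; the Storage:BaseUrl key and its values are hypothetical and purely illustrative.

    using Microsoft.Extensions.Configuration;

    public static class ConfigExample
    {
        public static void Main()
        {
            // Environment variables override the JSON file in this layered
            // setup, so each deployment location can supply its own value
            // (e.g. Storage__BaseUrl=https://cdn-eastus.example.com).
            IConfiguration configuration = new ConfigurationBuilder()
                .AddJsonFile("appsettings.json", optional: true)
                .AddEnvironmentVariables()
                .Build();

            // "Storage:BaseUrl" is a hypothetical key used for illustration.
            string storageBaseUrl = configuration["Storage:BaseUrl"];
        }
    }

If every location-specific value flows through configuration like this, the same build artifact can be deployed to any region unchanged.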

The idea here is determining if additional efforts are needed to synchronize or deploy code. Let us consider two examples.

This Website (ASP.NET Core)

This website is built using ASP.NET Core. Any content changes are stored in the database, and uploaded images are loaded to Azure Blob Storage. This architecture allows us to drop the code and associated configuration into a new environment at any time without worrying about lost changes. Because of this, we have no application-code concerns for Scale-Up, Scale-Out, or even Geo-Redundant deployments.

Content Management Systems (Such as DNN Platform)

A dynamic CMS, or any other application that allows user-uploaded extensions, is often difficult to redeploy to another location without copying or synchronizing files from the initial application source. This limitation applies to any application that supports dynamic additions of executable code, such as DLLs or compiled views. Although limiting, it doesn't necessarily rule out all types of scale.

Scaling-up and scaling-out, at least with Azure technologies, are still entirely possible with an architecture such as this. Azure uses a shared/mounted drive to synchronize files in a scale-out situation, so this limitation isn't a concern there. This architecture model only suffers major limitations when deploying to different geographic regions or in situations where a shared file repository cannot be used for the application runtime.

Generated Content File Storage

Not all applications will start by separating generated file storage from application file storage; however, it is important to review how any generated/user content in the form of files is managed for scalability. Many different architecture solutions could be utilized for this type of content, including:

  1. File upload within the application
  2. File upload to SQL Server
  3. Cloud file solutions, such as Azure Blob Storage or Amazon S3
  4. Cloud file solutions with additional CDN for redundancy

It is important to review your architecture to determine, first, IF you have this type of content and, second, how you might mitigate the risks associated with it.

This Website (ASP.NET Core)

This website supports dynamically uploaded content in the form of pictures within the actual blog articles. Once an article is posted, the article content is stored in the database. Links to any included images are built using an Azure CDN link to the upload within blob storage; this means that no changes to the local file system occur during content publishing.
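
As a rough illustration only (not this site's actual code), the upload flow can look like the following sketch using the Azure.Storage.Blobs SDK; the container name and CDN host are placeholders.

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Azure.Storage.Blobs;

    public class ImageUploader
    {
        private readonly BlobContainerClient _container;

        public ImageUploader(string connectionString)
        {
            // "article-images" is a hypothetical container name.
            _container = new BlobContainerClient(connectionString, "article-images");
        }

        public async Task<Uri> UploadAsync(string fileName, Stream content)
        {
            // Push the bytes to blob storage; nothing touches the local disk.
            BlobClient blob = _container.GetBlobClient(fileName);
            await blob.UploadAsync(content, overwrite: true);

            // Link content through the CDN endpoint rather than the storage
            // account; "cdn.example.com" stands in for the real CDN host.
            return new Uri($"https://cdn.example.com/article-images/{fileName}");
        }
    }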

This architecture ensures that we will be able to work in any scale environment without change.

Content Management Systems (Such as DNN Platform)

In the DNN Platform case, numerous methods are used to upload files to the system. The default process stores all uploaded files in a folder under the /portals directory. This storage location is additionally used to house application-specific assets.

Like the application code limitations, this restricts deploying in a geo-redundant manner. The DNN Platform does offer some ability to manage this using "Folder Providers," which allow you to store files in different locations; however, this isn't a complete solution and has many limitations. For the DNN Platform to support geo-redundant storage, many additional hurdles must be overcome.

Databases

Cloud database technology has substantially lowered the cost threshold for scale, supporting scale-up and geo-redundant models with relative ease, including automatic failover and synchronization. Using Azure's Active Geo-Replication, for example, a connection string swap is all that is necessary to support this redundancy model.

Additional scale-up options exist and are a non-issue regardless of the type of infrastructure used.

This Website (ASP.NET Core)

Database geo-replication is typically not a huge burden on the application; however, with custom code, it is possible to share the load in these situations. The Microsoft documentation linked above shows example configurations where the application can read from the secondary location, reducing the load on the primary database. Some code changes would be necessary, but they are easily achievable.
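
As a hedged sketch of that pattern: with Azure SQL geo-replication, read-only routing is mostly a connection string concern. The server, database, and credential values below are placeholders.

    using Microsoft.Data.SqlClient;

    public static class ReportingConnections
    {
        // Placeholder endpoint and credentials; real values will differ.
        private const string ReadOnlyConnection =
            "Server=tcp:myapp.database.windows.net;Database=MyApp;" +
            "User ID=app;Password=<secret>;" +
            // ApplicationIntent=ReadOnly routes this connection to a
            // readable secondary, keeping reporting load off the primary.
            "ApplicationIntent=ReadOnly;";

        public static SqlConnection OpenForReporting()
        {
            var connection = new SqlConnection(ReadOnlyConnection);
            connection.Open();
            return connection;
        }
    }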

Content Management Systems (Such as DNN Platform)

Thankfully, with failover being managed by the database connection itself, this sort of geo-replication is supported fully.

Caching

Application caching is a key piece of performance functionality that is often overlooked in architecture discussions. We utilize cache within an application to avoid repeating expensive operations with every request, or even within a single request. Issues with scalability and caching can be the hardest to catch, as they typically appear only in certain situations and with certain changes, depending on what is being cached and how it is used.

Scale-up situations will not impact the caching model; however, as soon as you introduce additional servers, locally or geo-redundantly, caching becomes a primary concern. The ramifications of not solving for this could be as simple as a user not seeing new content or as severe as a deleted user still having access to a site for a while, all fully dependent upon what was cached and for what purpose.

This Website (ASP.NET Core)

Out of the box, .NET Core supports two types of caching: IMemoryCache and IDistributedCache. Planning ahead, you can utilize IDistributedCache backed by its in-memory implementation to ensure that your application can transition to a true distributed cache if/when you need to scale.

Scaling-up this application is easy and will have no impact on caching. Scaling-out or going geo-redundant will require a minor change to the application startup and the introduction of an external process for cache management; .NET Core supports SQL Server, NCache, and Redis out of the box. The configuration of these elements is pretty simple.
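
A minimal sketch of that startup change follows; consuming code depends only on IDistributedCache, so only the registration moves. The Redis connection string is a placeholder.

    using Microsoft.Extensions.DependencyInjection;

    public static class CacheRegistration
    {
        public static void ConfigureCache(IServiceCollection services, bool scaledOut)
        {
            if (!scaledOut)
            {
                // Single server: IDistributedCache backed by local memory.
                services.AddDistributedMemoryCache();
            }
            else
            {
                // Scale-out or geo-redundant: back the same interface with
                // Redis (Microsoft.Extensions.Caching.StackExchangeRedis).
                services.AddStackExchangeRedisCache(options =>
                {
                    options.Configuration =
                        "my-cache.redis.cache.windows.net:6380,ssl=true,password=<secret>";
                });
            }
        }
    }

Because callers resolve IDistributedCache either way, no call sites change when the backing store does.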

Looking at this from an architecture perspective, it would be prudent to start with IDistributedCache anytime you believe future requirements will include a scale pattern.

Content Management Systems (Such as DNN Platform)

Due to their dynamic nature, content management systems will often leverage cache more heavily to reduce the performance impact of dynamic configuration elements loaded from the database. The DNN Platform is no exception.

Thankfully, the DNN Platform supports a provider-based model for caching, with out-of-the-box support for File, Memory, and Simple Web caching. These models work well in single-server and traditional scale-out models; however, they do not work well with the dynamic nature of auto-scale. The greater DNN Platform ecosystem adds further options, with support for Redis and other cache providers that can fill those holes!

Tasks

The more complex your application, the more likely it is to have components beyond a simple web application. It is common to need back-end tasks, or jobs, that execute on specific intervals or respond to user requests. Many technologies can be employed for this; however, it is important to look at the models used and how they might be impacted by Scale-Up, Scale-Out, and Geo-Replication.

This Website (ASP.NET Core)

For all .NET Core based projects, we include Hangfire by default. Hangfire provides a robust back-end job processing sub-system that fully supports .NET Core, durability, and scale. It utilizes a SQL Server database to manage the queue of tasks and will ensure that, regardless of the scale environment, your tasks execute when they need to.

Given that we have full control here, we need to ensure that any job we schedule is set up in a forward-thinking manner and ready for any scale setup. Hangfire takes most of that guesswork away from us.
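
As a minimal sketch (the connection string, job ID, and job body are illustrative only), wiring Hangfire into an ASP.NET Core application looks roughly like this:

    using System;
    using Hangfire;
    using Microsoft.Extensions.DependencyInjection;

    public static class JobSetup
    {
        public static void ConfigureJobs(IServiceCollection services, string connectionString)
        {
            // SQL Server acts as the shared job queue, so any server in a
            // scaled-out deployment can pick up work exactly once.
            services.AddHangfire(config => config.UseSqlServerStorage(connectionString));
            services.AddHangfireServer();
        }

        public static void ScheduleJobs()
        {
            // A hypothetical nightly cleanup job.
            RecurringJob.AddOrUpdate(
                "nightly-cleanup",
                () => Console.WriteLine("Cleaning up..."),
                Cron.Daily());
        }
    }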

Content Management Systems (Such as DNN Platform)

Every CMS that I have worked with has an internal task system; the DNN Platform is no different. Each of these systems may have unique limitations or pitfalls. One key item to understand is what is being changed by a specific task. Sometimes a task might clean up files on the file system or index content in the database. Depending on the target operation, you might see different impacts to scalability within your deployment.

For example, tasks that operate against the database only need to run once, even in geo-redundant and scale-out situations. Does your platform support this? The DNN Platform does, but only for named servers. Because of this limitation, traditional scale-out or geo-replication will work; however, automatic scale with dynamic server names will be a real problem.
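
As a generic illustration of the named-server pattern (not DNN's actual implementation), a guard like the following lets only the designated server run a shared task, and it is exactly what falls apart when auto-scale invents server names on the fly; the TASK_OWNER variable is hypothetical.

    using System;

    public static class TaskGuard
    {
        // "TASK_OWNER" is a hypothetical setting naming the single server
        // that should execute database-wide jobs.
        public static bool ShouldRunHere()
        {
            string owner = Environment.GetEnvironmentVariable("TASK_OWNER");
            return string.Equals(Environment.MachineName, owner,
                StringComparison.OrdinalIgnoreCase);
        }
    }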

Workarounds can be designed; however, as with other elements of this discussion, you want to plan ahead to avoid overworking your system or corrupting data.

Network Infrastructure Architecture

We have spent a good amount of time reviewing the code architecture needs for scalability; however, your network architecture can have a large impact on your options. These problems are typically temporary, but they can result in delays when you need to respond promptly.

DNS & Traffic Flow

Scaling an environment up does not change the flow of visitor traffic to your website; therefore, regardless of the environment, no changes will be needed. However, scaling-out or geo-replication could require changes.

Scaling-Out

If you are using Azure App Service (PaaS), you can use the scale-out options without any DNS or traffic flow changes; the system will manage this with your standard endpoint. If you are in a traditional hosting environment, it will be important to understand your provider's limitations concerning load balancers. Can they introduce one without a DNS change and the associated propagation delay?

Understanding limitations in a scale-out situation, especially if it is an as-needed plan, can be the dividing line between success and failure.

Geo-Replication

Implementing geo-replication will always require planning, as you must establish a process to route traffic to different data centers based on some criteria. Within the Azure world, this is where Azure Traffic Manager would come into play. Outside of Azure, you could utilize Cloudflare or other services; however, it is important to ensure that the process used is not tied to a single point of failure, such as a single physical data center.

This is an area of setup that can be confusing on the network architecture side. You might not be planning to introduce geo-replication for a while; however, setting up the tooling to enable it, even with a single location, will streamline future deployment. For a small monthly expense, you can put items in place to avoid a need for breaking changes later.

Creating the Perfect Architecture

So, where do you go from here? How do you create the perfect architecture for your particular application? The answers will vary based on implementation; however, I hope the above discussion starts to open up the thought process of architecting your solution. Future posts will dive into various aspects of this.