Finally getting the opportunity to get back to SQL Server 2005 development, and coming to the conclusion that NOT all of my projects have to support SQL Server 2000 I started looking all of those "fun" new items that we all really wanted. This article is going to go through a scenario that demonstrates how to use Common Table Expressions, Rank() and Partition to get results for a very common data scenario. We will present this information with an introduction to the problem, scripts to set up the problem in a test environment, and lastly the implementation, with a review/summary at the end.
The Problem
Having an input source of data that looks similar to the below table it is easy to see how this could be real data. Take for example a data log file, where you have multiple error types that can occur with different messages.
RecordId | TypeId | ActivityDate | MessageText |
---|---|---|---|
1 | 1 | 10/1/2008 | Test Message (Max for Cat 1) |
2 | 1 | 9/28/2008 | Test Message (Other for Cat 1) |
3 | 2 | 10/15/2008 | Test Message (Max for Cat 2) |
The problem is that management wants to see a report that ONLY shows the "Most Recent" entry for each TypeId. Therefore in the above table, we need to omit record 2 as the 10/1/2008 entry is the most recent. Now I'm sure that you are thinking just like I was before going after this issue that it shouldn't be that hard, right? Well, it isn't, but SQL Server 2005's commands can make the code downright easy!
Setup Code
After providing a number of articles in the past I have decided for SQL Server tutorials like this to include SQL Scripts for creating the test data set, below are all scripts needed to create the test dataset that will be used in the next section. The bottom of this article will also contain a download link for the SQL Script.
The above script isn't anything fancy, just a create table and a few inserts, to show the differences. To see the larger impact of this process you may add more entries to SampleData.
Creating the Query
As I mentioned I am going to implement the solution using a Common Table Expression, which by nature means that we must have a processing step that occurs first, before we select the actual data, as that is the process flow for CTE's. In the case of this data set, we need to find a way of marking the MOST current entry, for each type with a value that we can later query out of the CTE. SQL Server provides a RANK() function, which simply ranks a result set, allowing ties to appear, therefore two entries with the EXACT same value will be given the same value. Now that is great for a whole dataset, but we need to rank items, inside each TypeId, not over the whole query, this is where PARTITION comes in. Partition is essential "Group By" for RANK. Therefore the first part of our CTE results in the following SQL.
If you EXECUTE the SELECT inside the CTE you will find that we have results like the following.
RecordId | TypeId | ActivityDate | MessageText | ActivityRank |
---|---|---|---|---|
1 | 1 | 10/1/2008 | Test Message (Max for Cat 1) | 1 |
2 | 1 | 9/28/2008 | Test Message (Other for Cat 1) | 2 |
3 | 2 | 10/15/2008 | Test Message (Max for Cat 2) | 1 |
The key piece of the puzzle here is the RANK() OVER line(s). In the OVER statement, we specify that we want to partition the rank based on the TypeId and that we want to order everything by the ActivityDate in descending order. This prepares a set of data for us, now, the query that uses the CTE is quite simple, and we get the desired results.
This simple select is combined with the previously discussed CTE and in my opinion, makes it really easy for people to read. Now, in this example, we used Rank which as I mentioned all values matching will be assigned the same rank. Take for example our current top value for TypeId = 1. If other records existed for TypeId =1 that shared the same ActivityDate, they would also be assigned the RANK of 1. Now depending on business needs, this MIGHT not get exactly what you are looking for.
To get an EXACT This is the FIRST top record, change RANK() to ROW_NUMBER(), as ROW_NUMBER() generates a unique id for EVERY row in the table, regardless of a tie situation.
Summary and Downloads
I hope that this article has provided some helpful information in regards to how to use CTE's, RANK(), ROW_NUMBER(), and PARTITION inside your SQL scripts. Below is a download link to a fully combined copy of the Script Examples provided here. Please share your comments below, and as always please feel free to submit any suggestions for future articles.