November 01, 2008

SQL Server RANK, ROW_NUMBER, and Partition

Finally getting the opportunity to get back to SQL Server 2005 development, and coming to the conclusion that NOT all of my projects have to support SQL Server 2000 I started looking all of those "fun" new items that we all really wanted. This article is going to go through a scenario that demonstrates how to use Common Table Expressions, Rank() and Partition to get results for a very common data scenario. We will present this information with an introduction to the problem, scripts to setup the problem in a test environment, and lastly the implementation, with a review/summary at the end.

The Problem

Having an input source of data that looks similar to the below table it is easy to see how this could be real data. Take for example a data log file, where you have multiple error types that can occur with different messages.

RecordId TypeId ActivityDate MessageText
1 1 10/1/2008 Test Message (Max for Cat 1)
2 1 9/28/2008 Test Message (Other for Cat 1)
3 2 10/15/2008 Test Message (Max for Cat 2)

The problem is that management wants to see a report that ONLY shows the "Most Recent" entry for each TypeId. Thefore in the above table we need to omit record 2 as the 10/1/2008 entry is the most recent. Now I'm sure that you are thinking just like I was before going after this issue that it shouldn't be that hard, right? Well it isn't, but SQL Server 2005's commands can make the code down right easy!

Setup Code

After providing a number of articles in the past I have decided for SQL Server tutorials like this to include SQL Scripts for creating the test data set, below are all scripts needed to create the test dataset that will be used in the next section. The bottom of this article will also contain a download link for the SQL Script.

CREATE TABLE SampleData
(
	RecordId INT IDENTITY(1,1),
	TypeId INT,
	ActivityDate DATETIME,
	MessageText TEXT
)
GO

INSERT INTO SampleData
	(TypeId, ActivityDate, MessageText)
VALUES
	(1, '10/1/2008', 'Test Message (Max for Cat 1)')

INSERT INTO SampleData
	(TypeId, ActivityDate, MessageText)
VALUES
	(1, '9/28/2008', 'Test Message (Other for Cat 1)')

INSERT INTO SampleData
	(TypeId, ActivityDate, MessageText)
VALUES
	(2, '10/15/2008', 'Test Message (Max for Cat 2)')

The above script isn't anything fancy, just a create table and a few inserts, to show the differences. To see the larger impact of this process you may add more entries to SampleData.

Creating the Query

As I mentioned I am going to implement the solution using a Common Table Expression, which by nature means that we must have a processing step that occurs first, before we select the actual data, as that is the process flow for CTE's. In the case of this data set, we need to find a way of marking the MOST current entry, for each type with a value that we can later query out of the CTE. SQL Server provides a RANK() function, which simply ranks a result set, allowing ties to appear, therefore two entries with the EXACT same value will be given the same value. Now that is great for a whole dataset, but we need to rank items, inside each TypeId, not over the whole query, this is where PARTITION comes in. Partition is essentially "Group By" for RANK. Thefore the first part of our CTE results in the following SQL.

WITH RankedData AS
(
	SELECT
		SD.RecordId,
		SD.TypeId,
		SD.ActivityDate,
		SD.MessageText,
    	Rank() OVER 
			(PARTITION BY SD.TypeId 
				ORDER BY SD.ActivityDate DESC) AS ActivityRank
    FROM SampleData SD
)

If you EXECUTE the SELECT inside the CTE you will find that we have results like the following.

RecordId TypeId ActivityDate MessageText ActivityRank
1 1 10/1/2008 Test Message (Max for Cat 1) 1
2 1 9/28/2008 Test Message (Other for Cat 1) 2
3 2 10/15/2008 Test Message (Max for Cat 2) 1

The key piece of the puzzle here is the RANK() OVER line(s). In the OVER statement we specify that we want to partition the rank based on the TypeId and that we want to order everything by the ActivityDate in descending order. This prepares a set of data for us, now, the query that uses the CTE is quite simple, and we get the desired results.

SELECT
    RankedData.RecordId,
	RankedData.TypeID,
	RankedData.ActivityDate,
	RankedData.MessageText
FROM RankedData
WHERE RankedData.ActivityRank = 1
ORDER BY TypeId

This simple select is combined with the previously discussed CTE and in my opinion makes it really easy for people to read. Now, in this example we used Rank which as I mentioned all values matching will be assigned the same rank. Take for example our current top value for TypeId = 1. If other records existed for TypeId =1 that shared the same ActivityDate, they would also be assigned the RANK of 1. Now depending on business needs, this MIGHT not get exactly what you are looking for.

To get an EXACT This is the FIRST top record, change RANK() to ROW_NUMBER(), as ROW_NUMBER() generates a unique id for EVERY row in the table, regardless of a tie situation.

Summary and Downloads

I hope that this article has provided some helpful information in regards to how to use CTE's, RANK(), ROW_NUMBER(), and PARTITION inside your SQL scripts. Below is a download link to a fully combined copy of the Script Examples provided here. Please share your comments below, and as always please feel free to submit any suggestions for future articles.

Click here to download the RANK demonstration file

tags: Tutorials, SQL
comments powered by Disqus

Content provided in this blog is provided "AS-IS" and the information should be used at your own discretion.  The thoughts and opinions expressed are the personal thoughts of Mitchel Sellers and do not reflect the opinions of his employer.

Content Copyright

Content in this blog is copyright protected.  Re-publishing on other websites is allowed as long as proper credit and backlink to the article is provided.  Any other re-publishing or distribution of this content is prohibited without written permission from Mitchel Sellers.