Thursday, 7 November 2013

Removing Duplicates


A confounding yet common problem with data is duplicated information. It could be in a poorly designed table, in imported data or other situations, but the problem remains the same, how do you tell your system to delete all but one copy of the row? Fortunately, SQL Server has a tool that makes it easy: ROW_NUMBER().
In duplicate data, each row is the same. While you can look at it and decide which row you’d want to delete, the problem is how you can tell SQL Server which row that is. ROW_NUMBER() lets you add a column that differentiates between the rows and you can delete all but the desired ones.
Lets say you have a table
CREATE TABLE DuplicateRow(
FName varchar(30),
LName varchar(30),
JobTitle varchar(30),
Age tinyint
)
We’ll keep it simple for this example. Now let’s say you query the table and find that when you were doing the inserts you were a little too enthusiastic in hitting “execute”
FNameLNameJobTitleAge
MikeKlarmManager37
MikeKlarmManager37
MikeKlarmManager37
You don’t want him in there three times, but if you tell SQL Server to DELETE based on any of the data you have available, it’ll delete all three records and you’ll have to insert again. To keep one row and delete the others, you’d use ROW_NUMBER(). We’ll start with a SELECT so you can see what’s happening.
SELECT
  ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow
This gives you
RFNameLNameJobTitleAge
1MikeKlarmManager37
2MikeKlarmManager37
3MikeKlarmManager37
And now you can just delete any rows where R > 1
DELETE q
    FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, 
LName, JobTitle, Age ORDER BY LName ) R
              , FName
              , LName
              , JobTitle
              , Age
            FROM DuplicateRow
         ) q
    WHERE R > 1
you don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.
If there are other rows in the database, the first instance of every set of data will be numbered 1 and will not be deleted, but every record that is a duplicate will be deleted.
Let’s look at ROW_NUMBER() just a little more closely. All of the magic happens in the OVER() clause that follows ROW_NUMBER(), specifically the “partition by” section. This is where you list the columns that you expect to be duplicated. Since we want to delete any rows where every single column is duplicated, we list all of them here. “Order by” doesn’t make much difference here since all the rows are the same, but you do need a value here for it to work.
If your data is a little different, you can use ROW_NUMBER to get rid of different combinations of data. Let’s say Mike got a promotion last year and someone added a row instead of updating it.
FNameLNameJobTitleAge
MikeKlarmManager37
MikeKlarmManager37
MikeKlarmManager37
MikeKlarmOperator32
MikeKlarmOperator32
MikeKlarmIntern22
Depending on what you put in your “partition by” statement, you can decide what to delete. If you want to keep one of each set, use the same query as above and you’ll get
RFNameLNameJobTitleAge
1MikeKlarmManager37
2MikeKlarmManager37
3MikeKlarmManager37
1MikeKlarmOperator32
2MikeKlarmOperator32
1MikeKlarmIntern22
It will number each row, starting at one for each unique combination of data listed in the “partition by” statement. If you then run the above DELETE statement, you’ll be left with
FNameLNameJobTitleAge
MikeKlarmManager37
MikeKlarmOperator32
MikeKlarmIntern22
But let’s say you only want to keep the most recent record. You could run the above query and also delete any record where the Age <> 37, but that doesn’t work as well if you have a lot of data in the table, you’d have to specify just Mike Klarm to keep from deleting anyone who isn’t 37.
An easier way is to modify your “partition by” clause.
SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName ORDER BY Age DESC ) R
      , FName
      , LName
      , JobTitle
      , Age
    FROM DuplicateRow
You’ll end up with
RFNameLNameJobTitleAge
1MikeKlarmManager37
2MikeKlarmManager37
3MikeKlarmManager37
4MikeKlarmOperator32
5MikeKlarmOperator32
6MikeKlarmIntern22
And when you delete where R > 1 you’ll be left with the most recent record. You can use the “partition by” clause to get many different sortings on data that’s only partially duplicated.

No comments:

Post a Comment