SQL Escapades: OVER and PARTITION BY

On a handful of occasions I've needed to write a query that returns only one record per group of records. Like, needing to retrieve one customer record per address so we can send some mail.

Typically, you can accomplish these types of queries using GROUP BY statements. Maybe something like this:

SELECT   street_address, city, state  
FROM     customers  
GROUP BY street_address, city, state  

But what if we need to include the customers' first and last names in the results? We can't add the name field to our SELECT because it's not in our GROUP BY clause. And we don't want to run an aggregate function on the names because we can't guarantee the functions will return a first and last name that match. This problem is that a grouping query isn't want we want to do. We don't want to get back aggregate data. We want to get back data specific to one customer -- we just want one per address.

So what do we do?

OVER

The OVER clause in SQL is a feature that allows you to retrieve aggregate data together with data from individual rows. For example, if you want to return all sales records and (for some reason) the total revenue of all sales with each record, you could do this:

SELECT *, sum(price) OVER() AS total_revenue  
FROM   sales  

While this might be useful occasionally, it's not helpful to our goal of retrieving one customer per address. We need...

PARTITION BY

The OVER function doesn't have to be empty, as above. It can take arguments! PARTITION BY is one such argument that will run the aggregate function against the table grouped by the the given partitioning column.

So say we want to get revenue subtotals by year instead of a grand total for our sales records. In Postgres we could to this:

SELECT *, sum(price) OVER(PARTITION BY date_part('year', created_at)) AS monthly_subtotal  
FROM   sales  

Putting It Together

Let's get back to our original question of retrieving one customer per address. Have you figured it out yet? We can partition by the customer's address! Then it's a simple matter of applying the row_number() aggregate function to the partitioned results to select only one customer per address.

SELECT results.*  
FROM (  
    SELECT c.*, row_number() OVER(PARTITION BY street_address, city, state) AS row_num
    FROM   customers c
  ) results
WHERE results.row_number = 1  

Voila! We've got one full customer record per address!

Let's do another example to drive it home. Say we're running a news database with "stories" that each belong to a single "category". Let's write a query to retrieve the five most recent story titles from each category.

SELECT title, row_number() OVER(PARTITION BY category_id ORDER BY created_at DESC) AS row_num  
FROM   stories  
WHERE  row_num <= 5  

I'd love to see other takes on how to do these queries, so please leave comments below if you've got other ideas!