Taking the top 5 of grouped ranked data in PostgreSQL

Working from a list of my ancestors’ occupations I wanted to list the top 5 occupations in each occupational category per decade, so that given

year	category    description   total
...
1851	Agriculture  Ag Lab	   4
1851	Agriculture  Farm Bailiff  1
1851	Agriculture  Farm Laborer  2
1851	Crafts       Blacksmith	   1
1851	Crafts       Carpenter	   3
1851	Crafts       Shoe Maker	   1
1852	Crafts       Shoe Maker	   1
1853	Crafts       Blacksmith	   1
1853	Crafts       Shoe Maker	   1
1854	Crafts       Blacksmith	   1
....

I would get

decade	category    description        total   rank
...
1850	Agriculture  Ag Lab		4	1
1850	Agriculture  Farm Laborer	2	2
1850	Agriculture  Farm Bailiff	1	3
1850	Crafts       Blacksmith		3	1
1850	Crafts       Carpenter		3	2
1850	Crafts       Shoe Maker		3	3
...

Using the windows functions introduced in version 8.4, this was the solution I ended up with:

  • The outermost loop selects the top 5
  • The middle loop ranks the data
  • The innermost loop sums the data by decade

SELECT * FROM (
  SELECT decade, 
              category, 
              description, 
              total, 
              rank() OVER (PARTITION BY decade, 
                                        category
                             ORDER BY total DESC, description) AS pos
     FROM (
          SELECT  
                 CASE
                      WHEN year BETWEEN 1840 AND 1849 THEN 1840
                      WHEN year BETWEEN 1850 AND 1859 THEN 1850
                      WHEN year BETWEEN 1860 AND 1869 THEN 1860
                      WHEN year BETWEEN 1870 AND 1879 THEN 1870 
                      WHEN year BETWEEN 1880 AND 1889 THEN 1880 
                      WHEN year BETWEEN 1890 AND 1899 THEN 1890 
                      ELSE 1900 
                 END as decade, 
		 category,  
		 description,  
		 SUM(total) as total
          FROM occupations
          GROUP BY 1,2,3) as orig
     ) AS ss
WHERE pos < 6;

The method for grouping by decade is suggested on grokbase.com.

The ranking then filtering is discussed on the PostgreSQL site:

SELECT depname, empno, salary, enroll_date
FROM
  (SELECT depname, empno, salary, enroll_date,
          rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
     FROM empsalary
  ) AS ss
WHERE pos < 3;

Advertisements

About Jennifer Phillips Campbell

Software Developer and Medieval Historian
This entry was posted in PostgreSQL, SQL. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s