The MySQL Groupwise Maximum Problem
There is a task that is surprisingly difficult to solve with MySQL queries, and which I’ve spent some time on: the Groupwise Maximum problem.
This is the name for the type of query I was attempting, although in my case I am looking for a set of minimum (rather than maximum) values.
The question: At what time each day do we see the minimum number of available bikes? (A research question, as the answer will tell us something about the commuting habits of the city.)
The source data table:
My initial thoughts were:
select date(timestamp), time(timestamp), min(bikes) from bike_agg_london group by date(timestamp)
This produces the minimum bikes number for each day, which is great, but the timestamp returned alongside it is just the first one of each day. In fact it could be any timestamp from within the day; MySQL’s internal logic simply happens to pick the first one. This is because time(timestamp) is neither part of the “group by” clause nor wrapped in an aggregate function. Standard SQL requires every selected field to be one or the other; MySQL allows the query anyway (unless the ONLY_FULL_GROUP_BY mode is enabled, in which case it is rejected) and returns the value from an arbitrary row in each group. I don’t want to aggregate time(timestamp) though. I want the time from the same row as the minimum bikes value, rather than the maximum, minimum or average (etc) of the times.
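To make the difference concrete, here is a small sketch in plain Python rather than SQL. The sample readings are made up for illustration and are not from the real bike_agg_london table. It contrasts what the query above returned (first timestamp of the day paired with the day’s minimum) with what I actually want (the timestamp of the row that holds the minimum).

```python
from itertools import groupby

# Hypothetical sample readings (timestamp, bikes), already sorted by timestamp.
rows = [
    ("2012-03-01 00:00:00", 40),
    ("2012-03-01 08:50:00", 12),
    ("2012-03-01 17:30:00", 25),
    ("2012-03-02 00:00:00", 38),
    ("2012-03-02 09:10:00", 9),
    ("2012-03-02 18:00:00", 30),
]


def day_of(row):
    # The date part of an ISO-style timestamp is its first 10 characters.
    return row[0][:10]


naive = {}   # mimics the result described above: first time of day + min(bikes)
wanted = {}  # the groupwise minimum: the time taken from the minimum row itself

# groupby only merges adjacent rows, which is fine because rows are sorted.
for day, group in groupby(rows, key=day_of):
    readings = list(group)
    naive[day] = (readings[0][0][11:], min(b for _, b in readings))
    ts, bikes = min(readings, key=lambda r: r[1])  # first row wins on ties
    wanted[day] = (ts[11:], bikes)

print(naive)
print(wanted)
```

The minimum bikes values agree in both results; only the times differ, which is exactly the problem with the first query.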
Here’s the SQL that worked for me, quite quickly (~18 seconds for around 166000 rows representing 600 days):
select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
from bike_agg_london b1
inner join (
    select date(timestamp) as theday2, min(bikes) as min_bikes
    from bike_agg_london
    group by date(timestamp)
) b2 on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)
It’s the second solution from the above link. There is one problem: if multiple rows in a day share the same min(bikes) value, they each appear in the result. Using distinct won’t get rid of these, because the time(timestamp) differs between them. The fix is to use an additional wrapper (the derived table aliased b3) that groups by day and keeps only the earliest of the tied timestamps:
select theday1, time(min(timestamp)), bikes
from (
    select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
    from bike_agg_london b1
    inner join (
        select date(timestamp) as theday2, min(bikes) as min_bikes
        from bike_agg_london
        group by date(timestamp)
    ) b2 on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)
) b3
group by theday1, bikes
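As a rough check of the tie behaviour, the sketch below runs both queries against a tiny in-memory table. Two loud assumptions: it uses Python’s built-in sqlite3 module as a stand-in for MySQL (SQLite’s date() and time() happen to accept the same queries for ISO-style timestamp strings, but this is not the database I actually ran them on), and the six sample rows are invented, with a deliberate tie on the first day. The "order by" clauses are also my addition, purely to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption, see above).
conn = sqlite3.connect(":memory:")
conn.execute("create table bike_agg_london (timestamp text, bikes integer)")
conn.executemany(
    "insert into bike_agg_london values (?, ?)",
    [
        ("2012-03-01 00:00:00", 40),
        ("2012-03-01 08:50:00", 12),
        ("2012-03-01 09:00:00", 12),  # ties with 08:50 for the day's minimum
        ("2012-03-01 17:30:00", 25),
        ("2012-03-02 00:00:00", 38),
        ("2012-03-02 09:10:00", 9),
    ],
)

# The join query: one row per reading that equals its day's minimum.
join_sql = """
select date(b1.timestamp) theday1, b1.timestamp, b1.bikes
from bike_agg_london b1
inner join (
    select date(timestamp) as theday2, min(bikes) as min_bikes
    from bike_agg_london
    group by date(timestamp)
) b2 on (date(b1.timestamp) = b2.theday2 and b1.bikes = b2.min_bikes)
"""

# The wrapper: collapse tied rows, keeping the earliest timestamp per day.
dedup_sql = f"""
select theday1, time(min(timestamp)), bikes
from ({join_sql}) b3
group by theday1, bikes
order by theday1
"""

join_rows = conn.execute(join_sql + " order by b1.timestamp").fetchall()
dedup_rows = conn.execute(dedup_sql).fetchall()

print(join_rows)   # the first day appears twice because of the tie
print(dedup_rows)  # one row per day
```

Note that the wrapper always reports the earliest of the tied times. For my question that seems a reasonable choice, but taking max(timestamp) instead would give the last time the minimum was seen.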