Big data, big ball-ache

Big data, yeah? It’s great isn’t it? Doesn’t everyone just love to have loads of big data all over the place?

Got 30 million customers in the UK, have you? Each of those customers purchasing thousands of products a year, yeah? Screw it, let’s just store ALL that information in a massive database. It’s big data innit? It’s what people do now.

Well I’m sick of it. Regular readers will know that I’m currently in the process of trying to gather trade data from the UN. It’s of the format “we sold this much soap to this country in this year”. Sounds simple, right?

Well it is. But it’s also big. There are around 200 reporting countries, reporting trade with one another, in over 3,000 product categories across fifteen years. This makes the final database somewhere in the region of 150 million rows long. It’s big, and it’s slow, and it’s incredibly painful to deal with.

By way of an example, let me introduce you to a painful problem which has bedevilled me these past few days: due to some kind of weirdness with the import process, some countries’ data ended up with an equals sign at the start of their product codes. So instead of product code “101305” they had “=101305”. I can’t even remember now how this happened, but it’s to do with the fact that the data sets are so large that they could only be opened in certain pieces of software, one of which has obviously had this weird side-effect. The affected countries are Japan, Brazil, China, India, Russia and Mexico. So, some nice small countries then. This means that 20 million pieces of data need an equals sign removing. Sounds easy, right?

The process to get rid of those equals signs started yesterday evening, and was still running this lunchtime, a full eighteen hours after it started.
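For the curious, the fix itself looks trivial on paper. Here’s a minimal sketch of the sort of thing involved, written against SQLite from Python purely for illustration. The table name (trade), the column names and the batch size are all made up for the example, and it is not the actual process described above. It only touches rows whose code starts with an equals sign, and it commits in batches rather than in one enormous transaction. Whether that would be any quicker on a 150-million-row table depends entirely on the database and its indexes.

```python
import sqlite3


def strip_leading_equals(conn, batch_size=100_000):
    """Remove a leading '=' from product_code, one batch of rows at a time.

    Assumes a hypothetical table `trade` with a text column `product_code`.
    Returns the total number of rows that were changed.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE trade
               SET product_code = substr(product_code, 2)
             WHERE rowid IN (
                   SELECT rowid FROM trade
                    WHERE product_code LIKE '=%'
                    LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # keep each transaction small
        if cur.rowcount == 0:
            return total  # nothing left to fix
        total += cur.rowcount


# Tiny made-up data set, just to show the function running.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (reporter TEXT, product_code TEXT)")
conn.executemany(
    "INSERT INTO trade VALUES (?, ?)",
    [("Japan", "=101305"), ("France", "101305"), ("Brazil", "=220300")],
)
fixed = strip_leading_equals(conn, batch_size=1)
print(fixed, "rows fixed")
```

Three rows is not 20 million, of course, which is rather the point.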

This is not tenable. This is not big data. This is just a big ball-ache.