While the code is still fresh in my mind, I thought I'd write another article, this time on how to write a SerDe. The official documentation on the subject is scarce, and otherwise you'd have to read the Hive code directly, like I had to do.
I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data coming from the ad servers, and we need to perform various kinds of analysis on it.
Writing a map-reduce job is not difficult in principle; it's just time-consuming and requires the skills of a trained Java engineer, which wouldn't be needed were we using SQL. That's where Hive comes in: it allows us to query a Hadoop data store using a flavor of SQL.
With the constant increase in the quantity of data that companies collect and need to process, data warehousing is a job sector that's expanding even in the recession. It is also enjoying a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity, much as Linux did 10 years ago. One of these technologies is Hadoop, a distributed filesystem and data processing framework based on Google's Map/Reduce paper. Hadoop powers Yahoo! Search, Facebook and many other sites' data warehouses.
Today's rich IDEs make a lot of tasks easier... usually. With Java and its IDEs you often end up spending more time than you anticipated just setting up a project, especially when dealing with the complexities of J2EE: there are multiple versions of the specifications (1.3, 1.4, 5.0), each one with multiple implementations by different vendors, plus extensions (RichFaces, Struts, Seam, Spring...). You also have to choose the container (Tomcat, GlassFish, JBoss...). Last but not least, you can also pick different build tools (Ant, Maven...).
I was using WordPress, but I never found it either particularly user-friendly or particularly powerful. After doing some research on web forums, I decided to commit to Drupal, which is regarded as having the best architecture.