Browsing All Posts filed under »programming«

Creating Nested data (Parquet) in Spark SQL/Hive from non-nested data

April 4, 2015


Sometimes you need to create denormalized data from normalized data, for instance if you have data that looks like CREATE TABLE flat ( propertyId string, propertyName String, roomname1 string, roomsize1 string, roomname2 string, roomsize2 int, .. ) but we want something like   CREATE TABLE nested ( propertyId string, propertyName string, rooms <array<struct<roomname:string,roomsize:int>> )   […]

Structured data in Hive: a generic UDF to sort arrays of structs

September 17, 2013


Introduction Hive has a rich and complex data model that supports maps, arrays and structs, that could be mixed and matched, leading to arbitrarily nested structures, like in JSON. I wrote about a JSON SerDe in another post and if you use it, you know it can lead to pretty complicated nested tables. Unfortunately, hive […]

A JSON read/write SerDe for Hive

July 11, 2011


Today I finished coding another SerDe for Hive which, with my employer’s permission, I published on github here: Since the code is still fresh in my mind, I thought I’d write another article on how to write a SerDe, since the official documentation on how to do it it scarce and you’d have to […]

Writing a Hive SerDe for LWES event files

October 27, 2009


I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data that comes from the ad servers on which we need to perform various kinds of analysis. Writing a map-reduce job is not difficult in principle – it’s just time consuming and […]

Data Warehousing Books

October 27, 2009


With the constant increasing of the quantity of data that companies collect and need to process, Data Warehousing is a job sector that’s expnding even in the recession. It it also living a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity in a manner similar […]

Joins in Hadoop using CompositeInputFormat

June 7, 2009


One of the first questions that a ‘traditional’ ETL engineer asks when learning hadoop is, “How do I do a join ?” For instance, how can we do in hadoop something like querying for the names of all employees who are in a California city: SELECT, from employees e INNER JOIN cities c […]

Setting up a JSF Maven project in NetBeans (including working autocompletion for JSP/JSF)

April 6, 2009


Today’s rich IDEs make a lot of tasks easier…usually. With Java and its IDEs you often end up spending more time than you anticipated to just set up a project, especially when dealing with the complexities of J2EE: there are multiple versions of the specifications 1.3,1.4,5.0), each one with multiple implementations by different vendors plus […]