Creating Nested data (Parquet) in Spark SQL/Hive from non-nested data

April 4, 2015

0

Sometimes you need to create denormalized data from normalized data. For instance, you may have data that looks like CREATE TABLE flat ( propertyId string, propertyName string, roomname1 string, roomsize1 int, roomname2 string, roomsize2 int, .. ) but want something like CREATE TABLE nested ( propertyId string, propertyName string, rooms array<struct<roomname:string,roomsize:int>> ) […]

Posted in: programming

Panna Cotta, my recipe.

January 10, 2015

0

Panna cotta is one of my favorite desserts and one you can enjoy at many Italian restaurants here in LA. It looks and sounds fancy, but it’s incredibly easy to make if you just get the right ingredients, in particular the gelatin. It is also very important to get very fresh ingredients, since it’s basically […]

Posted in: cooking

Structured data in Hive: a generic UDF to sort arrays of structs

September 17, 2013

0

Introduction: Hive has a rich and complex data model that supports maps, arrays and structs, which can be mixed and matched, leading to arbitrarily nested structures, as in JSON. I wrote about a JSON SerDe in another post, and if you use it, you know it can lead to pretty complicated nested tables. Unfortunately, Hive […]

Posted in: programming

A JSON read/write SerDe for Hive

July 11, 2011

1

Today I finished coding another SerDe for Hive which, with my employer’s permission, I published on GitHub here: https://github.com/rcongiu/Hive-JSON-Serde.git. Since the code is still fresh in my mind, I thought I’d write another article on how to write a SerDe, since the official documentation on how to do it is scarce and you’d have to […]

Posted in: programming

Writing a Hive SerDe for LWES event files

October 27, 2009

0

I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data coming from the ad servers, on which we need to perform various kinds of analysis. Writing a map-reduce job is not difficult in principle – it’s just time-consuming and […]

Posted in: programming

Data Warehousing Books

October 27, 2009

0

With the constant increase in the quantity of data that companies collect and need to process, data warehousing is a job sector that’s expanding even in the recession. It is also living a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity in a manner similar […]

Posted in: programming

Joins in Hadoop using CompositeInputFormat

June 7, 2009

0

One of the first questions that a ‘traditional’ ETL engineer asks when learning Hadoop is, “How do I do a join?” For instance, how can we do in Hadoop something like querying for the names of all employees who are in a California city: SELECT e.name, c.name FROM employees e INNER JOIN cities c […]

Posted in: programming

Setting up a JSF Maven project in NetBeans (including working autocompletion for JSP/JSF)

April 6, 2009

0

Today’s rich IDEs make a lot of tasks easier… usually. With Java and its IDEs you often end up spending more time than you anticipated just to set up a project, especially when dealing with the complexities of J2EE: there are multiple versions of the specifications (1.3, 1.4, 5.0), each one with multiple implementations by different vendors plus […]

Posted in: programming