hadoop

A JSON read/write SerDe for Hive

Today I finished coding another SerDe for Hive which, with my employer's permission, I published on GitHub here: https://github.com/rcongiu/Hive-JSON-Serde.git.

Since the code is still fresh in my mind, I thought I'd write another article on how to write a SerDe, since the official documentation on how to do it is scarce and you'd otherwise have to read the Hive code directly, like I had to do.

Writing a SerDe in Hive for Lwes event files

I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data that comes from the ad servers on which we need to perform various kinds of analysis.

Writing a map-reduce job is not difficult in principle; it's just time-consuming and requires the skills of a trained Java engineer, which wouldn't be needed were we using SQL. That's where Hive comes in: it allows us to query a Hadoop data store using a flavor of SQL.

 

Data Warehousing books

With the constant increase in the quantity of data that companies collect and need to process, data warehousing is a job sector that's expanding even in the recession. It is also enjoying a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity, much as Linux did 10 years ago. One of these technologies is Hadoop, a distributed filesystem and data processing framework based on Google's Map/Reduce paper. Hadoop powers Yahoo! Search, Facebook and many other sites' data warehouses.

Joins in Hadoop using CompositeInputFormat

One of the first questions that a 'traditional' ETL engineer asks when learning Hadoop is, "How do I do a join?"

For instance, how can we do in Hadoop the equivalent of querying for the names of all employees who are in a California city:

SELECT e.name, c.name FROM employees e INNER JOIN cities c
    ON e.city_id = c.id AND c.state = 'CA'
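
To give a rough idea of what such a join involves, here is a minimal sketch in plain Java of a merge join over two inputs that are already sorted by the join key. CompositeInputFormat relies on its inputs being sorted and partitioned the same way, and this is the same general idea, but the sketch does not use the Hadoop API at all. The class name MergeJoinSketch, the Employee and City types and their fields are hypothetical, chosen only to mirror the query above, and city ids are assumed to be unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Hadoop code and not the post's implementation.
class MergeJoinSketch {

    // Hypothetical row types mirroring the employees and cities tables.
    static final class Employee {
        final String name;
        final int cityId;
        Employee(String name, int cityId) { this.name = name; this.cityId = cityId; }
    }

    static final class City {
        final int id;
        final String name;
        final String state;
        City(int id, String name, String state) { this.id = id; this.name = name; this.state = state; }
    }

    /**
     * Joins employees to cities on employee.cityId == city.id and keeps only
     * cities in the given state. Assumes employees are sorted by cityId,
     * cities are sorted by id, and city ids are unique.
     */
    static List<String> join(List<Employee> employees, List<City> cities, String state) {
        List<String> out = new ArrayList<>();
        int e = 0;
        int c = 0;
        while (e < employees.size() && c < cities.size()) {
            Employee emp = employees.get(e);
            City city = cities.get(c);
            if (emp.cityId < city.id) {
                e++; // no matching city for this employee: inner join drops it
            } else if (emp.cityId > city.id) {
                c++; // no more employees for this city: move to the next one
            } else {
                if (state.equals(city.state)) {
                    out.add(emp.name + "\t" + city.name);
                }
                e++; // stay on the same city, other employees may share it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, already sorted by the join key.
        List<Employee> employees = Arrays.asList(
                new Employee("Ann", 1), new Employee("Bob", 1),
                new Employee("Cid", 2), new Employee("Dee", 4));
        List<City> cities = Arrays.asList(
                new City(1, "San Jose", "CA"), new City(2, "Reno", "NV"),
                new City(3, "Fresno", "CA"));
        for (String row : join(employees, cities, "CA")) {
            System.out.println(row);
        }
    }
}
```

Because both inputs are read once, in order, nothing has to be buffered or shuffled, which is what makes a join over pre-sorted inputs attractive when the data is already laid out that way.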