Browsing All Posts filed under »programming«

Streaming Data to HTTP using Akka Streams with Exponential Backoff on 429 Too Many Requests

March 12, 2019

HTTP/REST is probably the most used protocol to exchange data between different services, especially in today's microservice world...

Basic Authorization and htaccess-style authentication on the Play! Framework and Silhouette

August 18, 2018

Silhouette is probably the best library to implement authentication and authorization within the Play Framework. Git repo here: https://github.com/rcongiu/play-silhouette-basic-auth It is very powerful, as you can manage a common identity from multiple providers, so you can have users logging into your site from Google, Facebook, JWT, and many other methods. It also allows you to fine […]

Custom Window Function in Spark to create Session IDs

October 29, 2017

(Note: crossposted from my Nuvolatech blog.) If you’ve worked with Spark, you have probably written some custom UDFs or UDAFs. UDFs are ‘User Defined Functions’ that let you introduce complex logic in your queries/jobs, for instance to calculate a digest for a string, or to use a Java/Scala library in your queries.

Creating Nested data (Parquet) in Spark SQL/Hive from non-nested data

April 4, 2015

Sometimes you need to create denormalized data from normalized data, for instance if you have data that looks like CREATE TABLE flat ( propertyId string, propertyName string, roomname1 string, roomsize1 string, roomname2 string, roomsize2 int, .. ) but you want something like CREATE TABLE nested ( propertyId string, propertyName string, rooms array<struct<roomname:string,roomsize:int>> ) […]

Structured data in Hive: a generic UDF to sort arrays of structs

September 17, 2013

Hive has a rich and complex data model that supports maps, arrays and structs, which can be mixed and matched, leading to arbitrarily nested structures, like in JSON. I wrote about a JSON SerDe in another post and if you use it, you know it can lead to pretty complicated nested tables. Unfortunately, Hive […]

A JSON read/write SerDe for Hive

July 11, 2011

Today I finished coding another SerDe for Hive which, with my employer’s permission, I published on GitHub here: https://github.com/rcongiu/Hive-JSON-Serde.git. Since the code is still fresh in my mind, I thought I’d write another article on how to write a SerDe, since the official documentation on how to do it is scarce and you’d have to […]

Writing a Hive SerDe for LWES event files

October 27, 2009

I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data that comes from the ad servers on which we need to perform various kinds of analysis. Writing a map-reduce job is not difficult in principle – it’s just time consuming and […]

Data Warehousing Books

October 27, 2009

With the constant increase in the quantity of data that companies collect and need to process, Data Warehousing is a job sector that’s expanding even in the recession. It is also living a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity in a manner similar […]

Joins in Hadoop using CompositeInputFormat

June 7, 2009

One of the first questions that a ‘traditional’ ETL engineer asks when learning Hadoop is, “How do I do a join?” For instance, how can we do in Hadoop something like querying for the names of all employees who are in a California city: SELECT e.name, c.name FROM employees e INNER JOIN cities c […]

Setting up a JSF Maven project in NetBeans (including working autocompletion for JSP/JSF)

April 6, 2009

Today’s rich IDEs make a lot of tasks easier… usually. With Java and its IDEs you often end up spending more time than you anticipated just setting up a project, especially when dealing with the complexities of J2EE: there are multiple versions of the specifications (1.3, 1.4, 5.0), each one with multiple implementations by different vendors plus […]