Roberto Congiu's blog (https://www.congiu.com) ...because every nerd has one

Streaming Data to HTTP using Akka Streams with Exponential Backoff on 429 Too Many Requests
https://www.congiu.com/streaming-data-to-http-using-akka-streams-with-exponential-backoff-on-429-too-many-requests/
Tue, 12 Mar 2019

HTTP/REST is probably the most used protocol to exchange data between different services, especially in today's microservice world.

Before I even start, let me make it clear this project/example builds on this blog post from Colin Breck: https://blog.colinbreck.com/backoff-and-retry-error-handling-for-akka-streams/ .

I stumbled on his post while working on an integration project, but I wanted to build on it to include a couple more features, plus I wanted to put together a little github project that can be used to play with different features/settings, while also making it a little more general.

Working on integration projects using Akka Streams, I was looking for a way to send potentially large amounts of data streaming over HTTP, while at the same time slowing down when the server starts to complain.

Now, the ‘proper’ way that a server tells the client to slow down is by sending a 429 Too Many Requests code.

In this case, the behavior we want is exponential backoff: keep retrying (up to a maximum number of times), but wait an exponentially increasing amount of time between attempts (for instance, doubling it at every try).

This is well explained in Colin's blog post
https://blog.colinbreck.com/backoff-and-retry-error-handling-for-akka-streams/ : it is achieved by building a small stream for each request, which is retried on failure using Akka Streams' built-in exponential backoff.

At first I was skeptical, since constructing a stream for every request seemed expensive; however, in my tests (run against a local HTTP server) each successful request took only a few milliseconds, including parsing the response.

One improvement I needed, however, was being able to select which responses should be retried and which should just cause the stream to fail altogether.

// generate random data
val ids = 1.to(10).map { _ => Id(UUID.randomUUID().toString) }
// A shared kill switch used to terminate the *entire stream* if needed
val killSwitch = KillSwitches.shared("ks")
// create the stream, using the killSwitch.
val aggregate = Source(ids) // the data source
   .viaMat(killSwitch.flow)(Keep.right)
// up to 4 parallel connections to HTTP server
   .mapAsync(parallelism = 4)(getResponse) 
   .map(_.value)
 // we calculate the sum. Should be equal to ids.length
   .runWith(Sink.fold(0)(_ + _))

In the above code, we create the stream, posting simple data (a random UUID string) and getting back a simple response (a number). For the sake of simplicity, the server returns a 1 on success, so in the stream we calculate the sum of all the ones. The materialized value of the stream must then be equal to the number of events the source sends, in this case 10, no matter how many times we retried them.

This is the model for our request and our response.

  // our simple model
  final case class Id(id:String)
  final case class Response(value:Int)
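
The snippets below also reference a few custom exceptions (in the output they show up under the Model object of the github project). Their definitions are not included in the post; a minimal sketch, assuming they simply extend Exception, could be:

  import akka.http.scaladsl.model.StatusCode

  // assumed definitions, matching how the exceptions are used in the snippets below
  case object TooManyRequestsException extends Exception("429 Too Many Requests")
  final case class DatabaseUnexpectedException(statusCode: StatusCode)
    extends Exception(s"unexpected status code $statusCode")
  final case class StreamFailedAfterMaxRetriesExceptionForId(id: Id)
    extends Exception(s"stream failed after max retries for id ${id.id}")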

The magic is done by the getResponse function, which creates a mini-stream with backoff, as described in Colin's blog post:

// use a superpool vs SingleRequest: connection pooling, automatic keepalive, etc.
val pool = Http().superPool[Id]()  

def getResponse(id: Id): Future[Model.Response] = timedFuture {
    RestartSource.withBackoff(
      minBackoff = 20.milliseconds,
      maxBackoff = 30.seconds,
      randomFactor = 0.2,
      maxRestarts = 10
    ) { () =>
        val request = HttpRequest(uri = s"http://localhost:8080/${id.id}")
        Source.single((request, id)).via(pool)
          .mapAsync(parallelism = 1)(handleResponse)
      }.runWith(Sink.head).recover {
        case _ => throw StreamFailedAfterMaxRetriesExceptionForId(id)
      }
  }

Now, there's a lot going on here. We're connecting to our own test server, sending a UUID. We are using a SuperPool, which will handle HTTP connections, keepalive, etc. for us (see https://doc.akka.io/docs/akka-http/current/client-side/request-level.html#flow-based-variant).

We set all the parameters for exponential backoff (see https://doc.akka.io/docs/akka/2.5/stream/operators/RestartSource/withBackoff.html ). The request will be retried up to 10 times (maxRestarts) before failing. We catch the failure and rethrow it as our own error.
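
The timedFuture wrapper used in getResponse is not shown in the post either; it just measures how long each request takes, which is presumably what produces the "Execution time avg" line in the output below. A minimal sketch, assuming a shared collection of timings, might be:

import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{ExecutionContext, Future}

val timings = new ConcurrentLinkedQueue[Long]()

def timedFuture[T](f: => Future[T])(implicit ec: ExecutionContext): Future[T] = {
  val start = System.currentTimeMillis()
  val future = f
  // record the elapsed time once the future completes, whether it succeeded or failed
  future.onComplete(_ => timings.add(System.currentTimeMillis() - start))
  future
}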

Now, the last part is the handleResponse function. You'll likely need to change this function to fit your needs, since it depends on how the server you're dealing with behaves. Some servers will actually be able to rate limit your requests by sending back 429 errors, but some others may not be sophisticated enough and will just start failing with 500s.

It also depends on your requirements: there may be a condition (like a 500) that means you're done streaming. The code below shows how to handle the two situations differently, by applying exponential backoff on 429 and terminating the stream on 500:

  val handleResponse:PartialFunction[(Try[HttpResponse],Id),Future[Response]]  =  {
      // we can connect, and we get a valid response. Response however could be 500, etc.
    case (Success(s), _) => s match {
      case HttpResponse(StatusCodes.OK, _, entity, _) =>
        Unmarshal(entity).to[Response]
      case HttpResponse(StatusCodes.TooManyRequests,_,entity,_) =>
        entity.discardBytes()
        throw TooManyRequestsException
      case HttpResponse(statusCode, _, entity, _) =>
        entity.discardBytes()
        killSwitch.shutdown()
        throw DatabaseUnexpectedException(statusCode)
    }
    // something went wrong, can't connect.
    case (Failure(f), _) => throw f
  }

It's worth noting that exponential backoff also happens when we can't connect to the server: we first handle the Try, which fails if we can't connect, and then handle the different status codes within a Success, which means we got a response from the server, but the server may be returning an error.

To test it, we can first create a server that randomly returns either OK or Too Many Requests:

  val server = new TestHttpServer(0.2D, StatusCodes.TooManyRequests)
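
The TestHttpServer class lives in the github project; a minimal sketch of what it has to do (reply with the given error code with the given probability, otherwise return a body that unmarshals to Response(1)) could look like this, using Akka HTTP 10.1.x routing. The class body below is an assumption, not the actual code from the repository:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, StatusCode}
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import scala.util.Random

class TestHttpServer(errorProbability: Double, errorCode: StatusCode)
                    (implicit system: ActorSystem) {
  private implicit val mat: ActorMaterializer = ActorMaterializer()

  // accept any request to /<id>, fail randomly with the configured status code
  private val route = path(Segment) { _ =>
    if (Random.nextDouble() < errorProbability) {
      println("Return error")
      complete(errorCode)
    } else {
      println("Return ok")
      complete(HttpEntity(ContentTypes.`application/json`, """{"value": 1}"""))
    }
  }

  Http().bindAndHandle(route, "localhost", 8080)
}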

You’ll see output like the following:

 
Return ok
Return ok
Return error
Return ok
[WARN] [03/12/2019 13:12:25.770] [QuickStart-akka.actor.default-dispatcher-9] [RestartWithBackoffSource(akka://QuickStart)] Restarting graph due to failure. stack_trace:
com.nuvola_tech.akka.streams.Model$TooManyRequestsException$
....

Return ok
(Sum : 10,
Map(6ab726e5-78c9-44ed-830b-c94dff66c7c3 -> 1,
f931f6fa-7917-48ad-bb58-5a5ec53f65ee -> 1,
d3ece91a-2b5e-4868-bc43-f60ef1d657dd -> 2,
7e943c2f-fc08-44fa-8667-be1b0e7253d7 -> 1,
5496e659-9083-49cd-8864-88af82e49c20 -> 1,
0dc9a0ab-e0ac-45e9-be26-a1bfdf8dd5ce -> 2,
64486cfc-a1ac-4663-aed5-51bd547765bb -> 1,
b39521a2-7592-4508-bf6f-81a7cedd33a5 -> 1,
33e8e8ed-c7f9-49e4-bd1c-e6ce83e2150c -> 1,
56b6547b-a635-48cb-98aa-004de4abff65 -> 1))
Execution time avg: 98: List(46, 3, 3, 371, 3, 3, 4, 351)

all terminated

The (edited) output above shows us that two tries failed and were retried. This is also reflected in the execution times. The sum is correct (10).

If we set the error probability to 1, the stream will fail after retrying the request up to the maximum number of times, with com.nuvola_tech.akka.streams.Model$StreamFailedAfterMaxRetriesExceptionForId

Last, if you set the Test Server to return InternalServerError, the stream will terminate at the first error:


Starting HTTP server
Return ok
Return ok
Return ok
Return error
[WARN] [03/12/2019 13:20:31.399] [QuickStart-akka.actor.default-dispatcher-10] [RestartWithBackoffSource(akka://QuickStart)] Restarting graph due to failure. stack_trace:
com.nuvola_tech.akka.streams.Model$DatabaseUnexpectedException
Return ok
(Sum : 4,
Map(85966d9d-bbb3-4560-a06f-784cc20f043f -> 1,
42503ccb-b194-4d21-a831-f92a9ca21e47 -> 1,
5a83b3ef-4816-4b77-8058-df7ee90c9bc1 -> 2,
5d716bcd-42c9-4a76-9503-b662d10d1455 -> 1))
Execution time avg: 337: List(351, 324)

Note that the sum is now 4, because only 4 calls were successful, but also that one request was retried before the kill switch (which works in parallel) was able to terminate the stream. This was ok in my case.

You can play with the code on github here: https://github.com/nuvolatech/akka-streams-http-example

Basic Authorization and htaccess-style authentication on the Play! Framework and Silhouette
https://www.congiu.com/basic-authorization-and-htaccess-style-authentication-on-the-play-framework-an-silhouete/
Sat, 18 Aug 2018

Silhouette is probably the best library to implement authentication and authorization within the Play Framework.

Git repo here : https://github.com/rcongiu/play-silhouette-basic-auth

It is very powerful, as you can manage a common identity across multiple providers, so you can have users logging into your site from Google, Facebook, JWT, and many other methods.

It also allows you to fine tune both authentication – that a user has valid credentials – and authorization – that after being authenticated, that user also has the right permissions to access a particular resource.

Its API is very elegant, as you can just change the type of your controller Action to SecuredAction.

It is, however, a pretty sizable framework, and it can be daunting for a beginner. It needs you to set up identities, often on a database, and you have to build your own user credentials management.

Sometimes however you may just want a very simple authentication, for example, when prototyping or writing a Proof of Concept (POC). You also may be tasked to replace an application running on Apache or NGINX that uses a htpasswd password file.

I looked around to find an example for implementing Basic Authentication with play and I was pretty surprised that I couldn’t quite find one.
Let me be clear here: Basic Authentication is not the best idea for security, but sometimes you just need something that does the job of protecting your app with a password, and you don't have the time to deal with full blown user management.
As a side note, if you’re working with containers, you could use a nginx reverse proxy to manage authentication, but sometimes you can’t do that. In my case, the play application had specific logic to execute on authentication failure, so I couldn’t just delegate it to nginx.
Or as I just said, you may just want to be backward compatible with an older app that uses an htpasswd file.

In this case, you can use the code here as a template.

Setting up the Dependency Injection and the Environment

Silhouette relies heavily on DI to configure its objects. First of all you have to configure an Environment, where you declare what your user model is and what is going to authenticate the user.

From silhouette’s documentation, an environment may look like:

trait SessionEnv extends Env {
  type I = User
  type A = SessionAuthenticator
}

We need our User definition. It depends on your case: the User class will hold all the relevant user information you want to be passed to the controller.
In this example, I will keep it minimal and will only keep the username there. In more complex cases you may want the date of last login, the email, name, etc.

Our definition is again the simplest, this is our utils/auth/User.scala:

package utils.auth

import com.mohiva.play.silhouette.api.Identity

/**
  * Model for your users. In Basic Auth, you may only have a username,
  * so we keep it simple and we just have the username here.
  * @param username
  */
case class User(username: String) extends Identity
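
The Module below also refers to a DefaultEnv type, which is not shown in the post. Following the Env pattern above, and given that we bind a DummyAuthenticator (see point 4 below), a minimal sketch (assumed; the real one is in the repo) is:

package utils.auth

import com.mohiva.play.silhouette.api.Env
import com.mohiva.play.silhouette.impl.authenticators.DummyAuthenticator

trait DefaultEnv extends Env {
  type I = User               // our identity type
  type A = DummyAuthenticator // no state is kept between requests
}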

And in the Module file, where we define all the DI and build the environment, we have:

import com.google.inject.{AbstractModule, Provides}
import java.time.Clock

import com.mohiva.play.silhouette.api.repositories.AuthInfoRepository
import com.mohiva.play.silhouette.api._
import com.mohiva.play.silhouette.api.services.AuthenticatorService
import com.mohiva.play.silhouette.api.util.PasswordHasherRegistry
import com.mohiva.play.silhouette.impl.authenticators.{DummyAuthenticator, DummyAuthenticatorService}
import com.mohiva.play.silhouette.impl.providers.BasicAuthProvider
import com.mohiva.play.silhouette.password.BCryptPasswordHasher
import _root_.services._
import utils.auth._
import utils.auth.repositories.HtpasswdAuthInfoRepository
// use scala guice binding
import net.codingwell.scalaguice.{ ScalaModule, ScalaPrivateModule }
import scala.concurrent.ExecutionContext.Implicits.global

class Module extends AbstractModule with ScalaModule {

  override def configure() = {
    // Use the system clock as the default implementation of Clock
    bind[Clock].toInstance(Clock.systemDefaultZone)
    // Ask Guice to create an instance of ApplicationTimer when the
    // application starts.
    bind[ApplicationTimer].asEagerSingleton()
    // Set AtomicCounter as the implementation for Counter.
    bind[Counter].to[AtomicCounter]

    // authentication - silhouette bindings
    bind[Silhouette[DefaultEnv]].to[SilhouetteProvider[DefaultEnv]]
    bind[RequestProvider].to[BasicAuthProvider].asEagerSingleton()

    bind[UserService].to[ConfigUserServiceImpl]
    bind[PasswordHasherRegistry].toInstance(PasswordHasherRegistry(
      current = new BCryptPasswordHasher(),
      // if you want
      // current = new DummyPasswordHasher(),
      deprecated = Seq()
    ))
    bind[AuthenticatorService[DummyAuthenticator]].toInstance(new DummyAuthenticatorService)

    // configure a single username/password in play config
    //bind[AuthInfoRepository].to[ConfigAuthInfoRepository].asEagerSingleton()
    // or bind to htpasswd, set its location and its crypto hashing algorithm in config
    bind[AuthInfoRepository].to[HtpasswdAuthInfoRepository]
  }


  @Provides
  def provideEnvironment(
                          userService:          UserService,
                          authenticatorService: AuthenticatorService[DummyAuthenticator],
                          eventBus:             EventBus,
                          requestProvider:      RequestProvider
                        ): Environment[DefaultEnv] = {

    Environment[DefaultEnv](
      userService,
      authenticatorService,
      Seq(requestProvider),
      eventBus
    )
  }
}

This is actually what does most of the magic. A few comments:

  1. We use @Provides for the method that builds the environment
  2. We use a simple UserService that creates the User object from the loginInfo:
    override def retrieve(loginInfo: LoginInfo): 
          Future[Option[User]] = Future { 
               Some(User(loginInfo.providerKey)) }

    As you can see, it creates a User object using the username entered.

  3. We created both an abstract UserService trait and an implementation, ConfigUserServiceImpl (a minimal sketch of both is shown after this list)
  4. We use DummyAuthenticator for Basic Authorization because the Authenticator is needed only when some state is saved between two HTTP requests (cookie,session), while in Basic Authentication every request is authenticated through a Request Authenticator.
  5. The request authenticator is passed in the environment as
    Seq(requestProvider)

    and dynamically injected with

    bind[RequestProvider].to[BasicAuthProvider].asEagerSingleton()
  6. The password could be stored anywhere. I wrote two classes to get the password from a htpasswd file or from the play config. You bind one with either
    bind[AuthInfoRepository].to[HtpasswdAuthInfoRepository]

    or

    bind[AuthInfoRepository].to[ConfigAuthInfoRepository]
  7. You also have to pick the hashers, since the password is usually stored hashed. It is done in
    bind[PasswordHasherRegistry].toInstance(PasswordHasherRegistry(
          current = new BCryptPasswordHasher(),
          // if you want
          // current = new DummyPasswordHasher(),
          deprecated = Seq()
        ))
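
Point 3 above mentions the UserService abstraction; a minimal sketch of the trait and of a ConfigUserServiceImpl-style implementation (assumed here, the real code is in the repo) could be:

package services

import com.mohiva.play.silhouette.api.LoginInfo
import com.mohiva.play.silhouette.api.services.IdentityService
import utils.auth.User
import scala.concurrent.Future

// the trait just narrows Silhouette's IdentityService to our User type
trait UserService extends IdentityService[User]

// with Basic Auth there is no user store: we simply wrap the username
// from the LoginInfo into a User
class ConfigUserServiceImpl extends UserService {
  import scala.concurrent.ExecutionContext.Implicits.global
  override def retrieve(loginInfo: LoginInfo): Future[Option[User]] =
    Future(Some(User(loginInfo.providerKey)))
}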

Using htpasswd

You can use apache’s htpasswd to generate a password file to be used with these classes:

htpasswd -c -B  filename myuser

If you’re using htpasswd, you have to

bind[AuthInfoRepository].to[HtpasswdAuthInfoRepository]

and configure where htpasswd is in the play configuration setting security.htpasswd.file.

That class reads the htpasswd file, retrieves the user's hashed password and compares it to the hash of the supplied password.
Note that only bcrypt is supported (no md5 or crypt).
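
The core of such a repository just has to look up the bcrypt hash for a username in the htpasswd file and hand it to Silhouette as a PasswordInfo. A rough sketch (the helper name is hypothetical, and the "bcrypt" hasher id is assumed to match what BCryptPasswordHasher expects):

import com.mohiva.play.silhouette.api.util.PasswordInfo
import scala.io.Source

// htpasswd lines generated with -B look like "myuser:$2y$05$<hash>"
def findHashedPassword(htpasswdFile: String, username: String): Option[PasswordInfo] = {
  val source = Source.fromFile(htpasswdFile)
  try {
    source.getLines()
      .map(_.trim.split(":", 2))
      .collectFirst { case Array(user, hash) if user == username =>
        PasswordInfo(hasher = "bcrypt", password = hash)
      }
  } finally source.close()
}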

Hashing, or not hashing

Sometimes you want the password stored in cleartext. It is insecure, but it may be fine for a simple prototype; in that case use:

  bind[PasswordHasherRegistry].toInstance(PasswordHasherRegistry(
      current = new DummyPasswordHasher(),
      deprecated = Seq()
    ))

I can’t stress enough how insecure it is to use a cleartext password, but sometimes you may be using a combination of other systems and it may be your only choice. I list it here as a last resort kind of tool.

Hope you enjoyed this article about implementing the simplest yet quickest kind of authentication on a play app.
If you want to integrate it quickly check it out from github and just plug in the auth directory and add the bindings in Module.scala.
Git repo here : https://github.com/rcongiu/play-silhouette-basic-auth

Custom Window Function in Spark to create Session IDs
https://www.congiu.com/custom-window-function-in-spark-to-create-session-ids/
Sun, 29 Oct 2017

(Note: crossposted from my Nuvolatech Blog.)

If you've worked with Spark, you have probably written some custom UDFs or UDAFs.
UDFs are 'User Defined Functions': they let you introduce complex logic in your queries/jobs, for instance to calculate a digest for a string, or to use a Java/Scala library in your queries.

UDAF stands for ‘User Defined Aggregate Function’ and it works on aggregates, so you can implement functions that can be used in a GROUP BY clause, similar to AVG.

You may not be familiar with window functions, which are similar to aggregate functions but add a layer of complexity, since they are applied within a PARTITION BY clause. An example of a window function is RANK(). You can read more about window functions here.

While aggregate functions work over a group, window functions work over a logical window of records and allow you to produce new columns from the combination of a record and one or more other records in the window.
Describing what window functions are is beyond the scope of this article, so for that refer to the previously mentioned article from Databricks; in particular, we are interested in the 'previous event in time for a user' in order to figure out sessions.

There is plenty of documentation on how to write UDFs and UDAFs, see for instance This link for UDFs or this link for UDAFs.

I was surprised to find out there's not much info on how to build a custom window function, so I dug into the Spark source code and started looking at how window functions are implemented. That opened up a whole new world to me: window functions, although conceptually similar to UDAFs, use a lower-level Spark API than UDAFs, as they are written using Catalyst expressions.

Sessionization basics

Now, for what kind of problem do we need window functions in the first place?
A common problem when working on any kind of website is to determine 'user sessions', periods of user activity. If a user is inactive for a certain time T, the next activity is considered a new 'session'. Statistics over sessions are used, for instance, to determine if the user is a bot, to find out which pages have the most activity, etc.

Let’s say that we consider a session over if we don’t see any activity for one hour (sixty minutes). Let’s see an example of user activity, where ‘event’ has the name of the page the user visited and time is the time of the event. I simplified it, since the event would be a URL, while the time would be a full timestamp, and the session id would be generated as a random UUID, but I put simpler names/times just to illustrate the logic.

user   event  time   session
user1  page1  10:12  session1 (new session)
user1  page2  10:20  session1 (same session, 8 minutes from last event)
user1  page1  11:13  session1 (same session, 53 minutes from last event)
user1  page3  14:12  session2 (new session, 3 hours after last event)

Note that this is the activity for one user. We do have many users, and in fact partitioning by user is the job of the window function.

Digging in

It's better to use an example to illustrate how the function works with respect to the window definition.
Let's assume we have very simple user activity data, with a user ID called user, a numeric timestamp ts, and a session ID session that may already be present. While we may start with no sessions whatsoever, in most practical cases we may be processing data hourly, so at hour N + 1 we want to continue the sessions
we calculated at hour N.

Let’s create some test data and show what we want to achieve.

// our Data Definition
case class UserActivityData(user:String, ts:Long, session:String)

// an arbitrary start timestamp and a helper constant (assumed, not shown in the original post)
val one_minute = 60 * 1000L
val st = System.currentTimeMillis()

// our sample data
val d = Array[UserActivityData](
    UserActivityData("user1",  st, "ss1"),
    UserActivityData("user2",  st +   5*one_minute, null),
    UserActivityData("user1",  st +  10*one_minute, null),
    UserActivityData("user1",  st +  15*one_minute, null),
    UserActivityData("user2",  st +  15*one_minute, null),
    UserActivityData("user1",  st + 140*one_minute, null),
    UserActivityData("user1",  st + 160*one_minute, null))

// creating the DataFrame (imports added for completeness)
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{functions => f}

val sqlContext = new SQLContext(sc)
val df = sqlContext.createDataFrame(sc.parallelize(d))

// Window specification
val specs = Window.partitionBy(f.col("user")).orderBy(f.col("ts").asc)
// create the session
val res = df.withColumn( "newsession", 
   calculateSession(f.col("ts"), f.col("session")) over specs)

First, the window specification: sessions are created per user, and the ordering is of course by timestamp.
Hence, we want to apply the function with partitionBy on user and orderBy on timestamp.

We want to write a calculateSession function that will use the following logic:


IF(no previous event) create new session
ELSE (if current event was past session window)
THEN create new session
ELSE use current session

and will produce something like this:

user ts session newsession
user1 1508863564166 f237e656-1e.. f237e656-1e..
user1 1508864164166 null f237e656-1e..
user1 1508864464166 null f237e656-1e5..
user1 1508871964166 null 51c05c35-6f..
user1 1508873164166 null 51c05c35-6f..
user2 1508863864166 null 2c16b61a-6c..
user2 1508864464166 null 2c16b61a-6c..

 

Note that we are using random UUIDs as it’s pretty much the standard, and we’re shortening them for typographical reasons.

As you see, for each user, it will create a new session whenever the difference between two events is bigger than the session threshold.

Internally, for every record, we want to keep track of:

  • The current session ID
  • The timestamp of the previous event

This is going to be the state that we must maintain. Spark takes care of initializing it for us.
It is also going to be the parameters the function expects.

Let’s see the skeleton of the function:

// object to collect my UDWFs
object MyUDWF {
  val defaultMaxSessionLengthms = 3600 * 1000L // longer than this, and it's a new session

  case class SessionUDWF(timestamp:Expression, session:Expression,
           sessionWindow:Expression = Literal(defaultMaxSessionLengthms)) 
      extends AggregateWindowFunction {
    self: Product =>

    override def children: Seq[Expression] = Seq(timestamp, session)
    override def dataType: DataType = StringType

    protected val zero = Literal( 0L )
    protected val nullString = Literal(null:String)

    protected val currentSession = AttributeReference("currentSession", 
                   StringType, nullable = true)()
    protected val previousTs =    AttributeReference("previousTs", 
                   LongType, nullable = false)()

    override val aggBufferAttributes: Seq[AttributeReference] =  
                    currentSession  :: previousTs :: Nil

 
    override val initialValues: Seq[Expression] =  nullString :: zero :: Nil
    override def prettyName: String = "makeSession"

    // we have to write these ones
    override val updateExpressions: Seq[Expression] = ...
    override val evaluateExpression: Expression = ...
  
  }
}

A few notes here:

  • Our ‘state’ is going to be a Seq[AttributeReference]
  • Each AttributeReference must be declared with its type. As we said, we keep the current Session and the timestamp of the previous one.
  • We initialize it by overriding initialValues
  • For every record within the window, Spark will first call updateExpressions, then produce the values by calling evaluateExpression

Now it's time to implement the updateExpressions and evaluateExpression functions.

 
    // this is invoked whenever we need to create a new session ID. You can use your own logic; here we create UUIDs
    protected val  createNewSession = () => org.apache.spark.unsafe.types.
              UTF8String.fromString(UUID.randomUUID().toString)

    // initialize with no session, zero previous timestamp
    override val initialValues: Seq[Expression] =  nullString :: zero :: Nil

    // if a session is already assigned, keep it, otherwise, assign one
    override val updateExpressions: Seq[Expression] =
      If(IsNotNull(session), session, assignSession) ::
        timestamp ::
        Nil

    // assign session: if previous timestamp was longer than interval, 
    // new session, otherwise, keep current.
    protected val assignSession =  If(LessThanOrEqual(
          Subtract(timestamp, aggBufferAttributes(1)), sessionWindow),
      aggBufferAttributes(0), 
      ScalaUDF( createNewSession, StringType, children = Nil))

    // just return the current session in the buffer
    override val evaluateExpression: Expression = aggBufferAttributes(0)

Notice how we use catalyst expressions, while in normal UDAFs we just use plain scala expressions.

Last thing, we need to declare a static method that we can invoke from the query that will instantiate the function. Notice how I created two, one that allows the user to specify what’s the max duration of a session, and one that takes the default:

  def calculateSession(ts:Column,sess:Column): Column = 
         withExpr { 
           SessionUDWF(ts.expr,sess.expr, Literal(defaultMaxSessionLengthms)) 
         }
  def calculateSession(ts:Column,sess:Column, sessionWindow:Column): Column =
         withExpr { 
            SessionUDWF(ts.expr,sess.expr, sessionWindow.expr) 
       }
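
The withExpr helper used above is not shown in the post; it just wraps a Catalyst Expression back into a Column, mirroring the private helper Spark uses in its own functions object. A minimal sketch (assuming Spark 2.x, where Column has a public constructor taking an Expression):

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.catalyst.expressions.Expression

  def withExpr(expr: Expression): Column = new Column(expr)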

Now creating session IDs is as easy as:

// Window specification
val specs = Window.partitionBy(f.col("user")).orderBy(f.col("ts").asc)
// create the session
val res = df.withColumn( "newsession", 
   calculateSession(f.col("ts"), f.col("session"), 
     f.lit(10*1000)) over specs) // 10 seconds. Duration is in ms.

Notice that here we specified 10 second sessions.

There’s a little more piping involved which was omitted for clarity, but you can find the complete code, including unit tests, in my github project

Creating Nested data (Parquet) in Spark SQL/Hive from non-nested data
https://www.congiu.com/creating-nested-data-parquet-in-spark-sql/
Sat, 04 Apr 2015
    Sometimes you need to create denormalized data from normalized data, for instance if you have data that looks like

    CREATE TABLE flat (
      propertyId string,
      propertyName String,
      roomname1 string,
      roomsize1 int,
      roomname2 string,
      roomsize2 int,
      ..
    )

    but we want something like

     

    CREATE TABLE nested (
       propertyId string,
       propertyName string,
       rooms array<struct<roomname:string,roomsize:int>>
    )

     

    This can be done with a pretty horrific query, but we want to do it in Spark SQL by manipulating the rows programmatically.
    Let's go step by step: loading data from a CSV file with a flat structure, and inserting it into a nested Hive table.
    These commands can be run from spark-shell. Later, when we write the buildRecord() function, we'll have to wrap everything in an object, because any code that is going to be executed on the workers needs to extend the Serializable trait.
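
    For reference, a data.csv consistent with the Hive output shown at the end of the post could look like this (the header names are assumed from the flat table definition above):

    propertyId,propertyName,roomname1,roomsize1,roomname2,roomsize2
    bhaa123,My house,kitchen,134,bedroom,345
    pasa372,Other house,living room,433,bedroom,332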

     

    import com.databricks.spark.csv._
    import org.apache.spark.sql.Row
    // build hive context
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // load data (flat)
    val data = hc.csvFile("hdfs:///data/data.csv")
    // make hive aware of our RDD as a table
    hc.registerRDDAsTable(data,"data")
    

    Notice that we used

    hc.registerRDDAsTable(data, "data")

    instead of

    data.registerTempTable()

    I found that in spark 1.2.1, registerTempTable won’t make the table available to hive, and if you want to transfer data between actual hive tables and temporary tables, you have to use registerRDDAsTable or you’ll get a ‘table not found’ error from hive.

     

    SchemaRDDs return data in the form of objects of class Row. Row is also how SchemaRDDs expect to receive data, and Hive tables are basically one form of SchemaRDD.
    If an RDD built from a CSV file had the same schema, we could just do something like

    hc.sql("insert into table1 select * from table2")

     

    but in this case, before inserting, we have to transform the data so it has the same structure as the table we want to put it in.

    We also observe that the structure of the record is two scalars, followed by four room columns that we will fold into an array of two structs.

    We want to store it in a hive nested table, so we create it:

     

      hc.sql("""CREATE TABLE IF NOT EXISTS nested (
       propertyId string,
       propertyName string,
       rooms array<struct<roomname:string,roomsize:int>>
    ) STORED AS PARQUET
    """)

     

    We can then build the record as:

     val nestedRDD = data.map(buildRecord(_))
    
    // this builds a nested record
     def buildRecord(r:Row):Row = {
            println(r)
        var res  = Seq[Any]()
        // takes the first two elements
        res = res ++ r.slice(0,2) 
        // now res = [ 'some id','some name']
    
        // this will contain all the array elements
        var ary = Seq[Any]() 
        // we assume there are 2 groups of columns
        for (i <- 0 to 1 ) {      
           // 0-based indexes, takes (2,3) (4,5) .. 
           //and converts to appropriate type
           ary = ary :+ Row( r.getString( 2 + 2 * i), 
                             r.getString(2 + 1 + 2*i).toInt )
        }
        // adds array as an element and returns it
        res = res :+ ary 
        Row.fromSeq(res)
      }
    

    Notice a few things here:

    • we had to convert the data appropriately. CSV files have a header with the field names, but not the types, so we must know in advance how to convert the data. This could be done with a case class in Scala (see the sketch after this list).
    • We convert the scalars; when we have an array we just build a sequence (in this case a list), and when we have a struct we use Row
    • Rows can be built in two ways, one as Row( element1, element2,..), but if you want to build them from a sequence, use Row.fromSeq like above.
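
    As mentioned in the first bullet, a case class can make the expected column types explicit before nesting; a small sketch (hypothetical, not part of the original code):

    import org.apache.spark.sql.Row

    // describes the flat CSV record with proper types
    case class FlatProperty(propertyId: String, propertyName: String,
                            roomname1: String, roomsize1: Int,
                            roomname2: String, roomsize2: Int)

    def parse(r: Row): FlatProperty =
      FlatProperty(r.getString(0), r.getString(1),
                   r.getString(2), r.getString(3).toInt,
                   r.getString(4), r.getString(5).toInt)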

    Assuming the table called ‘nested’ was created as the CREATE TABLE definition earlier, we can use it to infer its schema and apply it to the newly built rdd.

    // copy schema from hive table and apply to RDD
    val nested = hc.sql("select * from nested limit 0")
    val nestedRDDwithSchema = hc.applySchema(nestedRDD, nested.schema)

     

    now we can insert, after registering the new rdd as a table

     

    hc.registerRDDAsTable(nestedRDDwithSchema, "nestedrdd")
    hc.sql("insert into nested select * from nestedrdd")

     

    et voilà !
    Now data is available in hive/parquet/sparksql as nested:

    hive> select * from nested;
    OK
    bhaa123	My house	[{"roomname":"kitchen","roomsize":134},{"roomname":"bedroom","roomsize":345}]
    pasa372	Other house	[{"roomname":"living room","roomsize":433},{"roomname":"bedroom","roomsize":332}]

    Let’s see the complete code:

    import com.databricks.spark.csv._
    import org.apache.spark.sql.Row
    import org.apache.spark.{SparkConf,SparkContext}
    
    object nesting extends Serializable {
     def main(args: Array[String])  {
      val sc = new SparkContext(new SparkConf())
    
      val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    
      // Change this to your file location
      val data = hc.csvFile("file:///data/data.csv")
      hc.registerRDDAsTable(data,"data")
    
      hc.sql("""CREATE TABLE IF NOT EXISTS nested (
       propertyId string,
       propertyName string,
       rooms array<struct<roomname:string,roomsize:int>>
    ) STORED AS PARQUET
    """)
    
      val nestedRDD = data.map(buildRecord(_))
    
      // after the map, Spark does not know the schema
      // of the result RDD. We can just copy it from the
      // hive table using applySchema
      val nested = hc.sql("select * from nested limit 0")
      val schemaNestedRDD = hc.applySchema(nestedRDD, nested.schema)
    
      hc.registerRDDAsTable(schemaNestedRDD,"schemanestedrdd")
    
      hc.sql("insert overwrite table nested select * from schemanestedrdd")
    
     }
    
      def buildRecord(r:Row):Row = {
            println(r)
        var res  = Seq[Any]()
        // takes the first two elements
        res = res ++ r.slice(0,2) 
        // now res = [ 'some id','some name']
        var ary = Seq[Any]() // this will contain all the array elements
        // we assume there are 2 groups of columns
        for (i <- 0 to 1 ) {      
           // 0-based indexes, takes (2,3) (4,5) .. 
           // and converts to appropriate type
           ary = ary :+ Row( r.getString( 2 + 2 * i), 
                             r.getString(2 + 1 + 2*i).toInt )
        }
        // adds array as an element and returns it
        res = res :+ ary 
        Row.fromSeq(res)
      }
    }

     

    The code is available on github here: https://github.com/rcongiu/spark-nested-example

    Panna Cotta, my recipe
    https://www.congiu.com/panna-cotta-my-recipe/
    Sat, 10 Jan 2015

    Panna cotta is one of my favorite desserts and one you can enjoy at many Italian restaurants here in LA. It looks and sounds fancy, but it's incredibly easy to make if you just get the right ingredients, in particular the gelatin. It is also very important to get very fresh ingredients: since it's basically solid cream, the fresher the cream, the better.

    [Photo: panna cotta]

    Here it is!!

    Preparation time

    A short amount of hands-on work, plus a couple of hours of waiting for it to chill in the fridge.

    Nutritional Values

    • Serving size: 100g/3.5oz
    • Calories per serving:223
    • Fat per serving: 12g
    • Carbs per serving: 23g

    Ingredients

    • 150g /5 oz Sugar
    • 500 ml/ 2 cups cream
    • One vanilla bean (I got a bunch, from Amazon, but you may want to get fewer. It’s around $1/bean).
    • Gelatin (I got it in sheets for $4).
    • Some butter (optional, to grease the ramekins/mold).

    Equipment needed

    • Saucepan
    • A wooden spoon
    • A container to soak the gelatin
    • A paring knife for the vanilla bean
    • A mold or ramekins (you could make one big panna cotta, or make 4 individual servings).
    • A stainless steel sieve (optional), like one of these.

    Procedure

    First thing, soak 2 sheets of gelatin in a container with COLD water. Let it soak and soften (but not dissolve!) for at least 10 minutes (but not too much more than that!) while you prepare the other ingredients.

    With a paring knife, make an incision in the vanilla bean and scrape out its contents. You can learn how to do it from this youtube video.

    Put in the saucepan the cream, sugar, the scraped contents of the vanilla bean and the skin of the vanilla bean. The scraped bean still has lots of flavor!!

    Put the saucepan over medium heat. DON’T LET IT BOIL!! We still want that fresh milk flavor, not cooked milk. The goal is to heat the cream up enough to completely dissolve the gelatin.

    Once the cream is hot like hot coffee, but not boiling, take the gelatin sheets, squeeze them a little and add them to the cream. Lower the heat and stir until all the gelatin is completely dissolved (a couple of minutes). Now you can turn the heat off, and using the sieve, pour the  cream into the mold or ramekins, which you’ll have preemptively buttered. The sieve will trap the vanilla bean, which you could use to flavor your sugar bowl.

    At this point, you can put them directly in the fridge, but it’s better to let them cool down a bit, preferably in a water/ice water bath. It will take a few hours for them to solidify.

    You can serve it with some fruit sauce, made by cooking for about 20 minutes some blended mixed berries with whole berries and sugar, and letting the sauce cool. It’s also good with chocolate sauce!

    [Photo: the panna cotta at a friend's birthday, next to a bakery dessert]

    I made it for a friend's birthday, and as you can see in the picture, it looks good even next to the dessert bought from the bakery! It's the one on the bottom, if you hadn't guessed!

    Variations

    You can make it lighter by substituting some cream with whole milk (one cup cream, one cup milk instead of 2 cups of cream).

    Also, instead of vanilla, you could use two tablespoons of powdered cocoa (the unsweetened one, since there’s already sugar in the recipe) for chocolate panna cotta.

    Buon appetito!!!

    Structured data in Hive: a generic UDF to sort arrays of structs
    https://www.congiu.com/structured-data-in-hive-a-generic-udf-to-sort-arrays-of-structs/
    Tue, 17 Sep 2013

    Introduction

    Hive has a rich and complex data model that supports maps, arrays and structs, which can be mixed and matched, leading to arbitrarily nested structures, like in JSON. I wrote about a JSON SerDe in another post and if you use it, you know it can lead to pretty complicated nested tables.

    Unfortunately, Hive's HQL does not have many built-in functions to deal with nested data.

    One frequent need we have is to sort an array of structs by an arbitrary field. That is, if your table definition is something like

    CREATE TABLE mytable (
        ....
       friends array<struct<name string, age int>>,
     ...
    )

    you may actually want the friends array sorted by either age or name, depending on what you want to do with it. Unfortunately the built-in sort function will sort array elements according to their natural order, but it is not clear what that would be for a struct; I guess it would order by the first struct field, then the second.

    I thought this would be a good example of a simple generic UDF: since the actual sorting operation is trivial in Java, we can focus on the piping we'll have to put around it. You can find other examples of generic UDFs around; I wanted to write a simple one that exposes only the essential piping around a sort operation.

    Our goal is to write a generic UDF that could be used on the above table to do something like:

    SELECT sort_array_by(friends,'age') from mytable

    to sort your array of friends by the age field.
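
    Before it can be called, the UDF has to be registered in Hive. Assuming a jar name like array-struct-sort.jar (hypothetical; use whatever your build produces), the registration uses the class shown at the end of this post:

    ADD JAR /path/to/array-struct-sort.jar;
    CREATE TEMPORARY FUNCTION sort_array_by AS 'com.congiu.udf.ArrayStructSortUDF';

    SELECT sort_array_by(friends, 'age') FROM mytable;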

    UDFs vs Generic UDFs

    Let's take a step back and have a look at that 'generic' in UDF. You may not actually be familiar with UDFs at all, so let's see what a UDF (aka 'plain' UDF) is and how Generic UDFs differ from plain UDFs. A good read about the differences can be found here.

    A UDF, or 'User Defined Function', is a function you can code in Java to perform something you can't do in SQL. For instance, here at OpenX where I work, we have written UDFs to perform currency conversions so we can write things like

    SELECT SUM(currency_convert( p_coin, 'USD', p_revenue)) FROM ..

    where currency_convert is a function with a signature (STRING cur1,STRING cur2,BIGINT amount) that converts amount from cur1 to cur2 (in our example, whatever the publisher currency is to USD).

    This function is coded as a ‘plain’ UDF, so it’s coded as a java class that implements an evaluate method with that signature:

    public class CurrencyConverter extends UDF {
        CurrencyExchangeTable convTable = null;
    
        CurrencyExchangeTable getExchangeTable() throws Exception {
    	if(convTable == null) {
    	    convTable = new CurrencyExchangeTable();
    	} 
    	return convTable;
        }
    
        public long evaluate(final String from , final String to, long amount) 
                throws Exception {
            return (long) (amount * getExchangeTable().getConversionRate(from.toUpperCase(),to.toUpperCase()));
        }
    }

     

    This is pretty simple to write, and hive will infer the evaluate method to call and the return value type from your class using Java reflection.

    In Generic UDFs that is not the case: the developer needs to handle parameter and return value types explicitly, which gives more flexibility and better performance at the price of some added complexity. Also, AFAIK plain UDFs cannot handle complex (nested) types, so if that's what you need, you have to use Generic UDFs.

    The bottom line is: with plain UDFs your input parameters are fixed in number and type, as is the return value, while a Generic UDF can be coded to accept pretty much any set of inputs and outputs. Also, plain UDFs use Java reflection and can be considerably slower than a generic UDF.

     

    Hive Data Structures, ObjectInspector and DeferredObject

    I mentioned that Hive has a rich set of data structures. Besides primitive types (int, float, bigint, string, etc.) you can have structures that contain either primitives or other structures, so you can build arbitrarily nested data structures. You can use:

    • arrays/lists: can contain a set of elements, all of the same type
    • maps: can contain key/value pairs, with keys of one type and values of the same or another type
    • structs: they contain an ordered list of values, each one could be of a different type (like C structs)

    A record in a table is modeled as a struct, since each record is a set of values, each one possibly with a different type.

    The interesting part is how Hive implements these structures. You'd expect them to be mapped to the corresponding Java structures, but it's a little different from that. Whenever we need to read data from the serialized format, a java.lang.Object reference is passed as data, together with an associated ObjectInspector. ObjectInspectors have a category associated with them that can be one of PRIMITIVE, LIST, MAP, STRUCT or UNION. Let's skip UNION (which corresponds to Hive's uniontype); we see that each category maps to a different data structure. This way, Hive does not need to use introspection to read the data, nor does the deserialized object need to implement any interface or extend any subclass: Hive applies the Adapter design pattern to present a uniform interface to objects of different types.

    For example, if you're deserializing JSON data, a deserialized JSON object can be thought of as a struct: it has a set of values, each one possibly of a different type. To make Hive understand it, we can write a JsonStructObjectInspector that adapts a JSON Object to Hive's struct interface.
    Now, the other piece of the equation in generic UDFs is the DeferredObject.
    Since we'll be executing UDFs for every record (like in the SerDe), we want to be as efficient as possible. That means we don't want to deserialize data unless it's really needed, as we'd be creating unnecessary Java Objects and more work for the garbage collector.

    Hive solves this with DeferredObject, which holds something that may or may not be deserialized yet. Generic UDFs get their parameters passed as arrays of DeferredObject, together with a matched array of ObjectInspectors. ObjectInspectors are passed during initialization, while DeferredObjects are passed for each record. So the skeleton of our generic UDF would be something like:

    public class ArrayStructSortUDF extends GenericUDF {
        protected ObjectInspector[] argumentOIs;
    
         @Override 
        public ObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
                  argumentOIs = ois;
         }
    
        @Override
        public Object evaluate(DeferredObject[] dos) throws HiveException {
        ....
        }
    }

     

    As you see above, we store the ObjectInspectors we get from hive for use later. Each ObjectInspector holds the list of field types that we can expect when the function is called for each value and it’s up to us to decide what we can deal with and what we don’t like. We can also decide how many parameters we’d like. For instance,  we could rewrite the CONCAT UDF that takes an arbitrary number of parameters of any type  and returns a string with all those parameters converted to string and concatenated. You could not do that with a ‘plain’ UDF.

    Also, it can do short-circuit evaluation, that is, ignore parameters it doesn’t need saving the cost of deserializing them (like COALESCE that returns the first non-null parameter).

    Note that the initialize method returns an ObjectInspector: that's the ObjectInspector that can read the result, so the initialize method is also where you decide the type of your result. In the example that follows, the ObjectInspector of the result is the same as the one for the first argument, since we're returning the first argument sorted, but you can see a more sophisticated initialize method in the example here: http://www.baynote.com/2012/11/a-word-from-the-engineers/.

    Putting all together

    Here's the code, reproduced with permission from my employer OpenX:

    package com.congiu.udf;
    
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.serde.Constants;
    import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import static org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category.LIST;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    
    /**
     *
     * @author rcongiu
     */
    @Description(name = "array_struct_sort",
    	value = "_FUNC_(array(struct1,struct2,...), string myfield) - "
    	+ "returns the passed array struct, ordered by the given field  " ,
    	extended = "Example:\n"
    	+ "  > SELECT _FUNC_(str, 'myfield') FROM src LIMIT 1;\n"
    	+ " 'b' ")
    public class ArrayStructSortUDF extends GenericUDF {
        protected ObjectInspector[] argumentOIs;
    
        ListObjectInspector loi;
        StructObjectInspector elOi;
    
        // cache comparators for performance
        Map<String,Comparator> comparatorCache = new HashMap<String,Comparator>();
    
        @Override 
        public ObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
    	// all common initialization
    	argumentOIs = ois;
    
    	// clear comparator cache from previous invokations
    	comparatorCache.clear();
    
    	return checkAndReadObjectInspectors(ois);
        }
    
         /**
         * Utility method to check that an object inspector is of the correct type,
         * and returns its element object inspector
         * @param oi
         * @return
         * @throws UDFArgumentTypeException 
         */
        protected ListObjectInspector checkAndReadObjectInspectors(ObjectInspector[] ois) 
    	    throws UDFArgumentTypeException, UDFArgumentException {
    	// check number of arguments. We only accept two,
    	// the list of struct to sort and the name of the struct field
    	// to sort by
    	if(ois.length != 2 ) {
    	    throw new UDFArgumentException("2 arguments needed, found " + ois.length );
    	}
    
    	// first argument must be a list/array
    	if (! ois[0].getCategory().equals(LIST)) {
    		throw new UDFArgumentTypeException(0, "Argument 1"
    			+ " of function " + this.getClass().getCanonicalName() + " must be " + Constants.LIST_TYPE_NAME
    			+ ", but " + ois[0].getTypeName()
    			+ " was found.");
    	}
    
    	// a list/array is read by a LIST object inspector
    	loi = (ListObjectInspector) ois[0];
    
    	// a list has an element type associated to it
    	// elements must be structs for this UDF
    	if( loi.getListElementObjectInspector().getCategory() != ObjectInspector.Category.STRUCT) {
    		throw new UDFArgumentTypeException(0, "Argument 1"
    			+ " of function " +  this.getClass().getCanonicalName() + " must be an array of structs " +
    			" but is an array of " + loi.getListElementObjectInspector().getCategory().name());
    	}
    
    	// store the object inspector for the elements
    	elOi = (StructObjectInspector)loi.getListElementObjectInspector();
    
    	// returns the same object inspector
    	return	loi;
        }
    
        // to sort a list , we must supply our comparator
        public  class StructFieldComparator implements Comparator {
    	StructField field;
    
    	public StructFieldComparator(String fieldName) {
    	    field = elOi.getStructFieldRef(fieldName);
    	}
    
    	public int compare(Object o1, Object o2) {
    
    	    // ok..so both not null
    	    Object f1 =	elOi.getStructFieldData(o1, field);
    	    Object f2 = elOi.getStructFieldData(o2, field);
    	    // compare using hive's utility functions
    	    return ObjectInspectorUtils.compare(f1, field.getFieldObjectInspector(), 
    		    f2, field.getFieldObjectInspector());
    	}
        }
    
        // factory method for cached comparators
        Comparator getComparator(String field) {
    	if(!comparatorCache.containsKey(field)) {
    	    comparatorCache.put(field, new StructFieldComparator(field));
    	}
    	return comparatorCache.get(field);
        }
    
        @Override
        public Object evaluate(DeferredObject[] dos) throws HiveException {
    	// get list
    	if(dos==null || dos.length != 2) {
    	    throw new HiveException("received " + (dos == null? "null" :
    		    Integer.toString(dos.length) + " elements instead of 2"));
    	}
    
    	// each object is supposed to be a struct
    	// we make a shallow copy of the list. We don't want to sort 
    	// the list in place since the object could be used elsewhere in the
    	// hive query
    	ArrayList al = new ArrayList(loi.getList(dos[0].get()));
    
    	// sort with our comparator, then return
    	// note that we could get a different field to sort by for every
    	// invocation
    	Collections.sort(al, getComparator( (String) dos[1].get()) );
    
    	return al;
        }
    
        @Override
        public String getDisplayString(String[] children) {
    	return  (children == null? null : this.getClass().getCanonicalName() + "(" + children[0] + "," + children[1] + ")");
        }
    
    }

    Maven dependencies for this code would be :

        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.2</version>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>0.8.0</version>
          <scope>provided</scope>
        </dependency>
    A JSON read/write SerDe for Hive
    https://www.congiu.com/a-json-readwrite-serde-for-hive/
    Mon, 11 Jul 2011

    Today I finished coding another SerDe for Hive which, with my employer's permission, I published on github here: https://github.com/rcongiu/Hive-JSON-Serde.git.

    Since the code is still fresh in my mind, I thought I'd write another article on how to write a SerDe, since the official documentation on how to do it is scarce and you'd have to read the Hive code directly, like I had to do.

    How a SerDe is used in Hive

    First of all, let’s have a look at how a SerDe interacts with Hive.
    A SerDe is an interface composed of a Serializer and a Deserializer.
    Let’s see the methods required for both interfaces:

    public interface Serializer {
      void initialize(Configuration conf, Properties tbl) throws SerDeException;
      Class<? extends Writable> getSerializedClass();
      Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException;
    }
    
    public interface Deserializer {
      void initialize(Configuration conf, Properties tbl) throws SerDeException;
      Object deserialize(Writable blob) throws SerDeException;
      ObjectInspector getObjectInspector() throws SerDeException;
    }

    We can see that serialization for Hive means somehow turning an Object into a Writable, while deserialization does the inverse. One way to do that would be to have objects implement a certain interface, but the Hive designers chose another path: an ObjectInspector, an auxiliary object that can look at a Java object and make it digestible for Hive. An ObjectInspector does not carry any object information, so it can be cached for a certain class of objects. This is done for performance reasons: you create one ObjectInspector and reuse it for all the records in your query.

    The ObjectInspector

    As I just said, the ObjectInspector lets Hive look into a Java object and works as an adapter, adapting a Java Object to one of the 5 following abstractions, defined in the ObjectInspector interface:

    • PRIMITIVE
    • LIST
    • MAP
    • STRUCT
    • UNION

    Here’s the code for the ObjectInspector interface:

    package org.apache.hadoop.hive.serde2.objectinspector;
    /**
     * ObjectInspector helps us to look into the internal structure of a complex
     * object.
     *
     * A (probably configured) ObjectInspector instance stands for a specific type
     * and a specific way to store the data of that type in the memory.
     *
     * For native java Object, we can directly access the internal structure through
     * member fields and methods. ObjectInspector is a way to delegate that
     * functionality away from the Object, so that we have more control on the
     * behavior of those actions.
     *
     * An efficient implementation of ObjectInspector should rely on factory, so
     * that we can make sure the same ObjectInspector only has one instance. That
     * also makes sure hashCode() and equals() methods of java.lang.Object directly
     * works for ObjectInspector as well.
     */
    public interface ObjectInspector extends Cloneable {
    
      /**
       * Category.
       *
       */
      public static enum Category {
        PRIMITIVE, LIST, MAP, STRUCT, UNION
      };
    
      /**
       * Returns the name of the data type that is inspected by this
       * ObjectInspector. This is used to display the type information to the user.
       *
       * For primitive types, the type name is standardized. For other types, the
       * type name can be something like "list<int>", "map<int,string>", java class
       * names, or user-defined type names similar to typedef.
       */
      String getTypeName();
    
      /**
       * An ObjectInspector must inherit from one of the following interfaces if
       * getCategory() returns: PRIMITIVE: PrimitiveObjectInspector LIST:
       * ListObjectInspector MAP: MapObjectInspector STRUCT: StructObjectInspector.
       */
      Category getCategory();
    }

    We can see that the methods shared by all ObjectInspectors are basically just introspection: which Category the ObjectInspector belongs to (primitive, list, map, etc.) and its type name (for instance, map<int,string>).

    An alternative to an adapter would be to have the object we want to manipulate implement some interface, but that would not be as flexible, since we may not always have the luxury of modifying a preexisting class.

    The adapter approach works well in our case. The basic idea for the SerDe is to use the JSON library from json.org (http://json.org/java/). The library can read a line of text and parse it into a JSONObject. A Hive data row is a struct, since it's a collection of columns of different types, so we have to write a StructObjectInspector capable of reading a JSONObject. A JSONObject can also contain JSONArrays, which we'll map to Hive lists. For primitives, we can use the standard Hive object inspectors. Hive maps will also work with JSONObjects: JSON does not distinguish between structs and maps, since in a JSON object the values can be of different types, while in a Hive map they all have to be of the declared key and value types.
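
    To make the mapping concrete, here is a small standalone sketch (using the json.org classes the SerDe builds on) showing how one line of JSON decomposes into the pieces the different inspectors will see:

    import org.json.JSONArray;
    import org.json.JSONObject;

    public class JsonMappingExample {
        public static void main(String[] args) throws Exception {
            // One text line = one Hive row = one JSON object (a Hive STRUCT)
            JSONObject row = new JSONObject(
                "{\"name\":\"Roberto\",\"scores\":[1,2,3],\"attrs\":{\"k\":\"v\"}}");

            String name   = row.getString("name");      // primitive -> standard primitive inspectors
            JSONArray arr = row.getJSONArray("scores"); // JSONArray  -> Hive LIST
            JSONObject mp = row.getJSONObject("attrs"); // nested JSONObject -> Hive MAP (or STRUCT)

            System.out.println(name + " has " + arr.length() + " scores and "
                    + mp.length() + " attributes");
        }
    }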

    Let’s have a look at some ObjectInspectors:

      public interface PrimitiveObjectInspector 
        extends ObjectInspector {
    
      /**
       * The primitive types supported by Hive.
       */
      public static enum PrimitiveCategory {
        VOID, BOOLEAN, BYTE, SHORT, INT, LONG,
        FLOAT, DOUBLE, STRING, UNKNOWN
      };
    
      /**
       * Get the primitive category of
       * the PrimitiveObjectInspector.
       */
      PrimitiveCategory getPrimitiveCategory();
    
      /**
       * Get the Primitive Writable class
       * which is the return type of
       * getPrimitiveWritableObject() and 
       * copyToPrimitiveWritableObject().
       */
      Class<?> getPrimitiveWritableClass();
    
      /**
       * Return the data in an instance of
       * primitive writable Object. If the Object
       * is already a primitive writable 
       * Object, just return o.
       */
      Object getPrimitiveWritableObject(Object o);
    
      /**
       * Get the Java Primitive class
       * which is the return type of
       * getJavaPrimitiveObject().
       */
      Class<?> getJavaPrimitiveClass();
    
      /**
       * Get the Java Primitive object.
       */
      Object getPrimitiveJavaObject(Object o);
    
      /**
       * Get a copy of the Object in the same 
       * class, so the return value can be
       * stored independently of the parameter.
       * 
       * If the Object is a Primitive Java Object,
       * we just return the parameter
       * since Primitive Java Object is immutable.
       */
      Object copyObject(Object o);
    
      /**
       * Whether the ObjectInspector prefers
       * to return a Primitive Writable Object
       * instead of a Primitive Java Object. 
       * This can be useful for determining the
       * most efficient way to getting data out
       * of the Object.
       */
      boolean preferWritable();
    }

    We can see all the supported primitive types, as well as the methods that need to be implemented to retrieve the Java object containing the data.
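
    As a quick illustration, here is how one of the standard primitive inspectors that ship with Hive can be used to pull a plain Java value out of a field (a minimal sketch; in the SerDe these inspectors are handed to us by the factory):

    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

    public class PrimitiveInspectorExample {
        public static void main(String[] args) {
            // Standard, stateless inspector for java.lang.String values
            StringObjectInspector soi =
                PrimitiveObjectInspectorFactory.javaStringObjectInspector;

            Object data = "hello";                           // what deserialize() handed back
            String value = soi.getPrimitiveJavaObject(data); // the inspector extracts the Java value
            System.out.println(value + ", prefers Writable: " + soi.preferWritable());
        }
    }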

    The List Object Inspector has methods to get the nth element, to get the list length, and to retrieve the object collection:

    /**
     * ListObjectInspector.
     *
     */
    public interface ListObjectInspector extends ObjectInspector {
    
      // ** Methods that does not need a data object **
      ObjectInspector getListElementObjectInspector();
    
      // ** Methods that need a data object **
      /**
       * returns null for null list, out-of-the-range index.
       */
      Object getListElement(Object data, int index);
    
      /**
       * returns -1 for data = null.
       */
      int getListLength(Object data);
    
      /**
       * returns null for data = null.
       * 
       * Note: This method should not return a List object that is reused by the
       * same ListObjectInspector, because it's possible that the same
       * ListObjectInspector will be used in multiple places in the code.
       * 
       * However it's OK if the List object is part of the Object data.
       */
      List<?> getList(Object data);
    }

    Now, let’s look at the StructObjectInspector, that is used to model a Hive row.

    /**
     * StructObjectInspector.
     *
     */
    public abstract class StructObjectInspector implements ObjectInspector {
    
      // ** Methods that does not need a data object **
      /**
       * Returns all the fields.
       */
      public abstract List<? extends StructField> getAllStructFieldRefs();
    
      /**
       * Look up a field.
       */
      public abstract StructField getStructFieldRef(String fieldName);
    
      // ** Methods that need a data object **
      /**
       * returns null for data = null.
       */
      public abstract Object getStructFieldData(Object data, StructField fieldRef);
    
      /**
       * returns null for data = null.
       */
      public abstract List<Object> getStructFieldsDataAsList(Object data);
    
      @Override
      public String toString() {
        StringBuilder sb = new StringBuilder();
        List<? extends StructField> fields = getAllStructFieldRefs();
        sb.append(getClass().getName());
        sb.append("<");
        for (int i = 0; i < fields.size(); i++) {
          if (i > 0) {
            sb.append(",");
          }
          sb.append(fields.get(i).getFieldObjectInspector().toString());
        }
        sb.append(">");
        return sb.toString();
      }
    }
    
    /**
     * StructField is an empty interface.
     * 
     * Classes implementing this interface are considered to represent a field of a
     * struct for this serde package.
     */
    public interface StructField {
    
      /**
       * Get the name of the field. The name should be always in lower-case.
       */
      String getFieldName();
    
      /**
       * Get the ObjectInspector for the field.
       */
      ObjectInspector getFieldObjectInspector();
    
    }

    Now let’s write some code. For a struct, hive expects an object than can be cast to Array<Object>.
    Notice that we are extending from StandardStructObjectInspector, coding only the JSON library-specific parts of the interface, while reusing all the boilerplate code to manage initialization, reading the field list, mapping them to positions in an array, etc. (code here http://svn.apache.org/repos/asf/hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardStructObjectInspector.java)

    /**
     * This Object Inspector is used to look into a JSonObject object.
     * We couldn't use StandardStructObjectInspector since that expects 
     * something that can be cast to an Array<Object>.
     * @author rcongiu
     */
    public class JsonStructObjectInspector extends StandardStructObjectInspector {
    
        public JsonStructObjectInspector(List<String> structFieldNames, 
                List<ObjectInspector> structFieldObjectInspectors) {
           super(structFieldNames, structFieldObjectInspectors);
        }
    
        @Override
        public Object getStructFieldData(Object data, StructField fieldRef) {
        if (data == null) {
          return null;
        }
        JSONObject obj = (JSONObject) data;
        MyField f = (MyField) fieldRef; 
    
        int fieldID = f.getFieldID();
        assert (fieldID >= 0 && fieldID < fields.size());
    
        try {
            return obj.get(f.getFieldName());
        } catch (JSONException ex) {
            // if key does not exist
            return null; 
        }
      }
    
        static List<Object> values = new ArrayList<Object>();
        @Override
        public List<Object> getStructFieldsDataAsList(Object o) {
            JSONObject jObj = (JSONObject) o;
            values.clear();
    
            for(int i =0; i< fields.size(); i ++) {
                try {
                    values.add(jObj.get(fields.get(i).getFieldName()));
                     } catch (JSONException ex) {
                    // we're iterating through the keys so 
                    // this should never happen
                    throw new RuntimeException("Key not found");
                }
            }
    
            return values;
        }
    }

    Hive supports lists, which we can map to a JSONArray object. The ObjectInspector that lets Hive access it looks like this:

    public class JsonListObjectInspector extends StandardListObjectInspector {
        JsonListObjectInspector(ObjectInspector listElementObjectInspector) {
            super(listElementObjectInspector);
        }
    
         @Override
      public List<?> getList(Object data) {
        if (data == null) {
          return null;
        }
        JSONArray array = (JSONArray) data;
        return array.getAsArrayList();
      }
    
      @Override
      public Object getListElement(Object data, int index) {
        if (data == null) {
          return null;
        }
        JSONArray array = (JSONArray) data;
        try {
            return array.get(index);
        } catch(JSONException ex) {
            return null;
        }
      }
    
      @Override
      public int getListLength(Object data) {
        if (data == null) {
          return -1;
        }
        JSONArray array = (JSONArray) data;
        return array.length();
      }
    }

    You can clearly see how this is an Adapter Pattern.
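
    Here is a tiny sketch of the adapter in action (assuming the constructor is accessible; in the real project the inspectors come from JsonObjectInspectorFactory). Hive-side code only ever talks to the ListObjectInspector interface, while the data underneath stays a JSONArray:

    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.json.JSONArray;

    public class AdapterDemo {
        public static void main(String[] args) throws Exception {
            JSONArray data = new JSONArray("[10, 20, 30]");  // the underlying JSON data
            JsonListObjectInspector oi = new JsonListObjectInspector(
                    PrimitiveObjectInspectorFactory.javaIntObjectInspector);

            System.out.println(oi.getListLength(data));      // 3
            System.out.println(oi.getListElement(data, 1));  // 20, fetched through the adapter
        }
    }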

    To implement maps, we need to do some extra work, since JSONObject does not implement all the methods needed by a Map. The ObjectInspector has a getMap() method, which returns a Map; JSONObject has no such method, so we'd have to build the map on the spot. Instead, I preferred to write a Map adapter around JSONObject; the code for it is in the project repository, and a rough sketch follows the inspector code below.

    public class JsonMapObjectInspector extends StandardMapObjectInspector {
    
        public JsonMapObjectInspector(ObjectInspector mapKeyObjectInspector, 
                ObjectInspector mapValueObjectInspector) {
            super(mapKeyObjectInspector, mapValueObjectInspector);
        }
    
      @Override
      public Map<?, ?> getMap(Object data) {
        if (data == null) {
          return null;
        }
    
        JSONObject jObj = (JSONObject) data;
    
        return new JSONObjectMapAdapter(jObj);
      }
    
      @Override
      public int getMapSize(Object data) {
        if (data == null) {
          return -1;
        }
         JSONObject jObj = (JSONObject) data;
        return jObj.length();
      }
    
      @Override
      public Object getMapValueElement(Object data, Object key) {
        if (data == null) {
          return null; // a null map has no value elements
        }
    
         JSONObject jObj = (JSONObject) data;
            try {
                return jObj.get(key.toString());
            } catch (JSONException ex) {
                // key does not exists -> like null
                return null;
            }
      }   
    }
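
    The JSONObjectMapAdapter used above is not shown in this post; a minimal sketch of what such an adapter might look like (the real one in the repository may differ) is:

    import java.util.AbstractMap;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    import org.json.JSONException;
    import org.json.JSONObject;

    // Read-only Map view over a JSONObject, so callers of getMap() can iterate
    // keys and values without copying the data into a new collection.
    public class JSONObjectMapAdapter extends AbstractMap<Object, Object> {
        private final JSONObject obj;

        public JSONObjectMapAdapter(JSONObject obj) {
            this.obj = obj;
        }

        @Override
        public Set<Map.Entry<Object, Object>> entrySet() {
            Set<Map.Entry<Object, Object>> entries = new HashSet<Map.Entry<Object, Object>>();
            Iterator<?> keys = obj.keys();
            while (keys.hasNext()) {
                String key = (String) keys.next();
                try {
                    entries.add(new SimpleEntry<Object, Object>(key, obj.get(key)));
                } catch (JSONException ex) {
                    // the key came from keys(), so this should not happen
                    throw new RuntimeException(ex);
                }
            }
            return entries;
        }
    }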

    The SerDe

    Now that we have all the ObjectInspectors, we can write our SerDe.
    As we saw at the beginning there are three entry points: initialize, serialize, deserialize.

    Initialization

    During initialization, Hive gives the SerDe information about the table it's trying to access: for example, the number of columns and their types. The SerDe receives a Properties object and, as you can see below, the properties

    Constants.LIST_COLUMNS

    and

    Constants.LIST_COLUMN_TYPES

    contain a comma-separated list of column names and column types, respectively.

        List<String> columnNames;
        List<TypeInfo> columnTypes;
        StructTypeInfo rowTypeInfo;
        StructObjectInspector rowObjectInspector;
    
        @Override
        public void initialize(Configuration conf, Properties tbl) throws SerDeException {
            LOG.debug("Initializing SerDe");
            // Get column names and types
            String columnNameProperty = tbl.getProperty(Constants.LIST_COLUMNS);
            String columnTypeProperty = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
    
            LOG.debug("columns " + columnNameProperty + " types " + columnTypeProperty);
    
            // all table column names
            if (columnNameProperty.length() == 0) {
                columnNames = new ArrayList<String>();
            } else {
                columnNames = Arrays.asList(columnNameProperty.split(","));
            }
    
            // all column types
            if (columnTypeProperty.length() == 0) {
                columnTypes = new ArrayList<TypeInfo>();
            } else {
                columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
            }
            assert (columnNames.size() == columnTypes.size());

    Next, we build our ObjectInspector. To do that, we have to build a TypeInfo object through the appropriate factory. The TypeInfo object contains the list of fields, and their types. Since a Struct can contain other types, we can see in the StructTypeInfo code that it can traverse the struct and build TypeInfos for its fields as needed. The TypeInfo is also used to build the signature of the ObjectInspector and cache it.

            // Create row related objects
            rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNames, columnTypes);
            rowObjectInspector = (StructObjectInspector) JsonObjectInspectorFactory.getJsonObjectInspectorFromTypeInfo(rowTypeInfo);
    
        }

    At this point we've built the ObjectInspector, and we're ready to deserialize and serialize.

    Deserialization

    Deserialization is trivial: we get a Writable (a Text) containing the JSON data and have the JSON library parse it into a JSONObject. We then return that object; the ObjectInspector will tell Hive how to handle it:

     @Override
        public Object deserialize(Writable w) throws SerDeException {
            Text rowText = (Text) w;
    
            // Try parsing row into JSON object
            JSONObject jObj;
            try {
                jObj = new JSONObject(rowText.toString()) {
    
                    /**
                     * In Hive column names are case insensitive, so lower-case all
                     * field names.
                     *
                     * @see org.json.JSONObject#put(java.lang.String,
                     *      java.lang.Object)
                     */
                    @Override
                    public JSONObject put(String key, Object value)
                            throws JSONException {
                        return super.put(key.toLowerCase(), value);
                    }
                };
            } catch (JSONException e) {
                // If row is not a JSON object, make the whole row NULL
                LOG.error("Row is not a valid JSON Object - JSONException: "
                        + e.getMessage());
                throw new SerDeException(e);
            }
            return jObj;
        }

    So when Hive wants to access a field myfield, it will use JsonStructObjectInspector.getStructFieldData(), passing to it the object and myfield in the form of a StructField reference, which pairs a field name with its corresponding ObjectInspector. This wiring is done in JsonObjectInspectorFactory.getJsonObjectInspectorFromTypeInfo.
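
    In code, the lookup Hive performs looks roughly like this (a sketch of the caller side, not code from the SerDe; 'serde' is assumed to be an initialized JsonSerDe and 'line' one Text row):

    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.io.Writable;

    public class FieldAccessSketch {
        static Object readColumn(SerDe serde, Writable line, String column) throws Exception {
            StructObjectInspector rowOI = (StructObjectInspector) serde.getObjectInspector();
            Object row = serde.deserialize(line);                // a JSONObject in our case

            StructField field = rowOI.getStructFieldRef(column); // pairs the name with its inspector
            Object raw = rowOI.getStructFieldData(row, field);   // JsonStructObjectInspector: obj.get(name)

            ObjectInspector fieldOI = field.getFieldObjectInspector();
            // fieldOI now tells the caller how to interpret 'raw' (primitive, list, map or struct)
            return raw;
        }
    }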

    Serialization

    You may be expecting that serialization is also simple: we get a JSONObject and call its toString() method... not quite. Your serializer could be handed any kind of object, and the only assumption you can make is that its associated ObjectInspector will be some subclass of StructObjectInspector, since that's how a Hive table is modeled. To serialize the record we'll have to analyze the ObjectInspector, using it to introspect the given data object, building our JSONObject as we go.

    We can see how serialize() just calls serializeStruct() and turns the resulting JSONObject into a Text:

    @Override
        public Writable serialize(Object obj, ObjectInspector objInspector) 
                         throws SerDeException {
            // make sure it is a struct record
            if (objInspector.getCategory() != Category.STRUCT) {
                throw new SerDeException(getClass().toString()
                        + " can only serialize struct types, but we got: "
                        + objInspector.getTypeName());
            }
    
            JSONObject serializer =
                serializeStruct( obj, (StructObjectInspector) objInspector, columnNames);
    
            Text t = new Text(serializer.toString());
    
            return t;
        }

    You can see how that works here:

        private JSONObject serializeStruct( Object obj,
                StructObjectInspector soi, List<String> columnNames) {
            // do nothing for null struct
            if (null == obj) {
                return null;
            }
    
            JSONObject result = new JSONObject();
    
            List<? extends StructField> fields = soi.getAllStructFieldRefs();
    
            for (int i = 0; i < fields.size(); i++) {
                StructField sf = fields.get(i);
                Object data = soi.getStructFieldData(obj, sf);
    
                if (null != data) {
                    try {
                        // we want to serialize columns with their proper HIVE name,
                        // not the _col2 kind of name usually generated upstream
                        result.put((columnNames==null?sf.getFieldName():columnNames.get(i)),
                                serializeField(
                                    data,
                                    sf.getFieldObjectInspector()));
                    } catch (JSONException ex) {
                       LOG.warn("Problem serialzing", ex);
                       throw new RuntimeException(ex);
                    }
                }
            }
            return result;
        }

    It loops through all the struct fields and serializes them one by one. The heart of serialization is the serializeField() method, which uses the ObjectInspector to read the data object and maps it to calls on the JSONObject we're building:

        Object serializeField(Object obj,
                ObjectInspector oi ){
            if(obj == null) return null;
    
            Object result = null;
            switch(oi.getCategory()) {
                case PRIMITIVE:
                    PrimitiveObjectInspector poi = (PrimitiveObjectInspector)oi;
                    switch(poi.getPrimitiveCategory()) {
                        case VOID:
                            result = null;
                            break;
                        case BOOLEAN:
                            result = (((BooleanObjectInspector)poi).get(obj)?
                                                Boolean.TRUE:
                                                Boolean.FALSE);
                            break;
                        case BYTE:
                            result = (((ByteObjectInspector)poi).get(obj));
                            break;
                        case DOUBLE:
                            result = (((DoubleObjectInspector)poi).get(obj));
                            break;
                        case FLOAT:
                            result = (((FloatObjectInspector)poi).get(obj));
                            break;
                        case INT:
                            result = (((IntObjectInspector)poi).get(obj));
                            break;
                        case LONG:
                            result = (((LongObjectInspector)poi).get(obj));
                            break;
                        case SHORT:
                            result = (((ShortObjectInspector)poi).get(obj));
                            break;
                        case STRING:
                            result = (((StringObjectInspector)poi).getPrimitiveJavaObject(obj));
                            break;
                        case UNKNOWN:
                            throw new RuntimeException("Unknown primitive");
                    }
                    break;
                case MAP:
                    result = serializeMap(obj, (MapObjectInspector) oi);
                    break;
                case LIST:
                    result = serializeArray(obj, (ListObjectInspector)oi);
                    break;
                case STRUCT:
                    result = serializeStruct(obj, (StructObjectInspector)oi, null);
                    break;
            }
            return result;
        }

    Notice how we reuse the standard ObjectInspectors for primitives, while we have our own code for lists, maps and structs. Since we descend the object structure as we go, we can handle arbitrarily deep nesting in the JSON data, both when reading it and when writing it.
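
    serializeMap() and serializeArray() are not shown above; a rough sketch of what they do (the versions in the repository may differ in the details) would be two more methods of the SerDe class, alongside serializeField(), with imports for JSONArray, JSONObject, JSONException, Map and the inspector interfaces assumed to be in place:

        // Sketch: walk a Hive list through its inspector and build a JSONArray as we go.
        private JSONArray serializeArray(Object obj, ListObjectInspector loi) {
            if (obj == null) return null;
            JSONArray result = new JSONArray();
            for (int i = 0; i < loi.getListLength(obj); i++) {
                result.put(serializeField(loi.getListElement(obj, i),
                                          loi.getListElementObjectInspector()));
            }
            return result;
        }

        // Sketch: walk a Hive map through its inspector and build a JSONObject as we go.
        private JSONObject serializeMap(Object obj, MapObjectInspector moi) {
            if (obj == null) return null;
            JSONObject result = new JSONObject();
            for (Map.Entry<?, ?> entry : moi.getMap(obj).entrySet()) {
                try {
                    // keys become JSON strings; values recurse through serializeField()
                    result.put(String.valueOf(serializeField(entry.getKey(),
                                       moi.getMapKeyObjectInspector())),
                               serializeField(entry.getValue(),
                                       moi.getMapValueObjectInspector()));
                } catch (JSONException ex) {
                    throw new RuntimeException(ex);
                }
            }
            return result;
        }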

    Usage

    Using the SerDe is simple: just add the jar to your Hive classpath (or register it for your queries) and do something like this:

    CREATE TABLE json_test (
               name string,
               favorite_foods array<string>,
               subject_grade map<string,string>
    ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS SEQUENCEFILE;

    This will be able to read json data like

    { "name":"Roberto", "favorite_foods":["Sushi","Pizza"], "subject_grade":{"math":good","histori":"so so"}}

    using a query like

    SELECT name, favorite_foods[0], subject_grade['math'] FROM json_test;

    Source Code

    available here: https://github.com/rcongiu/Hive-JSON-Serde/
    Pre-built binary available in the Downloads section.

    ]]>
    https://www.congiu.com/a-json-readwrite-serde-for-hive/feed/ 1
    Writing a Hive SerDe for LWES event files https://www.congiu.com/writing-a-hive-serde-for-lwes-event-files/ https://www.congiu.com/writing-a-hive-serde-for-lwes-event-files/#respond Tue, 27 Oct 2009 21:52:12 +0000 http://www.congiu.com/?p=11 I am currently working to set up an OLAP data warehouse using Hive on top of Hadoop. We have a considerable amount of data that comes from the ad servers on which we need to perform various kinds of analysis.

    Writing a map-reduce job is not difficult in principle – it's just time consuming and requires the skills of a trained Java engineer, which wouldn't be needed were we using SQL. That's where Hive comes in: it allows us to query a Hadoop data store using a flavor of SQL.

    Hive stores data using common Hadoop formats: text and sequence files.
    Unfortunately, our data is in neither of those formats, and it would be impractical to reformat it. Fortunately, Hive is very extensible: you can plug in your own file format, and even your own serialization/deserialization. What's the difference between the two?

    • File format: determines how the data is stored, from the file layout down to the parsing of single key/value pairs: compression, binary format, splits, etc.
    • SerDe: determines how a key/value pair is mapped to the set of columns of a Hive table

    At the time I wrote this post, the Hive wiki didn't mention that you can actually specify your own file format. If you dig into the code and look at the query language grammar (ql/src/java/org/apache/hadoop/hive/ql/Hive.g) you'll see that you can actually pick your input and output format:

         KW_STORED KW_AS KW_SEQUENCEFILE  -> TOK_TBLSEQUENCEFILE
          | KW_STORED KW_AS KW_TEXTFILE  -> TOK_TBLTEXTFILE
          | KW_STORED KW_AS KW_RCFILE  -> TOK_TBLRCFILE
          | KW_STORED KW_AS KW_INPUTFORMAT inFmt=StringLiteral KW_OUTPUTFORMAT outFmt=StringLiteral
    

    So in the CREATE TABLE you could specify STORED AS INPUTFORMAT 'com.yoursite.YourInputFormatClass' OUTPUTFORMAT 'com.yoursite.YourOutputFormatClass'.

    Writing your own input/output format is not trivial, since you'll have to take care of how to create the splits from the input. Having splits of the right size is very important for Hadoop performance, so you should write your own splitting logic only if you know what you're doing.

    Should you decide to write your own input format, how do you plug it in? Hive will add to its classpath all the files in a directory specified by either the HIVE_AUX_JARS_PATH environment variable or the --auxpath command line option of the hive shell. This is also how you add a SerDe or a UDF.

    In this post I’ll just cover the writing of a SerDe. I will cover the InputFormat in another post.

    SerDe tells Hive how to map the column names in the table to values within the Writable object. Note that the Writable Object may be an arbitrarily complex object hierarchy.

    To handle introspection efficiently, Hive implements its own mechanism, optimized for high-performance repetitive tasks. It uses ObjectInspectors that keep no state and are immutable, so we only need one per type.
    Types inherit from org.apache.hadoop.hive.serde2.typeinfo.TypeInfo and can belong to one of these Categories:

    public interface ObjectInspector {
      public static enum Category {
        PRIMITIVE, LIST, MAP, STRUCT
      };
    

    Each one of these 4 types has its own interface that extends ObjectInspector:

    • PrimitiveObjectInspector
    • ListObjectInspector
    • MapObjectInspector
    • StructObjectInspector

     

    Since our file format is basically a Map, I tried to implement a SerDe with a MapObjectInspector that reads into our format. Unfortunately, it didn’t work. It looks like in the execution, Hive expects to have a Struct, not a map. So your table record HAS TO be a struct, with the other categories being used just for columns/attributes.
    From a theoretical point of view, it makes sense. When we issue a SELECT statement, we’re expecting a result that it’s a set of values that are of different type…which is the definition of a struct. For that we cannot use a primitive nor lists nor maps since they are homogenous collections (List, Map).

    So, how does the SerDe work?
    The SerDe interface is simply the union of two interfaces, Serializer and Deserializer. Since Hadoop expects a Writable, we serialize to and deserialize from Writable.

    public interface Serializer {
      public void initialize(Configuration conf, Properties tbl) throws SerDeException;
      public Class<? extends Writable> getSerializedClass();
      public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException;
    }
    
    public interface Deserializer {
      public void initialize(Configuration conf, Properties tbl) throws SerDeException;
      public Object deserialize(Writable blob) throws SerDeException;
      public ObjectInspector getObjectInspector() throws SerDeException;
    }
    

    To do serialization/deserialization we need to know how to map the columns to whatever the internal structure of our object is, and possibly perform type conversions. Since the serialize/deserialize methods are called for every record, we should optimize this mapping as much as possible, pre-building it into efficient data structures when the SerDe is initialized. This can be done in the initialize() method of the SerDe.

    In my case, I use these four attributes:

    • List<String> columnNames;
    • List<TypeInfo> columnTypes;
    • TypeInfo rowTypeInfo;
    • ObjectInspector rowObjectInspector;

    As seen below, we retrieve the list of columns and types from the table properties, and we build the TypeInfo (Hive's way to categorize the columns) using utility methods. In particular, we create the row type info as a struct type info, with the given list of column names and types ( rowTypeInfo = TypeInfoFactory.getStructTypeInfo(columnNames, columnTypes) ).

         // Get column names and sort order
            String columnNameProperty = tbl.getProperty("columns");
            String columnTypeProperty = tbl.getProperty("columns.types");
            if (columnNameProperty.length() == 0) {
                columnNames = new ArrayList();
            } else {
                columnNames = Arrays.asList(columnNameProperty.split(","));
            }
            if (columnTypeProperty.length() == 0) {
                columnTypes = new ArrayList();
            } else {
                columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
            }
            assert (columnNames.size() == columnTypes.size());
    
            if (tbl.containsKey("lwes.event_name")) {
                allEventName = tbl.getProperty("lwes.event_name");
            }
    
            // Create row related objects
            rowTypeInfo = TypeInfoFactory.getStructTypeInfo(columnNames, columnTypes);
            rowObjectInspector = (StructObjectInspector) TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(rowTypeInfo);
            row = new ArrayList(columnNames.size());
    
            for (int i = 0; i < columnNames.size(); i++) {
                row.add(null);
            }
    
            // Get the sort order
            String columnSortOrder = tbl.getProperty(Constants.SERIALIZATION_SORT_ORDER);
            columnSortOrderIsDesc = new boolean[columnNames.size()];
            for (int i = 0; i < columnSortOrderIsDesc.length; i++) {
                columnSortOrderIsDesc[i] = (columnSortOrder != null && columnSortOrder.charAt(i) == '-');
            }
    

    So, above we retrieved the Hive columns and defined what our rows are. Now we need to define how we want to map those columns to our data. In my case, I had to map LWES events (http://www.lwes.org). LWES is an event system that produces event files that are basically key/value pairs; it's a binary format optimized for size, and it's free and open source. Every event has a type and its own set of possible keys. My first idea was to just map Hive column names to LWES keys, but the former are case insensitive while the latter are not. So we want a natural mapping for the columns that are lowercase, and we can then use SerDe properties to manually map the ones that are not. Also, we may want multiple event types in a table, or just one event type. In the latter case we also want to specify the single event type we care about; that happened above in the statement allEventName = tbl.getProperty("lwes.event_name"). But let's see how to specify the mapping:

    CREATE TABLE blah (
         ...
        u_header_ua string,
        u_ox_url string,
         ...
     )
    PARTITIONED BY(dt STRING)
     ROW FORMAT SERDE 'org.openx.data.hive.journalserde.EventSerDe'
    WITH SERDEPROPERTIES (
            'lwes.event_name'='My::Event',
            'sender_ip'='SenderIP',
            'sender_port'='SenderPort',
            'receipt_time'='ReceiptTime',
            'site_id'='SiteID')
    

    So, u_header_ua will be automatically mapped to the event key with the same name, while sender_ip will be mapped to the key SenderIP. We want to store these mappings in a hash to be used by both the Serializer and the Deserializer. Since we have to put the data into a row record defined as an ArrayList, we need to map each event key to a position in that list. We'll store the mappings in a Map<String, List<FieldAndPosition>> fieldsForEventName = new HashMap<String, List<FieldAndPosition>>(). FieldAndPosition is a simple data structure that stores the LWES key along with its position in the ArrayList we need to return; a small sketch of it follows.
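
    FieldAndPosition itself is not listed in this post; it is essentially a small value class along these lines (a sketch, the real one is in the repository):

        // Pairs an LWES field name with the index of the Hive column it feeds.
        public class FieldAndPosition {
            private final String field;
            private final int position;

            public FieldAndPosition(String field, int position) {
                this.field = field;
                this.position = position;
            }

            public String getField()  { return field; }
            public int getPosition()  { return position; }
        }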

    Let’s see how it works:

         // take each hive column and find what it maps to into the event list
            int colNumber = 0;
            for (String columnName : columnNames) {
                String fieldName;
                String eventName;
                // column not defined in SerDe properties and no event specified.
                if (!tbl.containsKey(columnName) && allEventName == null) {
                    LOG.debug("Column " + columnName + 
                            " is not mapped to an eventName:field through SerDe Properties");
                    continue;
                } else if (allEventName != null) {
                    // no key, but in a single-type event file specified in lwes.event_name
                    eventName = allEventName;
                    fieldName = columnName;
                } else {
                    // we found a special mapping
                    String fullEventField = tbl.getProperty(columnName);
                    String[] parts = fullEventField.split("::");
    
                    // we are renaming the column
                    if (parts.length < 1 || (parts.length == 1 && allEventName != null)) {
                        System.out.println("Malformed EventName::Field " + fullEventField);
                        continue;
                    } else if (parts.length == 1 && allEventName != null) {
                        // adds the name. We're not specifying the event.
                        fieldName = parts[0];
                        eventName = allEventName;
                    } else {
                        fieldName = parts[parts.length - 1];
                        eventName = fullEventField.substring(0, fullEventField.length() - 2 - fieldName.length());
                    }
                    LOG.debug("Mapping " + columnName + " to EventName " + eventName +
                            ", field " + fieldName);
                }
    
                if (!fieldsForEventName.containsKey(eventName)) {
                    fieldsForEventName.put(eventName, new LinkedList());
                }
    
                fieldsForEventName.get(eventName).add(new FieldAndPosition(fieldName, colNumber));
    
                colNumber++;
            }
    

    What the deserialize method does is return the row, declared as

        ArrayList row;
    

    Let’s see how we do it, using the data structures we’ve built so far:

        @Override
        public Object deserialize(Writable w) throws SerDeException {
            LOG.debug("JournalSerDe::deserialize");
    
            if (w instanceof EventListWritable) {
                EventListWritable ew = (EventListWritable) w;
                for (Event ev : ew.getEvents()) {
                    deserializeEvent(ev);
                }
            } else if (w instanceof EventWritable) {
                EventWritable ew = (EventWritable) w;
                Event ev = ew.getEvent();
                deserializeEvent(ev);
            } else {
                throw new SerDeException("I don't know how to deserialize " + w.getClass().getName());
            }
            return row;
        }
    

    Here are the gory details of deserializing an event, including type conversion:

        public void deserializeEvent(Event ev) throws SerDeException {
            if (this.fieldsForEventName.containsKey(ev.getEventName())) {
    
                for (FieldAndPosition fp : fieldsForEventName.get(ev.getEventName())) {
                    
                    TypeInfo type = columnTypes.get(fp.getPosition());
                    
                    LOG.debug("Deserializing " + columnNames.get(fp.getPosition()));
    
                    try {
                        if (type.getTypeName().equals(Constants.STRING_TYPE_NAME)) {              
                            if ( ev.get(fp.getField()) != null )
                                row.set(fp.getPosition(), ev.get(fp.getField()).toString());
                            else
                                row.set(fp.getPosition(), null);
                        } else if (type.getTypeName().equals(Constants.INT_TYPE_NAME)) {
                           row.set(fp.getPosition(), ev.getInt32(fp.getField()));
                        } else if (type.getTypeName().equals(Constants.SMALLINT_TYPE_NAME) ||
                                   type.getTypeName().equals(Constants.TINYINT_TYPE_NAME)) {
                           row.set(fp.getPosition(), new Short(ev.getInt16(fp.getField())));
                        } else if (type.getTypeName().equals(Constants.BIGINT_TYPE_NAME)) {
                            row.set(fp.getPosition(), ev.getInt64(fp.getField()));
                        } else if (type.getTypeName().equals(Constants.BOOLEAN_TYPE_NAME)) {
                            row.set(fp.getPosition(), ev.getBoolean(fp.getField()));
                        } else if (type.getTypeName().equals(Constants.DATETIME_TYPE_NAME)) {
                            row.set(fp.getPosition(), ev.get(fp.getField()));
                        } else if (type.getTypeName().equals(Constants.DATE_TYPE_NAME)) {
                            row.set(fp.getPosition(), ev.get(fp.getField()));
                        } else if (type.getTypeName().equals(Constants.FLOAT_TYPE_NAME)) {
                            throw new SerDeException("Float not supported");
                        } else if (type.getTypeName().equals(Constants.DOUBLE_TYPE_NAME)) {
                            throw new SerDeException("Double not supported");
                        } else if (type.getTypeName().equals(Constants.TIMESTAMP_TYPE_NAME)) {
                            row.set(fp.getPosition(), ev.get(fp.getField()));
                        } 
                    } catch (NoSuchAttributeException ex) {
                        LOG.error("No such attribute " + fp.getField() +
                                " in event " + ev.getEventName() +
                                " for column " + columnNames.get(fp.getPosition()));
                    } catch (AttributeNotSetException ex) {
                        row.set(fp.getPosition(), null);
                        LOG.debug("Not set:  attribute " + fp.getField() +
                                " in event " + ev.getEventName() +
                                " for column " + columnNames.get(fp.getPosition()));
                    } catch (Exception ex) {
                        LOG.error("Exception " + ex + " processing " + fp.getField() +
                                " in event " + ev.getEventName() +
                                " for column " + columnNames.get(fp.getPosition()));
                    }
                }
            }
        }
    

    We also need some utility methods:

        @Override
        public ObjectInspector getObjectInspector() throws SerDeException {
            LOG.debug("JournalSerDe::getObjectInspector()");
            return rowObjectInspector;
        }
    
        @Override
        public Class<? extends Writable> getSerializedClass() {
            LOG.debug("JournalSerDe::getSerializedClass()");
            return EventWritable.class;
        }
    

    If you’re interested to the complete code, it is available on the lwes.org subversion repository here: https://lwes.svn.sourceforge.net/svnroot/lwes/contrib/lwes-hive-serde/trunk/.

    ]]>
    https://www.congiu.com/writing-a-hive-serde-for-lwes-event-files/feed/ 0
    Data Warehousing Books https://www.congiu.com/data-warehousing-books/ https://www.congiu.com/data-warehousing-books/#respond Tue, 27 Oct 2009 21:51:11 +0000 http://www.congiu.com/?p=9 With the constantly increasing quantity of data that companies collect and need to process, data warehousing is a job sector that's expanding even in the recession. It is also living a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity, in a manner similar to Linux 10 years ago. One of these technologies is Hadoop, a distributed filesystem and data processing framework based on Google's Map/Reduce paper. Hadoop powers Yahoo! Search, Facebook and many other sites' data warehouses. If you're thinking about learning more about data warehousing, I have two books to recommend. The first one covers the basic concepts and terminology of data warehousing; the second covers the new kid on the block, Hadoop.

    The Data Warehouse ETL Toolkit

    This book covers the general concepts and the terminology you need to know – there's no code, nothing specific to any system. It assumes you're using some kind of relational database and some kind of tool to do your ETL (Extract, Transform, Load). It walks you through all the processes your data may need to go through, as well as the problems you may encounter.

    Hadoop: the definitive guide

    This is an excellent book that provides an in-depth explanation of Hadoop, its concepts, and its internals. It has a lot of material and will be useful to the novice as well as to the expert. It covers all the internals of Hadoop – input formats, compression, splits – as well as the more mundane and practical aspects like installation, administration and monitoring. It doesn't cover much of the other tools that are based on Hadoop (Hive, HBase), but it does give you an idea of how they relate to each other.

    ]]>
    https://www.congiu.com/data-warehousing-books/feed/ 0
    Joins in Hadoop using CompositeInputFormat https://www.congiu.com/joins-in-hadoop-using-compositeinputformat/ https://www.congiu.com/joins-in-hadoop-using-compositeinputformat/#respond Sun, 07 Jun 2009 21:49:07 +0000 http://www.congiu.com/?p=6 One of the first questions that a ‘traditional’ ETL engineer asks when learning hadoop is, “How do I do a join ?”

    For instance, how can we do in hadoop something like querying for the names of all employees who are in a California city:

    SELECT e.name, c.name from employees e INNER JOIN cities c
        on e.city_id = c.id AND c.state ='CA'
    

    If one dataset is big and the other is small enough that you're sure it will fit into memory, you can take advantage of the Distributed Cache feature, which can copy a BerkeleyDB or any other file-based hash to every node, so you can use it for your join in either the mapper or the reducer (wherever it makes sense).
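
    As a rough sketch of that approach (old mapred API, using a plain tab-separated lookup file instead of a BerkeleyDB; the file layout and class names are made up for the example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Map-side join: the small 'cities' file is shipped to every node through the
    // DistributedCache, loaded into a HashMap in configure(), and each employee
    // record is joined against it in map(). On the driver side:
    //   DistributedCache.addCacheFile(new java.net.URI("/data/cities.txt"), jobConf);
    public class CacheJoinMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> caCities = new HashMap<String, String>();

        @Override
        public void configure(JobConf job) {
            try {
                Path[] cached = DistributedCache.getLocalCacheFiles(job);
                BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");           // id <TAB> name <TAB> state
                    if (parts.length >= 3 && "CA".equals(parts[2])) {
                        caCities.put(parts[0], parts[1]);        // keep only California cities
                    }
                }
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("Could not load the cached cities file", e);
            }
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] parts = value.toString().split("\t");       // name <TAB> city_id
            String cityName = caCities.get(parts[1]);
            if (cityName != null) {                              // inner join + state filter
                out.collect(new Text(parts[0]), new Text(cityName));
            }
        }
    }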

    The method I discuss here is to use CompositeInputFormat which you can use if:

    • all of the files you want to join are sorted
    • all of them have the same joining key
    • the files are too big to do the join using DistributedCache

    The way it works is similar in principle to mergesort: if you have n files sorted by their join key, you can combine them easily, reading the records one by one from each file so that you are always reading the records with the same key together.

    This is how CompositeInputFormat works. It will read your files and deliver a TupleWritable object to the mapper.

    To configure the job, besides setting CompositeInputFormat as the input format, you have to specify the join expression. You can specify outer or inner; I haven't seen a way to do a left or right outer join.

    jobConf.setInputFormat(CompositeInputFormat.class);
    jobConf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "outer", KeyValueTextInputFormat.class,
                    FileInputFormat.getInputPaths(jobConf)));
    

    You also have to specify the actual format of the files you're reading; that's why you see KeyValueTextInputFormat.class in the code above.
    CompositeInputFormat will delegate picking the join key to that class. In this example, it will take the first column of a tab-separated file, which is the default behavior for KeyValueTextInputFormat.

    The cool thing about CompositeInputFormat is that you can pick the input format used to read each of the files you're joining.

    What next? If you look at the example in the Hadoop distribution, src/examples/org/apache/hadoop/examples/Join.java, you see that IdentityMapper and IdentityReducer are used to just output the joined record.
    What if you need to manipulate it, though?
    For example, a common task in ETL is to periodically dump data from a source and build a file containing only the changes, in order to keep another, possibly remote, system in sync while minimizing bandwidth usage.
    Assuming we have an 'old' and a 'new' file, we need to transmit:
    – the records that are in ‘new’ and not in ‘old’
    – the records that are both in ‘new’ and ‘old’, and whose content is different
    – the records that are in ‘old’ but not in ‘new’, marked for deletion.

    Let’s assume our final format is :

    key1 val1
    key2 val2 
    key3 -
    

    where the first 2 records are added/updated, the third removed.

    The map phase takes care of joining 'old' and 'new', so we can just use the IdentityMapper.
    The reducer will receive a (Text, Iterator<TupleWritable>). There should be just one value per key; if not, it probably means the input files are not actually sorted.

            @Override
            public void reduce(Text key, Iterator<TupleWritable> values,
                    OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                // should be just one value
                int count = 0;
                while(values.hasNext()) {
                    TupleWritable val = values.next();
    
                    if(count>0)
                        throw new IOException("Can't have 2 tuples for same key, there must be something wring in the join");
    
                    boolean hasOld = val.has(0);
                    boolean hasNew = val.has(1);
    
                    if(hasNew && !hasOld ) {
                        // new record
                        output.collect(key,(Text)val.get(1));
                    } else if( hasNew && hasOld ) {
                        if(! val.get(0).equals(val.get(1))) {
    		// modified record
                            output.collect(key,(Text)val.get(1));
                        }
                    } else if(hasOld && ! hasNew) {
                        Text u = new Text("-" + ((Text)val.get(0)).toString());
    	        // remove
                        output.collect(key, u);
                    }
                    count++;
                }
            }
    

    TupleWritable keeps the records from the files it’s joining in the same order. Since we’re doing an outer join, we’re using tuple.has() to see if the nth component of the tuple is actually present, and tuple.get() to actually get it.
    Don’t forget to set the appropriate classes for output:

            jobConf.setOutputFormat(TextOutputFormat.class);
            jobConf.setOutputValueClass(TupleWritable.class);
    

    This is not the most efficient way; it's just an example. It would actually be more efficient to do the checks in the mapper, with no reducer at all: that way we save the shuffle/sort needed by the reduce phase.
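
    For reference, a map-only version could look roughly like this (a sketch under the same assumptions as the reducer above: tuple position 0 is 'old', position 1 is 'new'; imports mirror the reducer's):

            // Map-only variant: CompositeInputFormat already delivers one TupleWritable per key,
            // so the diff logic can run in the mapper and the reduce phase can be skipped
            // by calling jobConf.setNumReduceTasks(0).
            public static class DiffMapper extends MapReduceBase
                    implements Mapper<Text, TupleWritable, Text, Text> {

                @Override
                public void map(Text key, TupleWritable val,
                        OutputCollector<Text, Text> output, Reporter reporter)
                        throws IOException {
                    boolean hasOld = val.has(0);
                    boolean hasNew = val.has(1);

                    if (hasNew && !hasOld) {
                        // new record
                        output.collect(key, (Text) val.get(1));
                    } else if (hasNew && hasOld) {
                        if (!val.get(0).equals(val.get(1))) {
                            // modified record
                            output.collect(key, (Text) val.get(1));
                        }
                    } else if (hasOld && !hasNew) {
                        // removed record
                        output.collect(key, new Text("-" + ((Text) val.get(0)).toString()));
                    }
                }
            }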

    ]]>
    https://www.congiu.com/joins-in-hadoop-using-compositeinputformat/feed/ 0