Part 2 - Snowplow Stream Enrichment

In the previous post, we discussed how to set up the Scala Stream Collector to collect analytics events secured through HTTPS and publish them to a Kinesis Stream.

In this post we're going to jump into setting up Snowplow Stream Enrichment through Terraform.

The Stream Enrich process will allow you to take the raw events coming through your Snowplow Collector and add useful metadata and structure to them. This'll help to segment your audience when you're analyzing the data in the future!

It may be helpful to check out the GitHub repo associated with this series while following along! If you're just starting with this chapter, it may also be helpful to check out the rest of the series.



Directory Structure

If you've followed along with the previous post, you're already on your way to having your Snowplow Streaming Enrich setup complete. We're going to follow a lot of the same structure for this part of the process.

At the end of this post, the final directory structure will look something like this:

- iam-users
- kinesis-stream
- snowplow-collector
- snowplow-enrich
    - ami.tf
    - config.hocon.tpl
    - main.tf
    - resolver.js.tpl
    - security_groups.tf
    - variables.tf
- main.tf
...

Let's jump into what each of these files does! They'll all look pretty similar to their counterparts in our snowplow-collector module.

ami.tf

We're going to be using Ubuntu for our enrichment host. This data source will go and find the latest Ubuntu AMI for your respective region.
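
As a sketch, the data source looks something like this. The Ubuntu release in the name filter is an assumption on my part, so adjust it to whichever LTS you'd like to run:

```hcl
# snowplow-enrich/ami.tf (sketch)
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```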

Nothing too special about this file - let's move on to the next one!

security_groups.tf

Notice that we're only opening up port 22 in our inbound rules for this enrichment process. The reason is that, unlike the collector, we don't need to accept any HTTP connections here.

We also open up outbound connections so that the server can download the necessary JAR files from Snowplow's servers and talk to the Kinesis API.
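
A minimal sketch of that security group, assuming the resource and variable names used in this series (the repo's names may differ):

```hcl
# snowplow-enrich/security_groups.tf (sketch)
resource "aws_security_group" "snowplow_enrich" {
  name        = "snowplow-enrich"
  description = "Security group for the Snowplow Stream Enrich host"

  # SSH only, and only from our own machine
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["${var.machine_ip}/32"]
  }

  # Allow all outbound traffic so the host can download the enrich JAR
  # and reach the Kinesis API
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```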

variables.tf

This file is best explained by talking about what each of the variables represents (there's a sketch of the declarations after this list):

  • stream_in is the Kinesis stream of raw events coming from our Snowplow Collector. This will be the same as the good output stream that we set up in our collector.

  • good_stream_out is the Kinesis stream where we'll push events that were handled by our enrichment correctly. These are the events that were in an expected format and not too large to process.

  • bad_stream_out is the Kinesis stream where we'll push events that were either in an unreadable format or were too large to process effectively. Administrators can use this to keep an eye on their incoming events and make sure that nothing unexpected is happening.

  • enrich_version allows us to select which version of Snowplow Enrich we'd like to install.

  • machine_ip is the IP of your current machine. We're using this so that we can lock down SSH to be inaccessible to the outside world. It may be a little overkill, but it'll secure our user's data just a bit more.

  • key_pair_name is the name of the SSH key pair that we'd like to secure the server with.

  • key_pair_loc is the location of the key pair (PEM file) on our hard drive.

  • aws_region is the region in which the Kinesis Streams and Snowplow Enrichment Process are hosted.

  • operator_access_key and operator_secret_key are the credentials the enrichment process will use to communicate with the AWS Kinesis API.
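
Assuming those names, the declarations themselves are straightforward. The descriptions below are my own summaries rather than the repo's exact wording:

```hcl
# snowplow-enrich/variables.tf (sketch)
variable "stream_in"           { description = "Kinesis stream of raw events from the collector" }
variable "good_stream_out"     { description = "Kinesis stream for successfully enriched events" }
variable "bad_stream_out"      { description = "Kinesis stream for events that failed enrichment" }
variable "enrich_version"      { description = "Version of Snowplow Stream Enrich to install" }
variable "machine_ip"          { description = "IP address allowed to SSH into the host" }
variable "key_pair_name"       { description = "Name of the SSH key pair" }
variable "key_pair_loc"        { description = "Path to the key pair (PEM file) on disk" }
variable "aws_region"          { description = "Region hosting the streams and the enrich process" }
variable "operator_access_key" { description = "AWS access key used to talk to Kinesis" }
variable "operator_secret_key" { description = "AWS secret key used to talk to Kinesis" }
```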

Enrichment Config files

This is where a majority of the customization will take place for our enrichment process. We can use this to add custom resolvers to our enrichment process and to write to different types of streams.

We're not going to go into that depth in this post, but this will be a good foundation for further customization!

config.hocon.tpl

We're using this as a Terraform template file to fill in the details of the streams that we've set up. It specifies the input stream, the output streams, and the access keys that the enrichment process should use to access them.
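
To give a feel for the shape of this template, here's an abbreviated sketch. The ${...} placeholders are Terraform template variables whose names I've assumed, and the HOCON key layout varies between Stream Enrich releases, so treat the sample config that ships with your chosen enrich_version as the source of truth:

```hocon
# Abbreviated sketch of config.hocon.tpl -- key names vary by enrich_version
enrich {
  streams {
    in {
      # Raw events written by the Scala Stream Collector
      raw = "${stream_in}"
    }

    out {
      # Successfully enriched events
      enriched = "${good_stream_out}"
      # Events that failed validation or were too large
      bad = "${bad_stream_out}"
    }

    sourceSink {
      enabled = kinesis
      region  = "${aws_region}"

      aws {
        accessKey = "${access_key}"
        secretKey = "${secret_key}"
      }
    }
  }
}
```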

resolver.js.tpl

This'll set us up to use the official Iglu Central repository to resolve JSON schemas. This file can be modified to add your own JSON schema repositories for your enrichment process.
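
For reference, a resolver that points at Iglu Central generally looks like the following. The repo's template may differ slightly (and may include Terraform placeholders), but this is the standard shape of an Iglu resolver config:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}
```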

snowplow-enrich/main.tf

Here's where our module comes together and creates the server we'll use to host the Snowplow enrichment process!

We first load up the template files that we created previously and fill in all the variables they need for string interpolation, so that we can later save the rendered files to the server.

We create a t2.micro Ubuntu server with our key pair to secure the SSH connection.

There's a lifecycle.ignore_changes on the AMI so that Terraform doesn't try to replace the server whenever a new Ubuntu AMI comes out (which the data source would otherwise trigger).

security_groups sets up our firewall, and tags.Name conveniently adds a name that lets us identify the server in our list of EC2 instances.
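
Putting that together, the module's main.tf looks roughly like the sketch below. The resource and data source names are illustrative, and the syntax follows the Terraform 0.11-era style this series uses, so check the repo for the exact details:

```hcl
# snowplow-enrich/main.tf (abbreviated sketch -- names are illustrative)
data "template_file" "config" {
  template = "${file("${path.module}/config.hocon.tpl")}"

  vars {
    stream_in       = "${var.stream_in}"
    good_stream_out = "${var.good_stream_out}"
    bad_stream_out  = "${var.bad_stream_out}"
    aws_region      = "${var.aws_region}"
    access_key      = "${var.operator_access_key}"
    secret_key      = "${var.operator_secret_key}"
  }
}

data "template_file" "resolver" {
  template = "${file("${path.module}/resolver.js.tpl")}"
}

resource "aws_instance" "snowplow_enrich" {
  ami           = "${data.aws_ami.ubuntu.id}"
  instance_type = "t2.micro"
  key_name      = "${var.key_pair_name}"

  # Works in the default VPC; use vpc_security_group_ids in a custom VPC
  security_groups = ["${aws_security_group.snowplow_enrich.name}"]

  tags {
    Name = "snowplow-enrich"
  }

  # Don't replace the server every time Canonical publishes a new AMI
  lifecycle {
    ignore_changes = ["ami"]
  }

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = "${file(var.key_pair_loc)}"
  }

  # remote-exec provisioners (Java install, enrich setup) go here --
  # see the sketches below
}
```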

Our first remote-exec provisioner will install Java on the server so that we can run the enrichment application.
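
That first provisioner is just a couple of shell commands, along these lines. The package names are assumptions (any Java 8 runtime works, and I've thrown in unzip since the second provisioner will need it):

```hcl
# Inside the aws_instance resource: install a Java runtime for Stream Enrich
provisioner "remote-exec" {
  inline = [
    "sudo apt-get update",
    "sudo apt-get install -y default-jre unzip",
  ]
}
```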

The second remote-exec provisioner (sketched after this list):

  1. Writes our configuration files to the hard drive

  2. Downloads the Snowplow Streaming Enrich application and unzips it

  3. Adds a line to our crontab file that starts the enrichment application when the server boots

  4. Starts the enrichment application and disconnects it from our SSH session via nohup so that we can safely close the connection.
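
Here's a rough sketch of that second provisioner. The download URL, jar name, and file paths are assumptions based on how Snowplow distributed Stream Enrich at the time, and the real module may write the rendered templates another way (for example with file provisioners), so treat this as illustrative:

```hcl
# Inside the aws_instance resource: write configs, fetch Stream Enrich, start it
provisioner "remote-exec" {
  inline = [
    # 1. Write the rendered templates to disk
    "echo '${data.template_file.config.rendered}' > /home/ubuntu/config.hocon",
    "echo '${data.template_file.resolver.rendered}' > /home/ubuntu/resolver.json",

    # 2. Download and unzip the Stream Enrich application (URL is historical)
    "wget http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_${var.enrich_version}.zip",
    "unzip snowplow_stream_enrich_${var.enrich_version}.zip",

    # 3. Restart the application on reboot via cron
    "(crontab -l 2>/dev/null; echo '@reboot java -jar /home/ubuntu/snowplow-stream-enrich-${var.enrich_version}.jar --config /home/ubuntu/config.hocon --resolver file:/home/ubuntu/resolver.json') | crontab -",

    # 4. Start it now, detached from the SSH session
    "nohup java -jar /home/ubuntu/snowplow-stream-enrich-${var.enrich_version}.jar --config /home/ubuntu/config.hocon --resolver file:/home/ubuntu/resolver.json > enrich.log 2>&1 &",
  ]
}
```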

main.tf

Everything is in place for our module! Let's instantiate the module within our root main.tf file, alongside the rest of the existing Snowplow stack.

The additions we've made to this file are (sketched after this list):

  • Create analytics-enrich-good and analytics-enrich-bad Kinesis streams

  • Instantiate the snowplow-enrich module we've created with the proper variables.
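
Here's roughly what those additions could look like. I'm showing plain aws_kinesis_stream resources for brevity (the repo more likely reuses its kinesis-stream module), and the collector stream name below is a placeholder you'd swap for the "good" stream created in the previous post:

```hcl
# Root main.tf additions (sketch -- names and shard counts are illustrative)
resource "aws_kinesis_stream" "analytics_enrich_good" {
  name        = "analytics-enrich-good"
  shard_count = 1
}

resource "aws_kinesis_stream" "analytics_enrich_bad" {
  name        = "analytics-enrich-bad"
  shard_count = 1
}

module "snowplow_enrich" {
  source = "./snowplow-enrich"

  # Must match the collector's "good" output stream from the previous post
  stream_in       = "analytics-collected-good"
  good_stream_out = "${aws_kinesis_stream.analytics_enrich_good.name}"
  bad_stream_out  = "${aws_kinesis_stream.analytics_enrich_bad.name}"

  enrich_version = "0.13.0" # whichever Stream Enrich release you're targeting

  machine_ip    = "${var.machine_ip}"
  key_pair_name = "${var.key_pair_name}"
  key_pair_loc  = "${var.key_pair_loc}"
  aws_region    = "${var.aws_region}"

  operator_access_key = "${var.operator_access_key}"
  operator_secret_key = "${var.operator_secret_key}"
}
```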

Moving forward

In our next post, we'll create a Snowplow Sink that will let us start crunching the data that's moving through our new analytics stack!

In future posts we'll also detail how to create some applications that can read from these Kinesis streams and use them to analyze and act on your analytics data in real time.