Part 1 - The Snowplow Collector

In the previous post, we discussed the benefits of using Snowplow Analytics and why it's important for modern businesses to own their analytics data.

In this post we're going to jump into setting up a Snowplow collector with Terraform. We'll be using the Scala Stream Collector, and we'll secure our collector behind SSL to keep our customers' data safe.

It may be helpful to check out the Github repo associated with this series while following along!


If you're just starting with this chapter, it may also be helpful to check out the other Snowplow Analytics tutorials or learn more about using Terraform.


What is a collector?

The collector is an HTTP endpoint that your apps can send analytics data to. It is designed to be reliable and to handle the large quantities of data thrown at it.

For example: if a user clicks a link on your website, that's an event. Your application sends this event to a collector, which hands it over to your backend analytics system for further processing.
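
To make this concrete, here's roughly what such a request could look like if fired by hand against the endpoint we'll set up at the end of this post. The path and parameter names follow the Snowplow tracker protocol (e for event type, p for platform, tv for tracker version, url for the page URL), but treat it purely as an illustration; in practice a Snowplow tracker library builds these requests for you:

curl "https://analytics.example.com/i?e=pv&p=web&tv=curl-0.1&url=https%3A%2F%2Fexample.com%2Fpricing"

The collector replies with a tiny tracking pixel and passes the event along to the pipeline behind it.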

Which collector are we using?

There are a number of Snowplow collectors to choose from. In this post, we'll be setting up the Scala Stream Collector.

The reason we're choosing this collector is that it removes some of the moving parts in Snowplow that would otherwise need to be configured. The fewer moving parts, the more stable our analytics system and the more time we can spend doing what matters.

We're also choosing the Scala Stream Collector because it can be used for real-time analytics. In the future, I might write a post about how to make use of these real-time capabilities. In the meantime, know that real-time processing is something the batch-processing approach used historically can't quite offer, and it can be really nice to have.

Our final directory structure will look like this:

- iam-users
- kinesis-stream
- snowplow-collector
- main.tf
- outputs.tf
- providers.tf
- variables.tf

Let's jump into this!

variables.tf

I think it's best to start out with the variables.tf and providers.tf files. They're how the whole thing is configured, and they're what will let you use Terraform to deploy the rest of the infrastructure.

Here's the code from our variables.tf file:

# AWS Config

variable "aws_access_key" {
  default = "YOUR_ACCESS_KEY_ID"
}

variable "aws_secret_key" {
  default = "YOUR_SECRET_KEY"
}

variable "aws_region" {
  default = "us-west-2"
}

variable "key_pair_name" {
  default = "your_key_name"
}

variable "key_pair_location" {
  default = "~/Documents/your_key_name.pem"
}

variable "ssl_acm_arn" {
  default = "arn:aws:acm:YOUR_AWS_REGION:YOUR_AWS_ID:certificate/YOUR_CERTIFICATE_ID"
}

aws_access_key, aws_secret_key, and aws_region are all pretty self-explanatory. If they aren't, you can check out my introduction to AWS.

key_pair_name is the name of the Key Pair stored in AWS that we'll use to SSH into the machine.

key_pair_location is where on the hard drive we can find the private key associated with the Key Pair name we used earlier.

ssl_acm_arn is the ARN of the ACM (Amazon Certificate Manager) certificate we'd like to use for SSL. This lets us serve HTTPS from our load balancer, which is important because we want usage data to remain encrypted in transit to our servers.
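
If you need to look up that ARN, the AWS CLI can list the certificates in your account (this is just a convenience for filling in the variable; Terraform doesn't need it):

aws acm list-certificates --region us-west-2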

providers.tf

The providers.tf file is dedicated to connecting to AWS and creating the resources we're about to define.
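
A minimal sketch of providers.tf, wiring the AWS provider to the variables we just defined, looks like this:

provider "aws" {
  access_key = "${var.aws_access_key}"
  secret_key = "${var.aws_secret_key}"
  region     = "${var.aws_region}"
}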

Run terraform init to download the proper AWS provider files.

IAM User

Now let's create the user that our collector will run as. We will do so by defining an IAM users module in our configuration. By putting this into a module, we ensure that we can expand upon it later without cluttering up the root directory.

The directory structure will look like this:

- iam-users
    - outputs.tf
    - snowplow-operator.tf

Let's break down what each of those files do.

snowplow-operator.tf

resource "aws_iam_user" "snowplow-operator" {
  name = "snowplow-operator-datapipeline"
  path = "/system/"
}

resource "aws_iam_access_key" "snowplow-operator" {
  user = "${aws_iam_user.snowplow-operator.name}"
}

resource "aws_iam_user_policy" "snowplow-operator" {
  name = "snowplow-policy-operator"
  user = "${aws_iam_user.snowplow-operator.name}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "kinesis:*"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

This file essentially follows these steps:

  1. Create the IAM user (snowplow-operator-datapipeline)

  2. Attach an Access Key to the user

  3. Attach a policy to the user that allows them to manipulate Kinesis streams

outputs.tf

output "operator-access-key" {
  value = "${aws_iam_access_key.snowplow-operator.id}"
}

output "operator-secret-key" {
  value = "${aws_iam_access_key.snowplow-operator.secret}"
}

Really, the only thing we care about for this user is their Access Key ID and Secret Access Key. Using these keys, our collector can authenticate as this user when talking to the AWS API, which gives it access to our Kinesis streams.

Kinesis Streams

We're going to turn the Kinesis stream into a module because it contains some boilerplate we can abstract away to prevent duplication. In this case, that's the shard-level metrics configuration.

The directory structure is going to look like this:

- kinesis-stream
    - main.tf
    - outputs.tf
    - variables.tf

Shard-level metrics are nice to have because they let us keep track of how much data we're sending through the system, but they can clutter up the configuration and distract from its main purpose.

variables.tf

variable "name" { type = "string" }
variable "shard_count" { default = 1 }
variable "retention_period" { default = 48 }

The only inputs we care about for now are name, shard_count and retention_period. All of them (besides name) have some sane defaults.
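
As a quick usage sketch (the stream name here is made up for illustration), a caller can override those defaults like so:

module "example-stream" {
  source      = "./kinesis-stream"
  name        = "ExampleStream"
  shard_count = 2
}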

main.tf

resource "aws_kinesis_stream" "main" {
  name             = "${var.name}"
  shard_count      = "${var.shard_count}"
  retention_period = "${var.retention_period}"

  shard_level_metrics = [
    "IncomingBytes",
    "OutgoingBytes",
  ]
}

Here we define the aws_kinesis_stream resource using the variables we just created, with shard-level metrics enabled.

outputs.tf

output "stream-name" { value = "${aws_kinesis_stream.main.name}" }

The only output we care about is the stream's name. We'll use this to identify it and connect it with other resources.

Snowplow Collector Module

Now that we've set up the foundational modules, we can jump into writing our Snowplow collector module. This'll be the largest of our modules.

The directory structure will look like this:

- snowplow-collector
    - ami.tf
    - config.hocon.tpl
    - elb.tf
    - main.tf
    - outputs.tf
    - security_groups.tf
    - variables.tf

variables.tf

variable "machine_ip" { type = "string" }
variable "key_pair_name" { type = "string" }
variable "key_pair_loc" { type = "string" }
variable "aws_region" { type = "string" }

variable "good_stream_name" { type = "string" }
variable "bad_stream_name" { type = "string" }
variable "collector_version" { default = "0.12.0"}

variable "operator_access_key" { type = "string" }
variable "operator_secret_key" { type = "string" }

variable "ssl_acm_arn" { type = "string" }

This module is going to take in a decent number of variables. Let's go through what each of them represent:

machine_ip - The IP of the machine that's running Terraform. The configuration keeps track of this in order to set up a very specific SSH rule that allows us to provision the instance after it is created.

key_pair_name and key_pair_loc - Details of the key pair that will allow us to SSH into the machine.

aws_region - The region in which we are creating these resources.

good_stream_name - The stream in which our collector will place good events: events that aren't too large and that follow the correct format.

bad_stream_name - The stream in which our collector will place bad events: events that are either too large or in an unknown format.

collector_version - The version of the Scala Stream Collector that we'd like to install on the machine.

operator_access_key and operator_secret_key - We created a Snowplow Operator IAM User in an earlier part of this post. We need the access key for that user in order to allow the collector to operate on Kinesis streams.

ssl_acm_arn - The ARN at which we can find our SSL key (stored in Amazon Certificate Manager) for the domain on which the collector will sit.

ami.tf

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

We've gone through this configuration before. This data source will find the most recent AMI for Ubuntu 16.04 that's available.

security_groups.tf

resource "aws_security_group" "snowplow-collector" {
  name        = "snowplow-collector"
  description = "Allow SSH inbound, all HTTP inbound, and all outbound traffic"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["${chomp(var.machine_ip)}/32"]
  }

  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Here we create the rules governing inbound and outbound communication for the collector server:

  • We're allowing inbound SSH connections from our computer's IP to help us provision the server after creation

  • We're allowing inbound HTTP connections from all machines

  • We're allowing all outbound connections so that we can download the proper packages and snowplow collector files

main.tf

# Configuration file
data "template_file" "collector-config" {
    template = "${file("${path.module}/config.hocon.tpl")}"

    vars {
        good-stream-name  = "${var.good_stream_name}"
        bad-stream-name   = "${var.bad_stream_name}"
        aws-region        = "${var.aws_region}"
        access-key        = "${var.operator_access_key}"
        secret-key        = "${var.operator_secret_key}"
    }
}

# EC2 Server
resource "aws_instance" "collector" {
  ami             = "${data.aws_ami.ubuntu.id}"
  instance_type   = "t2.micro"
  key_name        = "${var.key_pair_name}"

  security_groups = [
    "${aws_security_group.snowplow-collector.name}",
  ]

  tags {
    Name = "snowplow-collector"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y unzip openjdk-8-jdk",
    ]

    connection {
      type          = "ssh"
      user          = "ubuntu"
      private_key   = "${file("${var.key_pair_loc}")}"
    }
  }

  provisioner "remote-exec" {
    inline = [
      "cat <<FILEXXX > ~/config.hocon",
      "${data.template_file.collector-config.rendered}",
      "FILEXXX",
      "wget http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_${var.collector_version}.zip",
      "unzip snowplow_scala_stream_collector_${var.collector_version}.zip",
      "echo \"@reboot  java -jar /home/ubuntu/snowplow-stream-collector-${var.collector_version}.jar --config /home/ubuntu/config.hocon &> /home/ubuntu/collector.log\" | crontab -",
      "nohup java -jar /home/ubuntu/snowplow-stream-collector-${var.collector_version}.jar --config /home/ubuntu/config.hocon > /home/ubuntu/collector.log &",
      "sleep 1"
    ]

    connection {
      type          = "ssh"
      user          = "ubuntu"
      private_key   = "${file("${var.key_pair_loc}")}"
    }
  }
}

Here's where it all comes together for this module. Let's break this down slightly further, by resource:

template_file.collector-config

This data source represents our configuration file. It loads config.hocon.tpl and replaces the variable placeholders in the file (like ${good-stream-name}) with the proper values from Terraform. In this context, this is often referred to as string interpolation.

Template files like this are extremely powerful because they allow us to configure software on the machine with contents we couldn't have known in advance. This ability is displayed here when we fill in ${access-key} and ${secret-key} with the values from the IAM user that we created earlier in the configuration.
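
The template itself isn't listed here; the reference config.hocon that ships with the collector is the better starting point, and its exact keys vary between collector versions. The part that matters for this post is that the Terraform placeholders sit wherever the stream names, region, and credentials belong, schematically something like this (abbreviated and illustrative only, not a complete config):

collector {
  interface = "0.0.0.0"
  port = 8080

  # Placeholders below are filled in by the template_file data source above
  streams {
    good = "${good-stream-name}"
    bad  = "${bad-stream-name}"

    sink {
      region     = "${aws-region}"
      access-key = "${access-key}"
      secret-key = "${secret-key}"
    }
  }
}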

We'll use this file in the next step!

aws_instance.collector

This entire module is brought together in this one resource:

  • We use the AWS AMI (Ubuntu) that we found with the data source in ami.tf for the operating system.

  • We add the key pair that we defined in variables.tf to set up SSH access.

  • The security groups we defined in security_groups.tf will be used to allow us to SSH into the machine, open port 8080 to the public, and allow outbound connections for package installation.

  • We're adding a Name tag to easily identify what this EC2 instance is doing, when looking through EC2 instances.

We're going to add two provisioning steps to this EC2 instance. Once these steps stabilize, we could bake them into a custom AMI, but that wouldn't let us track and change them as easily, and it wouldn't make for a very explanatory tutorial.

aws_instance.collector Provisioning

The provisioners are remote-exec provisioners. They will use the SSH details we've set up (in the connection attribute) to log into the machine after it has been created.

The inline attribute of each provisioner is a list of commands to execute on the machine once an SSH connection has been established.

The first provisioner is going to log into the machine and install unzip and openjdk-8-jdk. These are needed to unzip and run the Scala Stream Collector, respectively.

The second provisioner does a few things:

  • Writes the configuration file to the server, with the template file we specified earlier.

  • Downloads the Scala Stream Collector zip file and unzips it into the home directory

  • Adds an @reboot entry to the crontab so that whenever the machine reboots, the collector is restarted

  • Starts the collector with the configuration file (via nohup, so the collector keeps running after we leave the SSH session)

That's it for installing and running the collector!

elb.tf

This is the final configuration for this module. It sets up the ELB (Elastic Load Balancer) and points it at our collector instance.

We want a load balancer here for a couple reasons:

  1. The Elastic Load Balancer maintains stats on the requests that go through it, including how many succeed and how many fail. We can keep an eye on these to make sure the collector is working properly.

  2. If traffic gets too high, we can create more instances of the collector and put them behind the same load balancer to split the traffic between the instances.

  3. The ELB gives us HTTPS easily and for free. We don't need to modify the collector at all, and we get in-transit encryption. Given the sensitive information analytics data can contain, it's a good rule to always have HTTPS on your collector endpoint.

Here's the Terraform configuration file:

resource "aws_elb" "snowplow-collector" {
  name               = "snowplow-collector-elb"
  availability_zones = ["${var.aws_region}a", "${var.aws_region}b", "${var.aws_region}c"]

  listener {
    instance_port      = 8080
    instance_protocol  = "http"
    lb_port            = 443
    lb_protocol        = "https"
    ssl_certificate_id = "${var.ssl_acm_arn}"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "HTTP:8080/health"
    interval            = 30
  }

  instances                   = ["${aws_instance.collector.id}"]
  cross_zone_load_balancing   = true
  idle_timeout                = 400
  connection_draining         = true
  connection_draining_timeout = 400

  tags {
    Name = "snowplow-collector-elb"
  }
}

We give it a name and set up some availability zones to which it will route connections.

We create an SSL-terminated listener on port 443 and point it at the HTTP port on our collector instance (port 8080). This encrypts traffic across the internet while using plain HTTP within our own VPC.

Then we set up a health_check configuration. This health check periodically pings our collector's /health endpoint to make sure it's still online; if it isn't, the ELB stops routing traffic to it.

Lastly, we attach the instances we've created for our collector.
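
outputs.tf

The module's outputs.tf (listed in the directory structure above) only needs to expose the load balancer's DNS name so the root configuration can surface it. A minimal sketch, assuming the output is named elb_dns to match how it's referenced later:

output "elb_dns" { value = "${aws_elb.snowplow-collector.dns_name}" }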

Bringing it all together

In the root directory, we'll use all of these modules we've just created to complete our Snowplow Collector infrastructure.

main.tf

# Build necessary IAM roles
module "snowplow-users" { source = "iam-users" }

# Set up Kinesis Streams
module "analytics-good" {
  source = "kinesis-stream"
  name = "AnalyticsCollector-Good"
}

module "analytics-bad" {
  source = "kinesis-stream"
  name = "AnalyticsCollector-Bad"
}

# Get local machine's IP
data "http" "my-ip" {
  url = "http://icanhazip.com"
}

module "snowplow-collector" {
  source              = "./snowplow-collector"
  aws_region          = "${var.aws_region}"
  machine_ip          = "${data.http.my-ip.body}"
  key_pair_name       = "${var.key_pair_name}"
  key_pair_loc        = "${var.key_pair_location}"
  operator_access_key = "${module.snowplow-users.operator-access-key}"
  operator_secret_key = "${module.snowplow-users.operator-secret-key}"

  good_stream_name    = "${module.analytics-good.stream-name}"
  bad_stream_name     = "${module.analytics-bad.stream-name}"
  ssl_acm_arn         = "${var.ssl_acm_arn}"
}

After all of the modules we've created, this file is pretty self-explanatory!

outputs.tf

output "collector-elb-cname" { value = "${module.snowplow-collector.elb_dns}" }

Because the collector will essentially run itself from now on, all we should really need to care about is the Elastic Load Balancer's DNS name.

After you run terraform apply, you should get some output like the following:

Apply complete! Resources: 8 added, 0 changed, 0 destroyed.

Outputs:

collector-elb-cname = snowplow-collector-elb-111044469.us-west-2.elb.amazonaws.com

Final Step

The final step to setting up your collector is to point a CNAME record at it in your DNS settings.

To point analytics.example.com at the collector, you would create a CNAME record for example.com with analytics as the name and snowplow-collector-elb-111044469.us-west-2.elb.amazonaws.com as the value.
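
In BIND zone-file notation (the TTL here is arbitrary), that record would look something like:

analytics.example.com.  300  IN  CNAME  snowplow-collector-elb-111044469.us-west-2.elb.amazonaws.com.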

We could then check that the collector is working by going to:

https://analytics.example.com/health

The https is important because the load balancer only listens on port 443; plain HTTP requests to it would time out. We set it up this way to prevent our code from accidentally pointing at an insecure collector endpoint in the future.
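
A quick way to verify this from the command line is with curl; a healthy collector should respond with HTTP 200 (the exact response body may differ between collector versions):

curl -i https://analytics.example.com/health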

Moving forward

In future posts, I'll detail how to set up a streaming enrichment process and a data storage sink for our new collector, in Terraform!

You can also take a look at the Github page for this Terraform configuration we're working on.
