How to configure AWS Kinesis Firehose to copy into Redshift

amazon-web-services

Problem

I have set up a Firehose delivery stream and selected options that I expected would result in the streamed data ending up in Redshift. I have a single Firehose and a single target table in Redshift. The data is showing up correctly in the intermediate S3 bucket, but it is not making it to Redshift.

I have allowed the Firehose region CIDRs [1] access to the cluster via Redshift security groups. I have given the Firehose the credentials to the Redshift cluster and have tested them. The cluster is publicly accessible.

The STL_LOAD_ERRORS table, where I would expect to see the errors produced by any COPY statement that was attempted and failed, is empty. There are also no DeliveryToRedshift events in CloudWatch. This leads me to believe that no COPY attempt is even reaching Redshift.
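
For anyone reproducing this, the two checks above correspond to queries like the following against Redshift's standard system tables (a sketch; nothing in it is specific to my setup):

```sql
-- Any COPY that was attempted and failed should leave rows here:
SELECT starttime, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;

-- Whether Firehose has issued any COPY at all:
SELECT starttime, TRIM(querytxt) AS querytxt
FROM stl_query
WHERE querytxt ILIKE '%copy%'
ORDER BY starttime DESC
LIMIT 10;
```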

I am using the Redshift COPY options to specify a jsonpaths file for the copy. I currently have: json "s3://app-event-data/_jsonpaths/_bigtable.jsonpath.json". That file exists in S3 at that location and follows the jsonpaths format described in [3]. I also tried adding credentials to that field, but I don't think that is necessary.
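
For reference, a jsonpaths file of the kind [3] describes, and the COPY that Firehose should end up issuing with it, look roughly like this. The table name, column keys, FROM clause, and credentials below are illustrative guesses, not my actual setup; Firehose fills in the FROM and CREDENTIALS parts itself:

```sql
-- Hypothetical contents of s3://app-event-data/_jsonpaths/_bigtable.jsonpath.json:
-- {
--   "jsonpaths": ["$['user_id']", "$['event_name']", "$['created_at']"]
-- }
--
-- The COPY Firehose would issue then looks something like:
COPY bigtable
FROM 's3://app-event-data/<manifest written by Firehose>'
CREDENTIALS '<supplied by Firehose>'
MANIFEST
JSON 's3://app-event-data/_jsonpaths/_bigtable.jsonpath.json';
```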

Update 1

I enabled logging in Redshift, which showed me that there are successful authentications being made from Firehose. I then created a new parameter group with user activity logging enabled and rebooted the cluster. Waiting for the next log batch now.
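
The authentications show up in the standard connection log; this is the sort of query I mean (the username filter is a placeholder for whatever Redshift user the Firehose is configured with):

```sql
-- Recent connection events; Firehose appears under the user configured
-- on the delivery stream ('firehose_user' is an example username):
SELECT recordtime, event, remotehost, username
FROM stl_connection_log
WHERE username = 'firehose_user'
ORDER BY recordtime DESC
LIMIT 20;
```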

References

  1. http://docs.aws.amazon.com/firehose/latest/dev/firehose-dg.pdf
  2. https://aws.amazon.com/kinesis/firehose/
  3. http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html

Why Server Fault? It's the best Stack Exchange site for asking about Amazon AWS.

I would have tagged this question: amazon-kinesis amazon-firehose amazon-redshift

Best Answer

Here are all of the things I initially misconfigured and eventually adjusted to get Firehose to properly copy data into Redshift:

  1. Make sure the bucket you select is in the same region as your Redshift cluster. You can specify the REGION option in the COPY options if for some reason you want to incur additional cross-region data transfer charges (a sketch follows the screenshots below). You cannot change the region of a bucket after you create it.

(Screenshot: bucket region selector)

This is a bit confusing at first because the region selector in the buckets console says "global".

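If you really do want to keep the bucket in another region, the REGION option goes into the Firehose COPY options field; a minimal sketch (the region value is an example):

```sql
-- Added to the Firehose "COPY options" field, alongside the other
-- options; names the region the S3 bucket lives in (example value):
REGION 'us-west-2'
```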

  2. Firehose will not quote identifiers for you, so if the table name you provide in the form requires quotes, you need to add them yourself (example after the screenshot). The same goes for column names.

(Screenshot: table name specification)
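
For example, names that need quoting have to be entered already quoted in the form, so that the generated COPY comes out correctly (all names here are hypothetical):

```sql
-- Table field in the Firehose form:    "big-table"
-- Columns field in the Firehose form:  "user-id", "event-name"
-- The generated COPY interpolates both verbatim:
COPY "big-table" ("user-id", "event-name")
FROM 's3://app-event-data/<manifest written by Firehose>'
CREDENTIALS '<supplied by Firehose>'
MANIFEST
JSON 's3://app-event-data/_jsonpaths/_bigtable.jsonpath.json';
```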

  3. Use single quotes only in the COPY options; in Redshift, double quotes delimit identifiers, not string literals. (This appears to be exactly what I had wrong in the question above.)
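
For example, with the jsonpaths path from my question (only the quoting differs):

```sql
-- Works: a single-quoted string literal
JSON 's3://app-event-data/_jsonpaths/_bigtable.jsonpath.json'

-- Fails: double quotes make Redshift parse the path as an identifier
-- JSON "s3://app-event-data/_jsonpaths/_bigtable.jsonpath.json"
```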

  4. Give Firehose access to your Redshift cluster by authorizing the correct CIDR/IP from the Redshift security tab. The correct CIDR/IP for the region your Firehose was created in is listed in the Firehose documentation [1]. You don't need to create a new cluster security group to do this; you can just add the rule to the default one. If you do add a new security group, you need to reconfigure your cluster from the dashboard to use the new security group, since it can't have two.

  5. Redshift COPY doesn't accept timestamps with time zones other than UTC. If you are using a format such as RFC 3339, you need to convert the times to UTC first and specify TIMEFORMAT 'auto' in the COPY options.
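
Concretely, the sketch below shows both halves of that fix (the timestamp values are illustrative):

```sql
-- In the Firehose "COPY options" field, alongside the JSON option:
TIMEFORMAT 'auto'
-- The records themselves must already carry UTC times; an RFC 3339
-- value like 2016-02-07T15:04:05-08:00 has to be shipped as
-- 2016-02-07T23:04:05Z before COPY will accept it.
```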

  6. In order to have any idea what is going on when things aren't working, you need to enable user activity logging in Redshift (see the query sketch after this list) by:

  • enabling logging (obviously)

  • creating a new parameter group

  • setting enable_user_activity_logging to true

  • modifying your cluster to use the new parameter group

  • rebooting your cluster

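While waiting for a log batch to be delivered, you can also read recent statement text directly from the cluster; a sketch using the standard SVL_STATEMENTTEXT view (long statements are split across multiple rows):

```sql
-- Text of recently run statements, filtered to COPY, to see exactly
-- what Firehose executed (or to confirm that it executed nothing):
SELECT starttime, TRIM(text) AS statement_text
FROM svl_statementtext
WHERE text ILIKE '%copy%'
ORDER BY starttime DESC
LIMIT 20;
```
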
"Enable user activity logging" achievement unlocked.