Foundation Data Pipeline with Google Cloud

Introduction

This architecture uses click-to-deploy so you can spin up infrastructure in minutes using Terraform!

In today's world, data matters more than ever. It can be used to make better decisions, improve efficiency, and gain a competitive edge. However, data management can be challenging. It can be difficult to keep data organized, secure, and accessible.

That's where data pipelines come in. Data pipelines are a set of processes that move data from one place to another. They can be used to collect data from different sources, transform it, and load it into a data warehouse or data lake.

The following click-to-deploy architecture deploys a secure and scalable data pipeline that can help you achieve data agility with Google Cloud Storage, Dataflow and BigQuery. It follows the principle of least privilege, so each component is granted only the minimum permissions it needs to transfer data.

You can learn more about cloud-based ETL here.

This repo is based on the Cloud Foundation Fabric blueprint available here.

Use cases

Whether you’re transferring from another Cloud Service Provider or you’re taking your first steps into the cloud with Google Cloud, building a data pipeline sets a good foundation to begin deriving insights for your business.

These are some examples of the use cases you can build on top of this architecture:

  • Any application that requires data to be transformed before it can be analyzed.
  • Data Privacy: Dataflow allows you to call the Data Loss Prevention API to identify PII that you might not want to have in a readable format and de-identify it.
  • Anomaly Detection: building data pipelines to identify cyber security threats or fraudulent transactions using machine learning (ML) models.
  • Interactive Data Analysis: carry out interactive data analysis with BigQuery BI Engine that enables you to analyze large and complex datasets interactively with sub-second query response time and high concurrency.
  • Predictive Forecasting: building solid pipelines to capture real-time data for ML modeling and using it as a forecasting engine for situations ranging from weather predictions to market forecasting.
  • Create machine learning models: using BigQuery ML you can create and execute machine learning models in BigQuery with standard SQL queries, choosing from a variety of model types built into BigQuery and training them on your own data.

Architecture

GCS to BigQuery High-level diagram

The main components that we will be setting up are (to learn more about these products, click on the hyperlinks):

  • Cloud Storage (GCS) bucket: data lake solution to store extracted raw data that must undergo some kind of transformation.
  • Cloud Dataflow pipeline: to build fully managed batch and streaming pipelines, using Apache Beam, that transform the data stored in the GCS bucket and make it ready for processing in the data warehouse.
  • BigQuery datasets and tables: to store the transformed data, query it using SQL, build reports on it, or start training machine learning models without having to move your data out.
  • Service accounts (created with least privilege on each resource): one for uploading data into the GCS bucket, one for Orchestration, one for Dataflow instances and one for the BigQuery tables. You can also configure users or groups of users to assign them a viewer role on the created resources and the ability to impersonate service accounts to test the Dataflow pipelines before automating them with a tool like Cloud Composer.

For a full list of the resources that will be created, please refer to the GitHub repository for this project. If you're migrating from another cloud provider, refer to this documentation to see the equivalent services and comparisons for Microsoft Azure and Amazon Web Services.

Costs

Pricing estimates - We have created a sample estimate based on the usage we typically see from new startups looking to scale. It gives you an idea of how much this deployment would cost per month at that scale, and you can adjust it to the scale you prefer. Here's the link.

Setup

This solution assumes you already have a project created and set up where you wish to host these resources. If not, and you want the system to create a new project for you, please refer to the GitHub repository for detailed instructions.

Prerequisites

Roles & Permissions

In order to spin up this architecture, you will need to be a user with the "Project Owner" IAM role on the existing project.

Note: For granting a user a role, consult the documentation on Granting and Revoking Access.

Deploy the architecture

Before we deploy the architecture, you will need the following information:

  • The service project ID.
  • A unique prefix that you want all the deployed resources to have (for example: awesomestartup). This must be a string with no spaces or tabs.
  • A list of groups or users with the Service Account Token Creator role on the service accounts, in IAM format, e.g. 'group:group@domain.com'.

Click on the button below. Sign in if necessary. When the prompt appears, click 'Confirm'. The system will guide you through the setup process for your architecture.

Open in Cloud Shell

This is the startup screen that appears after clicking the button and confirming:

cloud_shell

Throughout the process, you will be prompted for user input. Refer to the variables explained at the bottom of this README. If the process fails, click the button again to retry.

🎉 Congratulations! 🎉
You have successfully deployed the foundation for running your first ETL pipeline on Google Cloud.
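
Before moving on, one quick sanity check is to list the service accounts that the deployment created with least privilege (a small sketch; replace the placeholder with your service project ID, and expect to see accounts such as gcs-landing, orchestrator and df-loading):

    gcloud iam service-accounts list --project=[SERVICE_PROJECT_ID]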

Testing your architecture

For the purpose of demonstrating how the ETL pipeline flow works, we’ve set up an example pipeline for you to run. First of all, we assume all the steps are run by a user listed in the data_eng_principals variable (or a user that belongs to one of the groups you specified). Authenticate the user with the following command and make sure your active Cloud Shell session is set to the service project:

    gcloud auth application-default login

Follow the instructions in Cloud Shell to authenticate the user.
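
If your active Cloud Shell session is not yet pointing at the service project, you can set it explicitly (a small sketch using the standard gcloud config command; replace the placeholder):

    gcloud config set project [SERVICE_PROJECT_ID]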

To make the next steps easier, create two environment variables with the service project id and the prefix:

    export SERVICE_PROJECT_ID=[SERVICE_PROJECT_ID]
    export PREFIX=[PREFIX]

Again, make sure you’re in the following directory:

    cloudshell_open/cloud-foundation-fabric/blueprints/data-solutions/gcs-to-bq-with-least-privileges
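
If you are not already in that directory, a hedged example of changing into it (this assumes the repository was cloned under your Cloud Shell home directory, which is where the Open in Cloud Shell flow normally places it):

    cd ~/cloudshell_open/cloud-foundation-fabric/blueprints/data-solutions/gcs-to-bq-with-least-privileges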

For the purpose of the example, we will import a CSV file from GCS into BigQuery. The file has the following structure:

    name,surname,timestamp

We need to create 3 files:

  • person.csv This file should contain your data in the format name,surname,timestamp. For instance: `Eva,Rivarola,1637771951`.
  • person_udf.js This file should contain the User-Defined Function (UDF) JavaScript code used by the Dataflow template.
  • person_schema.json This file should contain the table schema used to import the CSV data.

An example of those files can be found in the folder ./data-demo inside the same repository you're currently in.
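
If you would rather write your own files than copy the examples, the sketch below shows what minimal versions might look like, created as shell here-documents. The "BigQuery Schema" wrapper and the transform function name follow the conventions of the GCS_Text_to_BigQuery template used later in this guide; treat the exact contents as illustrative rather than a copy of the demo files.

    # person.csv: one row per person in the name,surname,timestamp format
    cat > person.csv <<'EOF'
    Eva,Rivarola,1637771951
    EOF

    # person_udf.js: UDF called by the Dataflow template; the function name
    # matches the javascriptTextTransformFunctionName=transform parameter
    cat > person_udf.js <<'EOF'
    function transform(line) {
      var values = line.split(',');
      return JSON.stringify({
        name: values[0],
        surname: values[1],
        timestamp: values[2]
      });
    }
    EOF

    # person_schema.json: destination table schema in the format the template expects
    cat > person_schema.json <<'EOF'
    {
      "BigQuery Schema": [
        {"name": "name", "type": "STRING"},
        {"name": "surname", "type": "STRING"},
        {"name": "timestamp", "type": "TIMESTAMP"}
      ]
    }
    EOF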

You can copy the example files into the GCS bucket by running:

    gsutil -i gcs-landing@$SERVICE_PROJECT_ID.iam.gserviceaccount.com cp data-demo/* gs://$PREFIX-data

After completion, the three essential files required to execute the Dataflow Job will be copied to the GCS bucket created alongside the resources.
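
You can optionally confirm the upload by listing the bucket contents while impersonating the same landing service account (a small sketch):

    gsutil -i gcs-landing@$SERVICE_PROJECT_ID.iam.gserviceaccount.com ls gs://$PREFIX-data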

Run the following command to start the dataflow job:

    gcloud --impersonate-service-account=orchestrator@$SERVICE_PROJECT_ID.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --project $SERVICE_PROJECT_ID \
    --region europe-west1 \
    --disable-public-ips \
    --subnetwork https://www.googleapis.com/compute/v1/projects/$SERVICE_PROJECT_ID/regions/europe-west1/subnetworks/subnet \
    --staging-location gs://$PREFIX-df-tmp \
    --service-account-email df-loading@$SERVICE_PROJECT_ID.iam.gserviceaccount.com \
    --parameters \
    javascriptTextTransformFunctionName=transform,\
    JSONPath=gs://$PREFIX-data/person_schema.json,\
    javascriptTextTransformGcsPath=gs://$PREFIX-data/person_udf.js,\
    inputFilePattern=gs://$PREFIX-data/person.csv,\
    outputTable=$SERVICE_PROJECT_ID:datalake.person,\
    bigQueryLoadingTemporaryDirectory=gs://$PREFIX-df-tmp

This command will start a Dataflow job called test_batch_01 that uses a Google-provided Dataflow template stored in the public GCS bucket:

    gs://dataflow-templates/latest/GCS_Text_to_BigQuery

The expected output is the following:

second_output
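
While the job is running, you can also check its status from the command line instead of the console (a sketch that impersonates the same orchestrator service account used to launch the job):

    gcloud --impersonate-service-account=orchestrator@$SERVICE_PROJECT_ID.iam.gserviceaccount.com \
    dataflow jobs list \
    --project $SERVICE_PROJECT_ID \
    --region europe-west1 \
    --status active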

Then, if you navigate to Dataflow on the console, you will see the following:

dataflow_console

This shows that the job you started from Cloud Shell is currently running in Dataflow. If you click on the job name, you can see the job graph and how every step of the Dataflow pipeline is progressing:

dataflow_execution

Once the job completes, navigate to BigQuery in the console and, under SERVICE_PROJECT_ID → datalake → person, you will see the data that was successfully imported into BigQuery through the Dataflow job.
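
You can also query the imported rows directly from Cloud Shell (a minimal sketch using the bq CLI; the deployment's command_03_bq output prints a ready-made equivalent you can use instead):

    bq query --project_id=$SERVICE_PROJECT_ID --use_legacy_sql=false \
    "SELECT * FROM datalake.person LIMIT 10"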

Cleaning up your environment

The easiest way to remove all the deployed resources is to run the following command in Cloud Shell:

    deploystack uninstall

The above command will remove the associated resources, so no billable charges will be incurred afterwards.

Note: this will also delete the BigQuery dataset, because the delete_contents_on_destroy option in main.tf is set to true.

Variables

name | description | type | required | default
--- | --- | --- | --- | ---
prefix | Unique prefix used for resource names. Not used for project if 'project_create' is null. | string | ✓ |
project_id | Project id, references existing project if 'project_create' is null. | string | ✓ |
cmek_encryption | Flag to enable CMEK on GCP resources created. | bool | | false
data_eng_principals | Groups with Service Account Token Creator role on service accounts, in IAM format, e.g. 'group:group@domain.com'. | list(string) | | []
network_config | Shared VPC network configurations to use. If null, networks will be created in projects with preconfigured values. | object({…}) | | null
project_create | Provide values if project creation is needed; uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | object({…}) | | null
region | The region where resources will be deployed. | string | | "europe-west1"
vpc_subnet_range | IP range used for the VPC subnet created for the example. | string | | "10.0.0.0/20"
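
If you choose to run the underlying Terraform blueprint directly rather than through the guided prompt, the variables above would typically be supplied in a terraform.tfvars file. A minimal sketch with placeholder values (only the two required variables plus the optional principals list):

    cat > terraform.tfvars <<'EOF'
    prefix              = "awesomestartup"
    project_id          = "your-service-project-id"
    data_eng_principals = ["group:group@domain.com"]
    EOF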

Outputs

name | description | sensitive
--- | --- | ---
bq_tables | BigQuery tables. |
buckets | GCS bucket Cloud KMS crypto keys. |
command_01_gcs | gcloud command to copy data into the created bucket impersonating the service account. |
command_02_dataflow | Command to run the Dataflow template impersonating the service account. |
command_03_bq | BigQuery command to query imported data. |
project_id | Project id. |
service_accounts | Service accounts. |
