Using Amazon EC2, Amazon Redshift and Amazon S3 with Data-Warehousing and Business Intelligence Tools

Take Advantage of the Scale of the Cloud to Deliver High Throughput on a Consistent Basis.

You have probably been hearing or reading a great deal about Amazon Web Services (AWS) and cloud computing, but you may not be sure what it looks like or what opportunities it presents to you and your organization. This article will serve as an eye-opener to the potential you could unlock with AWS.

Cloud computing services are no longer the future of IT but the present reality for data warehousing, BI and AI. According to 451 Research, 69% of enterprises will have multi-cloud/hybrid IT environments by 2019, and with a five-year CAGR of 19%, cloud computing as a service will reach $53.3 billion in 2021. Many companies are transitioning workloads to cloud-based infrastructure platforms, many are still in the early stages of experimentation with cloud-based services, and some are still wondering how to leverage the power of data in the cloud. This provides Angelify.Ai with an enormous opportunity to educate businesses as they move from cloud experimentation to running their most important mission-critical applications and workloads on the AWS Cloud.

Below, we discuss the power of visualizing data by applying data-warehousing and business-intelligence concepts with core Amazon Web Services (AWS) services: find underlying patterns in the data and help drive business decisions.

Using AWS, I connected an Amazon Redshift cluster, an Amazon EC2 instance and an Amazon S3 bucket in order to build a dashboard with Tableau Desktop. Amazon Redshift uses industry-standard SQL and can be accessed using standard JDBC and ODBC drivers.

Note: Your existing business intelligence tools can easily integrate with Amazon Redshift.

Topics covered:

  1. Overview of Amazon Redshift, Amazon EC2, Amazon S3
  2. Amazon Redshift primer
  3. Launch Amazon Redshift cluster
  4. Amazon S3
  5. Amazon Redshift integration with existing business intelligence tools like pgweb, a third-party web-based PostgreSQL client
  6. Query data from Amazon Redshift using pgweb
  7. Monitor Amazon Redshift performance
  8. Load Data into Amazon Redshift from Amazon S3
  9. Run queries in PostgreSQL to find underlying patterns in the data and help business decisions
  10. Connect to an Amazon Redshift cluster from Tableau
  11. Create a dashboard using Tableau

1- Overview of Amazon Redshift, Amazon EC2, Amazon S3

Amazon Redshift has the following capabilities:

  • Data warehousing and analytics
  • Fast
  • Fully managed
  • Petabyte-scale
  • Works with existing business-intelligence tools
  • Set up, operate and scale your data warehouse
  • Provision and monitor infrastructure capacity
  • Automate ongoing administrative tasks
  • Monitors nodes and drives for failure and recovery
  • Snapshots as backup
  • Secure data warehouse

Amazon Redshift supports:

  • JDBC
  • ODBC
  • PostgreSQL
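Because Amazon Redshift is based on PostgreSQL, most standard PostgreSQL drivers and clients can connect to it directly. As a quick sanity check after connecting (a minimal sketch; the exact string returned varies by cluster version):

-- Returns a PostgreSQL-style version string that also identifies
-- the Redshift build of the cluster you are connected to
SELECT version();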

Amazon EC2 (Amazon Elastic Compute Cloud) provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 eliminates the need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

2- Amazon Redshift Primer

Amazon Redshift delivers fast query and I/O performance for virtually any size of dataset via:

  • Columnar storage technology: Amazon Redshift stores data by column rather than storing all the values of a row together. This drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.
  • Parallelizing and distributing queries across multiple nodes: Amazon Redshift distributes the workload to each node in a cluster and processes work in parallel, allowing processing speed to scale along with storage.
  • Compression: a column-level operation that reduces the size of data when it is stored, which reduces disk I/O and therefore improves query performance.
  • Snapshots as backup: snapshots are point-in-time backups of a cluster, created automatically or manually. Amazon Redshift stores these snapshots internally in Amazon S3 using encrypted Secure Sockets Layer (SSL) connections. If you need to restore a cluster, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.

3- Launch Amazon Redshift cluster

  • Nodes and clusters: an Amazon Redshift data warehouse is a collection of computing resources called nodes; a collection of nodes is called a cluster. When provisioning a cluster, you are required to specify the node type and the number of nodes that will make up the cluster. The node type you select determines the storage size, memory, CPU and price of each node in the cluster.
  • Scalability: you can always scale the cluster in or out by adding or removing nodes, scale it up or down by specifying a different node type, or do both, should your needs change after the initial provisioning of the cluster.

4- Amazon S3

Amazon S3 is a durable, scalable object store, like a collection of static files. It features a web interface that can be used to store and retrieve any amount of data at any time, from anywhere on the web. It is storage for the internet.

5- Amazon Redshift integrates with existing business intelligence tools like pgweb, a third-party web-based PostgreSQL client.

The typical process for loading data into Amazon Redshift is:

  • Data is exported from a source system
  • The data is placed into an Amazon S3 bucket
  • The data is copied into Amazon Redshift tables using the COPY command
  • A SQL client is used to query Amazon Redshift
  • The results of the query are returned to the SQL client

See illustration below:
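In SQL terms, steps 3 to 5 of this process look roughly like the following sketch. The table name, bucket path and IAM role here are hypothetical placeholders, not the lab's actual values; Section 8 shows the real commands used in this lab.

-- Step 3: copy the exported files from Amazon S3 into a Redshift table
-- (the table, bucket path and IAM role below are placeholders)
COPY sales
FROM 's3://my-bucket/exports/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
CSV;

-- Steps 4 and 5: query the loaded table from any SQL client
SELECT COUNT(*) FROM sales;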

6. Query Data from Amazon Redshift Using the Web-Based PostgreSQL Client (pgweb)

In this lab, I will be loading USA domestic airline data for analysis. The data has been obtained from the United States Department of Transportation’s Bureau of Transportation Statistics. I have placed the transport data already into an Amazon S3 bucket in a compressed format.

Below are screenshots of the Amazon Redshift cluster, followed by the queries I ran to analyse the data in pgweb.

Node type

Choose a node type that meets your CPU, RAM, storage capacity and drive type requirements. For this lab, we create a cluster that uses the dc2.large node size, with 160 GB of storage per node.

7. Monitor Amazon Redshift performance

The following screenshots show the console views you can use to monitor Amazon Redshift performance:

 

  1. Amazon Redshift Clusters Query Monitoring
  2. Amazon Redshift Clusters
  3. Amazon Redshift Snapshots
  4. Amazon Redshift Queries and Loads
  5. Amazon Redshift Cluster Performance
  6. Amazon Redshift Maintenance and Monitoring
  7. Amazon Redshift Backup
  8. Amazon Redshift Properties

Web-Based PostgreSQL Client (“pgweb”) Connected to Amazon Redshift

To connect pgweb to Amazon Redshift, I opened a new web browser tab, pasted in the IP address for the pgweb client, and then configured the connection settings using the Host value obtained from the Redshift management console. Once the cluster is available, Redshift displays its Endpoint.

8. Load Data into Amazon Redshift from Amazon S3 and Run Queries in PostgreSQL

To load data, we first create a table called flights in Amazon Redshift by running this statement in pgweb:
CREATE TABLE flights (
  year           smallint,
  month          smallint,
  day            smallint,
  carrier        varchar(80) DISTKEY,
  origin         char(3),
  dest           char(3),
  aircraft_code  char(3),
  miles          int,
  departures     int,
  minutes        int,
  seats          int,
  passengers     int,
  freight_pounds int
);

After creating the table, we can load data into it. Remember that the data has already been placed into an Amazon S3 bucket, so it can be loaded into Amazon Redshift using the COPY command:

COPY flights
FROM 's3://us-west-2-aws-training/awsu-spl/spl-17/4.2.5.prod/data/flights-usa'
IAM_ROLE 'INSERT-YOUR-REDSHIFT-ROLE'
GZIP
DELIMITER ','
REMOVEQUOTES
REGION 'us-west-2';

It will take approximately three minutes to load the data.
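If the COPY command fails, or loads fewer rows than expected, Amazon Redshift records the details in the STL_LOAD_ERRORS system table. A minimal sketch for inspecting recent load problems:

-- Show the most recent load errors, newest first
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;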

The data loaded consists of:

  • 23 data files in CSV format
  • Comprising 6 GB of data
  • Compressed with GZIP down to only 700 MB of storage

The data files are loaded in parallel from Amazon S3, which distributes the work across multiple nodes. This is the most efficient way to load data into Amazon Redshift, since the load process is spread across multiple slices on the available nodes.

Each slice of a compute node is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages the distribution of data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
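You can see how the slices are laid out by querying the STV_SLICES system table (a minimal sketch; a two-node dc2.large cluster, as used in this lab, would be expected to show two slices per node):

-- List the slices available on each compute node
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;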

 

DISTKEY - Distribution Key

When you create a table, you have the option of specifying one column as the distribution key. The CREATE TABLE command run earlier in Section 8 designated the carrier (airline) field as the distribution key (DISTKEY). This means the data will be split across all available slices and nodes, but all the data relating to a given carrier will always reside on the same slice. This improves processing speed when performing operations on the carrier field, such as GROUP BY and JOIN operations, as the sketch below illustrates.
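For example, an aggregation on the distribution key can be computed largely within each slice, because every row for a given carrier lives on the same slice. A sketch using the flights table from Section 8:

-- Total departures per airline; rows for each carrier are co-located
-- on one slice, so the GROUP BY avoids redistributing data across nodes
SELECT carrier, SUM(departures) AS total_departures
FROM flights
GROUP BY carrier
ORDER BY total_departures DESC
LIMIT 10;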

9. Run Queries in PostgreSQL to Find Underlying Patterns in the Data and Help Business Decisions

You will see from the screenshot below the various queries that were run in pgweb. For instance,

SELECT COUNT(*) FROM flights;

returns a value of 96,825,753, meaning that almost 100 million rows were loaded into the flights table in under three minutes. You can query the data further with standard SQL commands.

Now that the tables are loaded, the next step is to run queries to find underlying patterns in the data and help business decisions.
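For instance, a simple aggregation can show how air traffic has changed over time (a sketch; the column names come from the CREATE TABLE statement in Section 8):

-- Total passengers carried per year across all airlines
SELECT year, SUM(passengers) AS total_passengers
FROM flights
GROUP BY year
ORDER BY year;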

To view 10 random rows of data, run this query:

SELECT *
FROM flights
ORDER BY random()
LIMIT 10;

Here is an explanation of the columns returned:

pgweb

The screenshot contains several queries used to generate and join tables in order to explain the underlying patterns in the ~100 million rows of data and help business decisions. One such join is sketched below.
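As an illustration, a join of this kind might decode the aircraft_code field against a small lookup table. The aircraft table here is a hypothetical example, not necessarily the exact table used in the lab:

-- Hypothetical lookup table mapping aircraft codes to model names
CREATE TABLE aircraft (
  aircraft_code char(3),
  aircraft      varchar(100)
);

-- After loading it, find the busiest aircraft types by departures
SELECT a.aircraft, SUM(f.departures) AS total_departures
FROM flights f
JOIN aircraft a ON f.aircraft_code = a.aircraft_code
GROUP BY a.aircraft
ORDER BY total_departures DESC
LIMIT 10;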

 

Amazon Redshift Partner Products in the AWS Marketplace

10. Connect to an Amazon Redshift cluster from Tableau

Install Tableau on the Windows Server desktop, following the recommended instructions and steps. You can choose the free trial, but you must complete the registration form. After registering, click Amazon Redshift in Tableau Desktop. Here you can start your analysis using the schema.
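Once connected, Tableau can read the schema's tables directly, or you can paste a query into its Custom SQL option. As a sketch, a pre-aggregated query like the following keeps the extract small instead of pulling all ~100 million rows (table and column names are from Section 8; Tableau's Custom SQL takes the statement without a trailing semicolon):

-- Pre-aggregate in Redshift so Tableau fetches summary rows only
SELECT year, month, carrier, origin,
       SUM(departures) AS departures,
       SUM(passengers) AS passengers
FROM flights
GROUP BY year, month, carrier, origin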

List of Tables Created Earlier and Loaded from Amazon S3, as Shown in the Amazon Redshift Schema

If Tableau cannot find the right relationship between the tables listed in the schema, as in the screenshot below, it will present you with an option to define the right join.

11. Create a dashboard using Tableau

We are now ready to create visualizations, or sheets, in Tableau to analyse and understand the delays in flight departures by airport, time and carrier.

Visualization in Tableau to Understand the Highest Delays by Airports in Relation to the Number of Flights

Visualization in Tableau Showing the Trend of Delayed Flights over the Years

References:

AWS Training and Certification (2020), Amazon Web Services, Inc. and its affiliates: Qwiklabs.