Take Advantage of the Scale of the Cloud to Deliver High Throughput on a Consistent Basis.
You have probably been hearing or reading a great deal about Amazon Web Services (AWS) cloud computing, but you may not be sure what it looks like or what opportunities it presents to you and your organization. This article will serve as an eye-opener into the potential you could unlock with AWS.
Cloud computing services are no longer the future of IT; they are the present reality for data warehousing, business intelligence (BI), and AI. According to 451 Research, 69% of enterprises will have multi-cloud or hybrid IT environments by 2019, and with a five-year CAGR of 19%, cloud computing as a service will reach $53.3 billion in 2021. Many companies are transitioning workloads to cloud-based infrastructure platforms, many are still in the early stages of experimenting with cloud-based services, and some are still wondering how to leverage the power of their data in the cloud. This gives Angelify.Ai an enormous opportunity to educate businesses as they move from cloud experimentation to running their most important mission-critical applications and workloads on the AWS Cloud.
On the following pages, we discuss the power of visualizing data by applying data-warehousing and business-intelligence concepts with core Amazon Web Services (AWS) offerings, to find underlying patterns in the data and help inform business decisions.
Using AWS, I connected an Amazon Redshift cluster, an Amazon EC2 instance, and an Amazon S3 bucket to build a dashboard in Tableau Desktop. Amazon Redshift uses industry-standard SQL and can be accessed with standard JDBC and ODBC drivers.
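For example, a JDBC connection string for a Redshift cluster follows the standard format shown below; the cluster name, region, and database name here are placeholders, and your own values come from your cluster's Endpoint:
jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev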
Note: Your existing business intelligence tools can easily integrate with Amazon Redshift.
Topics covered: Amazon Redshift, Amazon EC2, Amazon S3, loading data with the COPY command, querying with the pgweb client, and visualizing the results with Tableau Desktop.
Amazon Redshift provides columnar data storage, data compression, massively parallel processing (MPP) across nodes and slices, parallel data loading from Amazon S3 with the COPY command, and industry-standard SQL access through JDBC and ODBC drivers; several of these capabilities are demonstrated later in this article.
Amazon Redshift is typically used for data warehousing, business intelligence reporting, and large-scale analytics, which is exactly the kind of workload demonstrated in this article.
Amazon EC2 (Amazon Elastic Compute Cloud): provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 eliminates the need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 also enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing the need to forecast traffic.
Amazon Redshift delivers fast query and I/O performance for virtually any size of dataset via columnar data storage, data compression, and massively parallel query execution across multiple nodes and slices.
Amazon S3 (Amazon Simple Storage Service): a durable, scalable object store, similar to a collection of static files. It features a web interface that can be used to store and retrieve any amount of data at any time, from anywhere on the web. It is storage for the internet.
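Objects in S3 are addressed by bucket name and key. For example, the flight data loaded later in this article is referenced with an s3:// URI of the general form below (the bucket and key here are placeholders):
s3://bucket-name/path/to/object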
The typical data loading process for Amazon Redshift is to place the source data files in an Amazon S3 bucket and then load them into the target tables in parallel using the COPY command.
See illustration below:
In this lab, I will be loading USA domestic airline data for analysis. The data was obtained from the United States Department of Transportation's Bureau of Transportation Statistics, and I have already placed it in an Amazon S3 bucket in compressed format.
Here, I will show screenshots of the Amazon Redshift cluster and then the queries I ran in pgweb to analyse the data.
Node type
Choose a node type that meets your CPU, RAM, storage capacity, and drive type requirements. For this lab, we create a cluster that uses the dc2.large node type, with 160 GB of storage per node.
You can use the following screenshots to monitor Redshift performance:
Web-Based PostgreSQL Client (“pgweb”) Connected to Amazon Redshift
To connect Amazon Redshift with pgweb, I opened a new web browser tab, browsed to the IP address of the pgweb client, and configured the connection settings using the Host value obtained from the Redshift management console. Once the cluster is available, Redshift displays its Endpoint.
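As a minimal sketch, the pgweb connection settings look like the following; the host shown is a placeholder, and your Host value is the cluster Endpoint with the port and database name removed:
Host: examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com
Port: 5439
Username: (the master user name chosen when the cluster was created)
Password: (the master user password)
Database: dev (placeholder; use your cluster's database name)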
To load the data, we first create a table called flights in Amazon Redshift by running the following SQL in pgweb:
CREATE TABLE flights (
year smallint,
month smallint,
day smallint,
carrier varchar(80) DISTKEY,
origin char(3),
dest char(3),
aircraft_code char(3),
miles int,
departures int,
minutes int,
seats int,
passengers int,
freight_pounds int
);
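To confirm the table was created as intended, you can query Redshift's pg_table_def catalog view, which also shows whether a column is the distribution key (note that pg_table_def only lists tables on your current search_path):
-- List the columns of the flights table and flag the DISTKEY column
SELECT "column", type, distkey
FROM pg_table_def
WHERE tablename = 'flights';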
After creating the table, we can load data into it. Remember that the data has already been placed in an Amazon S3 bucket and can be loaded into Amazon Redshift with the COPY command:
COPY flights
FROM 's3://us-west-2-aws-training/awsu-spl/spl-17/4.2.5.prod/data/flights-usa'
IAM_ROLE 'INSERT-YOUR-REDSHIFT-ROLE'
GZIP
DELIMITER ','
REMOVEQUOTES
REGION 'us-west-2';
It takes approximately three minutes to load the data.
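If the COPY command fails or loads fewer rows than expected, a quick way to diagnose the problem is to query the stl_load_errors system table; a minimal sketch:
-- Show the most recent load errors, if any
SELECT starttime, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;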
The data loaded consists of U.S. domestic flight records, amounting to almost 100 million rows (as we verify with a row count below).
The data files are loaded from Amazon S3 in parallel, distributing the work across multiple nodes. This is the most efficient way to load data into Amazon Redshift, since the load process is spread across multiple slices on the available nodes.
Each slice of a compute node is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages the distribution of data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
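You can see this distribution for yourself by querying the svv_diskusage system view; a sketch, where col = 0 restricts the count to the table's first column so that rows are not counted once per column:
-- Approximate row counts per slice for the flights table
SELECT slice, SUM(num_values) AS num_rows
FROM svv_diskusage
WHERE name = 'flights' AND col = 0
GROUP BY slice
ORDER BY slice;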
DISTKEY - Distribution Key
When you create a table, you have the option to specify one column as the distribution key. The CREATE TABLE command we ran earlier in Section 8 designated the carrier (airline) field as the Distribution Key (DISTKEY). This means the data will be split across all available slices and nodes, but all data relating to a given carrier will always reside on the same slice. This improves processing speed when performing operations on the carrier field, such as GROUP BY and JOIN operations.
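For example, an aggregation such as the following can be computed largely within each slice, because all rows for a given carrier are co-located:
-- Total departures per carrier; benefits from carrier being the DISTKEY
SELECT carrier, SUM(departures) AS total_departures
FROM flights
GROUP BY carrier
ORDER BY total_departures DESC;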
The screenshot below shows the various queries that were run in pgweb. For instance, the query
SELECT COUNT(*) FROM Flights;
returns a value of 96,825,753, meaning that almost 100 million rows were loaded into the flights table in less than three minutes. You can use any standard SQL command to query the data.
Now that the tables are loaded, the next step is to run queries to find underlying patterns in the data and help inform business decisions.
To view 10 random rows of data, run this query:
SELECT *
FROM Flights
ORDER BY random()
LIMIT 10;
Here is an explanation of the columns returned:
XII. PGWEB
The screenshot contains several queries used to generate and join tables in order to reveal the underlying patterns in the ~100 million rows of data and help inform business decisions.
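As one illustrative example (not taken from the screenshot), a join of this kind might rank airports by traffic; the airports lookup table and its columns here are hypothetical stand-ins for the dimension tables listed in the next section:
-- Hypothetical example: top 10 origin airports by total departures
SELECT a.airport_name, SUM(f.departures) AS total_departures
FROM flights f
JOIN airports a ON f.origin = a.airport_code
GROUP BY a.airport_name
ORDER BY total_departures DESC
LIMIT 10;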
XIII. Amazon Redshift Partner Products in AWS Marketplace
Install Tableau Desktop on the Windows Server, following the recommended instructions and steps. You can choose the free trial, but you must complete the registration form. After registering, select the Amazon Redshift connector in Tableau Desktop. From here you can start your analysis using the schema.
List of Tables Created in Amazon Redshift and Loaded Earlier from Amazon S3, as Shown in the Schema
In the screenshot below, Tableau cannot determine the right relationship between the tables listed in the schema, so it presents an option to define the correct join.
We are now ready to create visualizations (sheets) in Tableau to analyse and understand flight departure delays by airport, time, and carrier.
Visualization in Tableau to Understand the Highest Delays by Airport in Relation to the Number of Flights
Visualization in Tableau Showing the Trend of Delayed Flights over the Years
References:
AWS Training and Certification, © 2020 Amazon Web Services, Inc. and its affiliates: Qwiklabs.