EXPLORING DIFFERENT SERVICES IN GCP

Exploration and documentation of different services offered in GCP

GCP Services for ML Engineering and Data Engineering

GCP_Workspace

Documenting in Notion: link, for easy updatability. I will copy from Notion to the blog at regular intervals.

The following is the plan to explore the services and read the documentation for each.

Google Cloud Learning Resources Roadmap provided by Claude AI - planned using the Claude Haiku LLM, which gave me resources and a plan to explore GCP in a short period of time.

A rapid plan to explore and get familiar with GCP services and technologies.

Cloud Fundamentals & Data Engineering Resources

Cloud Console & Infrastructure Setup

Cloud Fundamentals & Networking

Data Engineering Foundations

Advanced Data Engineering

Machine Learning Infrastructure

Practical ML Project Resources

Deep Learning & Advanced Implementations

Deep Learning Foundations

Practical Deep Learning

Advanced ML Techniques

Real-world Project Resources

Cloud Cost Optimization

Project Documentation & Performance

Additional Comprehensive Learning Resources

🚀 Pro Learning Tips

  1. Parallel Learning
     - Open multiple browser tabs for comprehensive resource exploration
     - Create a structured learning environment

  2. Active Learning Techniques
     - Take detailed, structured notes for each resource
     - Screenshot key configurations and code snippets
     - Create a personal knowledge repository

  3. Practical Implementation
     - Practice immediate implementation of learned concepts
     - Build small projects to reinforce understanding
     - Experiment with different cloud services and tools

Recommended Learning Approach

After learning all these services, I need to build a product.

Weather Data Analysis Dashboard

Complexity: Beginner

Key GCP Services:

Project Features:

Implementation of the plan:

Cloud Fundamentals & Data Engineering Resources:

Setting up:

Project setup and Google Cloud SDK installation

Cloud Console & Infrastructure Setup

Official Setup Guide: Cloud Resource Manager

Project Name : GCP Exploration

Project ID : gcp-exploration-443707

Manually created the project

project.png

Free Credits: Google Cloud Free Tier

I currently have free credits with a longer expiry to use for bigger projects, thanks to the Google Cloud team. There are also always-free tier thresholds, such as an e2-micro instance and up to 5 GB of storage; the detailed usage limits for the different services are listed in the link.

SDK Installation: Google Cloud SDK Install

| Platform | Package | Size | SHA256 Checksum |
| --- | --- | --- | --- |
| macOS 64-bit (ARM64, Apple M1 silicon) | google-cloud-cli-darwin-arm.tar.gz | 53.9 MB | d986d0c6531be038b71218f8e7e666c5b4d18ef580d6a063550406ed07e460f9 |

gcloudsdk.png

gcloudsdk1

Cloud Fundamentals & Networking

Compute Engine Tutorial: Quickstart Linux

Enabling the Compute Engine API

https://console.cloud.google.com/apis/library/compute.googleapis.com?inv=1&invt=Abja6A&project=gcp-exploration-443707

computeapi.png

Creating a Linux VM in GCP

vm_instance.png

web_ssh.png

gcloud auth login

gcloud compute ssh --zone "asia-south1-c" "gcpexploration-1" --project "gcp-exploration-443707"

Connected to VM via local terminal

local_ssh.png

Networking Guide: VPC Overview

IAM Documentation: Identity Management Quickstart

Note: The free credits provided are exhausted, so I will continue exploring using https://www.cloudskillsboost.google/.

BigQuery Quickstart

big-query

SELECT * FROM ecommerce.products
WHERE name LIKE '%Aluminum%'
LIMIT 1000;

# pull what sold on 08/01/2017
CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170801 AS
SELECT
  productSKU,
  SUM(IFNULL(productQuantity,0)) AS total_ordered
FROM
  `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
GROUP BY productSKU
ORDER BY total_ordered DESC #462 skus sold


# join against product inventory to get name
SELECT DISTINCT
  website.productSKU,
  website.total_ordered,
  inventory.name,
  inventory.stockLevel,
  inventory.restockingLeadTime,
  inventory.sentimentScore,
  inventory.sentimentMagnitude,
  SAFE_DIVIDE(website.total_ordered, inventory.stockLevel) AS ratio
FROM
  ecommerce.sales_by_sku_20170801 AS website
  LEFT JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE SAFE_DIVIDE(website.total_ordered,inventory.stockLevel) >= .50
ORDER BY total_ordered DESC;


CREATE OR REPLACE TABLE ecommerce.sales_by_sku_20170802
(
productSKU STRING,
total_ordered INT64
);


INSERT INTO ecommerce.sales_by_sku_20170802
(productSKU, total_ordered)
VALUES('GGOEGHPA002910', 101);

SELECT * FROM ecommerce.sales_by_sku_20170801
UNION ALL
SELECT * FROM ecommerce.sales_by_sku_20170802;

SELECT * FROM `ecommerce.sales_by_sku_2017*`;

SELECT * FROM `ecommerce.sales_by_sku_2017*`
WHERE _TABLE_SUFFIX = '0802';
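
The suffix can also be used as a range filter, which becomes useful once there are many daily tables; a small variation of the query above (my own, not from the lab):

-- scan only the daily tables whose suffix falls in a range
SELECT * FROM `ecommerce.sales_by_sku_2017*`
WHERE _TABLE_SUFFIX BETWEEN '0801' AND '0802';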

ex1_big_query

Creating Date-Partitioned Tables in BigQuery Lab

big_query_module2

## Creating a partitioned table to improve efficiency when querying by date

 CREATE OR REPLACE TABLE ecommerce.partition_by_day
 (
  date_formatted date,
  fullvisitorId INT64
 )
 PARTITION BY date_formatted
 OPTIONS(
   description="a table partitioned by date"
 ) 


INSERT INTO `ecommerce.partition_by_day`
SELECT DISTINCT
 PARSE_DATE("%Y%m%d", date) AS date_formatted,
 fullvisitorId
 FROM `data-to-insights.ecommerce.all_sessions_raw`;

drop table `ecommerce.partition_by_day`;


#standardSQL
 CREATE OR REPLACE TABLE ecommerce.partition_by_day
 PARTITION BY date_formatted
 OPTIONS(
   description="a table partitioned by date"
 ) AS

 SELECT DISTINCT
 PARSE_DATE("%Y%m%d", date) AS date_formatted,
 fullvisitorId
 FROM `data-to-insights.ecommerce.all_sessions_raw`


 #standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2016-08-01';


-- Bytes processed 25.05 KB, Bytes billed 10 MB, Slot milliseconds 26

#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.partition_by_day`
WHERE date_formatted = '2018-07-08';


-- Duration 0 sec, Bytes processed 0 B, Bytes billed 0 B
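
The 2018-07-08 query scans 0 bytes because no partition exists for that date, so BigQuery prunes everything. A quick way to list which partitions a table actually has (my own check on the copy in my ecommerce dataset, not part of the lab) is the INFORMATION_SCHEMA.PARTITIONS view:

-- list the partitions of ecommerce.partition_by_day with their sizes
SELECT partition_id, total_rows, total_logical_bytes
FROM ecommerce.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'partition_by_day'
ORDER BY partition_id;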


-- Syntax to create a partitioned table whose partitions expire after a set number of days
-- standardSQL
 CREATE OR REPLACE TABLE ecommerce.days_with_rain
 PARTITION BY date
 OPTIONS(
   partition_expiration_days = 60,
   description="weather stations with precipitation, partitioned by day"
 ) AS
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter stations/days with no precipitation
   AND _TABLE_SUFFIX >= '2018';
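
To confirm the 60-day expiration is doing its job, a simple follow-up (my own sketch, not part of the lab) is to check how old the oldest retained date is:

-- the oldest date still present should stay within roughly 60 days of today
SELECT
  MIN(date) AS oldest_retained_date,
  DATE_DIFF(CURRENT_DATE(), MIN(date), DAY) AS days_old
FROM ecommerce.days_with_rain;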

Troubleshooting and Solving Data Join Pitfalls

Exercise insights

-- Complete lab: find product names that appear under more than one SKU
WITH a AS (
  SELECT DISTINCT productSKU, v2ProductName
  FROM `data-to-insights.ecommerce.all_sessions_raw`
),
b AS (
  SELECT
    productSKU,
    ROW_NUMBER() OVER (PARTITION BY productSKU) AS rn_prod,
    v2ProductName,
    ROW_NUMBER() OVER (PARTITION BY v2ProductName) AS rn_prodname
  FROM a
)
SELECT * FROM b
WHERE rn_prodname > 1
ORDER BY rn_prodname DESC;


-- Other direction: SKUs that map to more than one product name
SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS product_count,
  ARRAY_AGG(DISTINCT v2ProductName LIMIT 5) AS product_name
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE v2ProductName IS NOT NULL
GROUP BY productSKU
HAVING product_count > 1
ORDER BY product_count DESC;
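
Since one SKU can carry several product names, one way to keep a later join from fanning out (my own sketch, not the lab's exact solution) is to pick a single canonical name per SKU first:

-- one arbitrary name per SKU, so a join on productSKU stays one-to-one
SELECT
  productSKU,
  ARRAY_AGG(DISTINCT v2ProductName IGNORE NULLS LIMIT 1)[SAFE_OFFSET(0)] AS product_name
FROM `data-to-insights.ecommerce.all_sessions_raw`
GROUP BY productSKU;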

#standardSQL
CREATE OR REPLACE TABLE ecommerce.site_wide_promotion AS
SELECT .05 AS discount;

INSERT INTO ecommerce.site_wide_promotion (discount)
VALUES (.04),
       (.03);
SELECT discount FROM ecommerce.site_wide_promotion

#standardSQL
SELECT DISTINCT
productSKU,
v2ProductCategory,
discount
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE v2ProductCategory LIKE '%Clearance%'
AND productSKU = 'GGOEGOLC013299'
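
Since site_wide_promotion now holds three discount rows, the CROSS JOIN pairs every matching website row with each of them. A quick sanity check I used on the row multiplication (my own sketch):

-- row_count is 3x the number of clearance rows, one copy per discount row
SELECT COUNT(*) AS row_count
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
CROSS JOIN ecommerce.site_wide_promotion
WHERE website.v2ProductCategory LIKE '%Clearance%';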

Working with JSON, Arrays, and Structs in BigQuery

Exercise insights

big_query_4

-- example schema used while creating one of the tables
[
    {
        "name": "race",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "participants",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "name",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "splits",
                "type": "FLOAT",
                "mode": "REPEATED"
            }
        ]
    }
]
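
The same structure can be declared directly in DDL, which I find easier to read than the JSON form; a sketch using a hypothetical racing.race_results_ddl table name:

-- DDL equivalent of the JSON schema above:
-- participants is an ARRAY of STRUCTs, and each participant's splits is an ARRAY of FLOAT64
CREATE OR REPLACE TABLE racing.race_results_ddl
(
  race STRING,
  participants ARRAY<STRUCT<name STRING, splits ARRAY<FLOAT64>>>
);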

Some queries to remember the syntax for UNNEST, ARRAY_AGG, and ARRAY_LENGTH:

#standardSQL
SELECT
['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit_array;

#standardSQL
SELECT
['raspberry', 'blackberry', 'strawberry', 'cherry', 1234567] AS fruit_array;
--Array elements of types {INT64, STRING} do not have a common supertype at [3:1]

#standardSQL
SELECT person, fruit_array, total_cost FROM `data-to-insights.advanced.fruit_store`;

SELECT
  fullVisitorId,
  date,
  ARRAY_AGG(v2ProductName) AS products_viewed,
  ARRAY_AGG(pageTitle) AS pages_viewed
  FROM `data-to-insights.ecommerce.all_sessions`
WHERE visitId = 1501570398
GROUP BY fullVisitorId, date
ORDER BY date


SELECT
  fullVisitorId,
  date,
  ARRAY_AGG(DISTINCT v2ProductName) AS products_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT v2ProductName)) AS distinct_products_viewed,
  ARRAY_AGG(DISTINCT pageTitle) AS pages_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT pageTitle)) AS distinct_pages_viewed
  FROM `data-to-insights.ecommerce.all_sessions`
WHERE visitId = 1501570398
GROUP BY fullVisitorId, date
ORDER BY date


SELECT
  visitId,
  hits.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
-- you cannot access a REPEATED field directly from the array; it has to be unnested first

SELECT DISTINCT
  visitId,
  h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10
-- UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)


SELECT
  visitId,
  totals.*,
  device.*
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
LIMIT 10;
-- the .* syntax tells BigQuery to return all fields for that STRUCT (much like it would if totals.* was a separate table we joined against).

#standardSQL
SELECT STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner

#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN
race_results.participants # full STRUCT name


#standardSQL
SELECT race, participants.name
FROM racing.race_results AS r, r.participants

#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p


# The total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST() operator and start with the partially written query below.
SELECT
  p.name,
  SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;


# We see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but not which runner ran that particular lap. Create a query that returns that result.
SELECT
  p.name,
  split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
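
The same building blocks also work in reverse: ARRAY_AGG plus STRUCT can re-nest flat rows back into the race/participants shape. A small self-contained sketch (my own, using literal values so it runs anywhere):

-- re-nest flat rows into an array of structs (one element per flat row)
WITH flat AS (
  SELECT '800M' AS race, 'Rudisha' AS name, 23.4 AS split UNION ALL
  SELECT '800M', 'Rudisha', 26.3 UNION ALL
  SELECT '800M', 'Makhloufi', 24.5
)
SELECT
  race,
  ARRAY_AGG(STRUCT(name, split)) AS participants
FROM flat
GROUP BY race;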

Lab link: GCP BigQuery

Build a Data Warehouse with BigQuery (Challenge)

A challenge lab to apply a few of the concepts covered above.

Prob 1

Solving process

 CREATE OR REPLACE TABLE covid.oxford_policy_tracker
 PARTITION BY date
 OPTIONS(
   partition_expiration_days = 1445,
   description="Covid oxford_policy_tracker"
 ) AS
SELECT *  FROM `bigquery-public-data.covid19_govt_response.oxford_policy_tracker` 
where alpha_3_code not in ('GBR','CAN','BRA','USA');

Prob 2

Solving process

CREATE OR REPLACE TABLE covid_data.country_area_data
AS
SELECT * FROM `bigquery-public-data.census_bureau_international.country_names_area`;

Prob 3

Solving process

CREATE OR REPLACE TABLE covid_data.mobility_data
AS
select * from `bigquery-public-data.covid19_google_mobility.mobility_report`;

Prob 4

Solving process

-- Creating a backup copy, since I am modifying the table in place with new values.

CREATE OR REPLACE TABLE covid_data.oxford_policy_tracker_by_countries_copy
AS 
select * from covid_data.oxford_policy_tracker_by_countries;

-- It is important to always create a backup table before operating on a table in production.
-- I initially made mistakes while deleting, but could recover using the copy table.
-- Rule 1: create a copy of any table before modifying it.

DELETE FROM covid_data.oxford_policy_tracker_by_countries
WHERE population IS NULL OR country_area IS NULL;
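
With the backup copy in place, a quick sanity check (my own sketch) is to compare row counts before and after the DELETE:

-- rows_removed = rows in the backup copy minus rows left after the DELETE
SELECT
  (SELECT COUNT(*) FROM covid_data.oxford_policy_tracker_by_countries_copy) AS rows_before,
  (SELECT COUNT(*) FROM covid_data.oxford_policy_tracker_by_countries) AS rows_after,
  (SELECT COUNT(*) FROM covid_data.oxford_policy_tracker_by_countries_copy)
    - (SELECT COUNT(*) FROM covid_data.oxford_policy_tracker_by_countries) AS rows_removed;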

Completed the course and earned the badge for this module: Certificate

big_query_completion

Cloud Storage Guide: Storage Quickstart

Next, the in-depth analysis is on the storage layer.

Also documenting in Notion: link.

Back to home