SQL learning micro-course on the Kaggle website

Posted by bex1111 on Wed, 15 Sep 2021 02:27:04 +0200

Contents

Start the SQL journey

Exercise 1: count the number of tables in the dataset

Exercise 2: explore the summary of the table

Exercise 3: create a crime map

Select/From/Where simple query

Exercise 1: identify countries that use ppm

Exercise 2: SELECT * to query all fields

Group by & Having & Count

Exercise 1: find out the authors with tens of thousands of comments

Exercise 2: query deleted comments

Order by & Date & Extract

Exercise 1: query government expenditure on education

Exercise 2: find more interesting codes

With & As and common table expressions

Exercise 1: how much does traffic slow down as traffic volume grows

Exercise 2: review the first five rows

Exercise 3: count the total number of trips per year

Exercise 4: count the total number of trips per month in 2017

Exercise 5: trip count and speed per hour

Joining data

Exercise 1: list the tables

Exercise 2: view the table you want

Exercise 3: determine the research direction

Exercise 4: find the answer to the question through a join

Exercise 5: count each user's number of answers

Exercise 6: write a general function to get experts on any topic

Start the SQL journey

The first test of your new data exploration skills uses data describing crime in Chicago.

1. Create a client object to retrieve information

The tutorial says that you will soon see that this client object will play a central role in retrieving information from the BigQuery dataset.

I think it's fascinating anyway.

from google.cloud import bigquery
client = bigquery.Client()

2. First, use the dataset() method to construct a reference to the dataset.

dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")

3. Next, we use the get_dataset() method and the reference just constructed to fetch the dataset.

dataset = client.get_dataset(dataset_ref)

Exercise 1: count the number of tables in the dataset

tables = list(client.list_tables(dataset))

print(len(tables))

Here you can see that list_tables is not where I expected (on the dataset) but on the client. For now I'll understand it as a client operation? Then I wrapped the result in list() again. It's hard to understand; I'll just memorize it first. In the print I checked the length of tables. I couldn't guess what data format tables holds; maybe it received the dataset's list of tables, so tables should now contain an arrangement of table names. I printed tables and this appeared, but I can't make sense of it. It's an object:

[<google.cloud.bigquery.table.TableListItem object at 0x7f6d79d8c590>]

The output is 1
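
In hindsight, each item in that list is a TableListItem, and its table_id attribute holds the table's name. A quick sketch of my own, not from the exercise:

# Each TableListItem exposes the table's name via table_id
for table in tables:
    print(table.table_id)  # in chicago_crime this prints: crime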

Exercise 2: explore the summary of the table

It asks how many rows and columns there are and what information they contain, but the question is: how many columns in the crime table hold TIMESTAMP data? I didn't know what a timestamp was, so I asked Baidu:

A timestamp is a complete, verifiable piece of data that can prove some data existed before a specific point in time. It is usually a character sequence that uniquely identifies a moment.

It seems to be used to mark the time of the last update

In this step I thought I could use tables directly, but I was careless. I need to build a new reference and fetch the table again.

# Construct a reference to the crime table

table_ref = dataset_ref.table("crime")

#API request - get table

table = client.get_table(table_ref)

#Print information on all the columns in the crime table of the Chicago crime dataset

print(table.schema)

I didn't see where there was a timestamp
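
Looking back, the schema does answer the question: every SchemaField carries a field_type, so a short sketch of my own can count the TIMESTAMP columns:

# Count the columns whose BigQuery type is TIMESTAMP
n_timestamp = sum(1 for field in table.schema if field.field_type == "TIMESTAMP")
print(n_timestamp)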

Exercise 3: create a crime map

Topic introduction: if you want to create a map with a point at the location of each crime, which two fields would you need to pull from the crime table to plot the crimes on the map?

This is asking me to check the column names and find the ones for location.

Sure enough, it's list_rows, and the answer is longitude and latitude. I also learned that longitude and latitude are how big data pinpoints a location.

client.list_rows(table, max_results=5).to_dataframe()

A client operation: list rows (table, number of rows to display), then show them as a DataFrame.
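
As a sketch of my own, list_rows can also preview just the location columns by passing selected_fields a list of SchemaFields; the column names latitude and longitude come from the schema printed above:

# Preview only the latitude/longitude columns
location_fields = [f for f in table.schema if f.name in ("latitude", "longitude")]
client.list_rows(table, selected_fields=location_fields, max_results=5).to_dataframe()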

Select/From/Where simple query

Prerequisite: we use only one dataset, pet_records, which has only one table called pets.

The SQL code on this site smells like Python. I suspect it runs inside a Python environment.

The tutorial says that each user can only scan 5TB of content every month, but I really don't get it. Who could use 5TB?

I thought of a big Internet company, JD. If you believe in the power of data, product data on the order of hundreds of millions of records is undoubtedly your dream paradise.

5TB must be billions of rows.

This tutorial seems very concerned with the compute needed to pull data. I think it should be the same in a production environment: we should control our resource usage, or we become the hackers who bring the system down.

Check how many bytes a query will scan

query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job" 
        """

dry_run_config = bigquery.QueryJobConfig(dry_run=True)

dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed)

Limit the size of the queried data

ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

safe_query_job = client.query(query, job_config=safe_config)

safe_query_job.to_dataframe()
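
As I understand it, if a query would scan more than maximum_bytes_billed, BigQuery cancels it with an error instead of billing more; a small try/except sketch of my own makes that failure visible:

# Sketch: a query over the 1 MB cap fails instead of billing more
try:
    print(safe_query_job.to_dataframe().head())
except Exception as e:  # the exact exception class depends on the client version
    print("Query cancelled:", e)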

Exercise 1: identify countries that use ppm

This exercise is about air quality

first_query = """
              SELECT country
              FROM `bigquery-public-data.openaq.global_air_quality`
              WHERE unit = "ppm"
              """

Exercise 2: SELECT * to query all fields

"""
                       SELECT *
                       FROM `bigquery-public-data.openaq.global_air_quality`
                       WHERE value = 0
                       """

Group by&Having&Count 

Exercise 1: find out the authors with tens of thousands of comments

SELECT author, COUNT(1) AS NumPosts
        FROM `bigquery-public-data.hacker_news.comments`
        GROUP BY author
        HAVING COUNT(1) > 10000

Count how many times each author appears, store the count under an alias, GROUP BY author, and keep only the groups with more than 10,000 comments using HAVING.
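
A minimal sketch of my own that runs the query above (the variable names follow the earlier pattern and are my assumption):

prolific_commenters_query = """
        SELECT author, COUNT(1) AS NumPosts
        FROM `bigquery-public-data.hacker_news.comments`
        GROUP BY author
        HAVING COUNT(1) > 10000
        """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
prolific_commenters = client.query(prolific_commenters_query, job_config=safe_config).to_dataframe()
print(prolific_commenters.head())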

Exercise 2: query deleted comments

deleted_posts_query = """
        SELECT COUNT(1) AS num_deleted_posts
        FROM `bigquery-public-data.hacker_news.comments`
        WHERE deleted = True
        """

query_job = client.query(deleted_posts_query)

deleted_posts = query_job.to_dataframe()

# View results
print(deleted_posts)

First, set up the query: filter on deleted = True (a boolean) and store the SQL in a variable.

Then bring out the client to run the query.

Finally, turn the result table into a DataFrame and hand it to a new variable.

Order by&date&extract

A sorting lesson; this time the dataset comes from the World Bank.

Exercise 1: query government expenditure on education

SELECT country_name, AVG(value) AS avg_ed_spending_pct
                          FROM `bigquery-public-data.world_bank_intl_education.international_education`
                          WHERE indicator_code = "SE.XPD.TOTL.GD.ZS" AND year BETWEEN 2010 AND 2017
                          GROUP BY country_name
                          ORDER BY avg_ed_spending_pct DESC

Query the country name and the average value. At first I named the alias avg_values, but it didn't work, which shows the alias name matters to the whole exercise.

The value I printed earlier told me the column name was wrong and did not include the average expenditure.

So what is the difference between avg_values and avg_ed_spending_pct?

The case is solved: the exercise requires me to use the specified column name...
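
The lesson in code form: the alias after AS becomes the column name of the returned DataFrame, and that is exactly what the checker inspects. A sketch, assuming the SQL above is stored in a string variable country_spend_pct_query:

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
spending_results = client.query(country_spend_pct_query, job_config=safe_config).to_dataframe()
print(spending_results.columns)  # expect: country_name, avg_ed_spending_pct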

Exercise 2: find more interesting codes

SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
                   FROM `bigquery-public-data.world_bank_intl_education.international_education`
                   WHERE year = 2016
                   GROUP BY indicator_name, indicator_code
                   HAVING COUNT(1) >= 175
                   ORDER BY COUNT(1) DESC

I don't know what to do...

With & As and common table expressions

CTEs (common table expressions) exist only within the query that creates them and cannot be referenced in later queries. Therefore, any query using a CTE is always split into two parts: (1) first we create the CTE, and then (2) we write a query that uses it.

A CTE returns a temporary table inside a query.

I think it means the whole table is too big, so I break off just the part I need.

Exercise 1: how much does traffic slow down as traffic volume grows

Find data

First, the tutorial told me to type client. and then press the Tab key for an autocomplete prompt. It works.

The first step is to look up the table names

tables = list(client.list_tables(dataset))

for table in tables:
    print(table.table_id)

This is pure Python code.

Exercise 2: review the first five rows

table_ref = dataset_ref.table("taxi_trips")

table = client.get_table(table_ref)

client.list_rows(table, max_results=5).to_dataframe()

The old routine: put the table reference into a variable, fetch it with the client, then list the rows as a DataFrame.

Exercise 3: count the total number of trips per year

SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       GROUP BY year
                       ORDER BY year

The date is handled here with EXTRACT.
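
EXTRACT can pull other parts from the same timestamp too, such as MONTH, DAY, HOUR, or DAYOFWEEK. A hypothetical mini-query of my own as a reminder (not from the exercise):

# Hypothetical reminder: several EXTRACT parts from one timestamp
parts_query = """
              SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
                     EXTRACT(MONTH FROM trip_start_timestamp) AS month,
                     EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS day_of_week
              FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
              LIMIT 5
              """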

Exercise 4: count the total number of trips per month in 2017

rides_per_month_query = """
                       SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2017
                       GROUP BY month
                       ORDER BY month
                       """ 

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config)

rides_per_month_result = rides_per_month_query_job.to_dataframe()

print(rides_per_month_result)

Basically no mistakes. Hahaha. Here's a review of the Python pattern:

Write safe_config first (to make sure the query doesn't waste quota)

Then call client.query

Finally, convert the output to a DataFrame

Exercise 5: trip count and speed per hour

WITH RelevantRides AS
               (
                   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
                          trip_miles, 
                          trip_seconds
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE trip_start_timestamp > '2017-01-01' AND 
                         trip_start_timestamp < '2017-07-01' AND 
                         trip_seconds > 0 AND 
                         trip_miles > 0
               )
               SELECT hour_of_day, 
                      COUNT(1) AS num_trips, 
                      3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
               FROM RelevantRides
               GROUP BY hour_of_day
               ORDER BY hour_of_day

First, in the common table expression, extract the hour and select the mileage and duration, filtering on the date range and requiring positive miles and seconds

Then select the data we need from it: the hour of day, the number of trips, and the computed travel speed

Then sort by hour
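
A minimal sketch of my own to run it, assuming the SQL above is stored in a string variable speeds_query:

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
speeds_result = client.query(speeds_query, job_config=safe_config).to_dataframe()
print(speeds_result)  # one row per hour_of_day, with num_trips and avg_mph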

Joining data

The opening chapter introduces the website Stack Overflow.

Their data is public, and our work is based on this website

Exercise 1: list the tables

tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables] 

First borrow the power of big brother client,

Then traverse tables, access each table name, and write them into a new list

Exercise 2: view the table you want

tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables] 

To describe it: ask the client to list the tables in the dataset and store them in a new variable.

Then read the names out with the comprehension loop.

Exercise 3: determine the research direction

Preview the questions table and the answers table to determine the next research direction.

# Build a reference to the answers table
answers_table_ref = dataset_ref.table("posts_answers")

# get data
answers_table = client.get_table(answers_table_ref)

# Preview the first five rows of data
client.list_rows(answers_table, max_results=5).to_dataframe()

I'm only showing one here, because the method is the same for both tables. This exercise is really good.

Exercise 4: find the answer to the question through the connection

The task requires returning answers related to bigquery.

SELECT a.id, a.body, a.owner_user_id
        FROM `bigquery-public-data.stackoverflow.posts_questions` as q
        INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` as a 
            ON q.id = a.parent_id
        WHERE q.tags LIKE '%bigquery%'

INNER JOIN ... ON q.id = a.parent_id (it feels like VLOOKUP)

Exercise 5: count each user's number of answers

SELECT  a.owner_user_id AS user_id,COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                GROUP BY a.owner_user_id

I missed the GROUP BY here myself. Look it up:

Every field selected in a GROUP BY query must either appear in the GROUP BY clause or be wrapped in an aggregate function (COUNT(), SUM(), MAX(), MIN(), AVG(), etc.); and when a query condition needs a value computed by an aggregate function, grouping (with HAVING) is required as well.

Exercise 6: write a general function to get experts on any topic    

def expert_finder(topic, client):
    '''
    Returns a DataFrame of users who write answers on the specified topic on the site.

    Inputs:
        topic: the topic to query
        client: the BigQuery client

    Outputs:
        results: a DataFrame with user IDs and their number of answers.
    '''
    # Use an f-string so {topic} is substituted into the LIKE pattern
    my_query = f"""
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE '%{topic}%'
               GROUP BY a.owner_user_id
               """

    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
    my_query_job = client.query(my_query, job_config=safe_config)

    results = my_query_job.to_dataframe()

    return results

The whole thing is a function. The structure is similar; the difference is that the function can be called at any time without rewriting the query.
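
A hypothetical usage example (the "pandas" tag is just an illustration):

# Find experts on the pandas tag
pandas_experts = expert_finder("pandas", client)
print(pandas_experts.head())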

Topics: Database SQL