# SQL learning micro-course on the Kaggle website

Posted by bex1111 on Wed, 15 Sep 2021 02:27:04 +0200

Contents

Start the SQL journey

Exercise 1: count the number of tables in the dataset

Exercise 2: explore the table schema

Exercise 3: create a crime map

SELECT / FROM / WHERE: simple queries

Exercise 1: identify countries that use ppm

Exercise 2: SELECT * to query all fields

GROUP BY, HAVING & COUNT

Exercise 1: find authors with more than 10,000 comments

Exercise 2: query deleted comments

ORDER BY, dates & EXTRACT

Exercise 1: query government spending on education

Exercise 2: find other interesting indicator codes

WITH & AS and common table expressions

Exercise 1: how much does traffic slow down as volume increases

Exercise 2: preview the first five rows

Exercise 3: count the total number of trips per year

Exercise 4: count the total number of trips per month in 2017

Exercise 5: number of trips and average speed per hour

Joining data

Exercise 1: list the tables

Exercise 2: preview the table you want to examine

Exercise 3: determine the research direction

Exercise 4: find answers to questions with a JOIN

Exercise 5: count the number of answers per user

Exercise 6: write a general function to get experts on any topic

# Start the SQL journey

The first test of your new data exploration skills uses data describing crime in Chicago.

1. Create a client object to retrieve information

The tutorial says we will soon see that this client object plays a central role in retrieving information from BigQuery datasets.

I find that fascinating anyway.

`client = bigquery.Client()`

2. Use the `dataset()` method to construct a reference to the dataset.

`dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")`

3. Next, use the `get_dataset()` method and the reference we just constructed to fetch the dataset.

`dataset = client.get_dataset(dataset_ref)`
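Putting the three steps together, a minimal sketch of the setup; the only line not shown above is the import, which is how the Kaggle notebook environment loads the BigQuery client library:

```
# Import the BigQuery client library (available in the Kaggle notebook environment)
from google.cloud import bigquery

# 1. Create a client object
client = bigquery.Client()

# 2. Construct a reference to the "chicago_crime" dataset
dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")

# 3. API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
```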

## Exercise 1: count the number of tables in the dataset

```
tables = list(client.list_tables(dataset))
print(len(tables))
```

Here `list_tables` is not quite what I expected: it hangs off `client` rather than off the dataset. For now I understand it as an operation the client performs; I wrap the result in `list()` and then check its length with `print`. I couldn't guess what format `tables` would be in. Since it received the list of the dataset's tables, `tables` should now hold the table entries. When I print `tables`, this is what appears, and it is hard to read; it is an object:

`[<google.cloud.bigquery.table.TableListItem object at 0x7f6d79d8c590>]`

The output is 1
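To make that object readable, it helps to pull out each table's `table_id` (the same trick the joining-data section uses later); a small sketch:

```
# Each TableListItem exposes the table's name through .table_id
for table in tables:
    print(table.table_id)   # the chicago_crime dataset has a single table, "crime"
```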

## Exercise 2: explore the table schema

This is about how many rows and columns the table has and what information they contain. The actual question is: how many columns in the crime table hold TIMESTAMP data? I didn't know what a timestamp was, so I searched Baidu:

A timestamp is a complete, verifiable piece of data that proves something existed before a specific point in time; it is usually a character sequence that uniquely identifies a particular moment.

It seems to be used to mark when something was last updated.

In this step I thought I could use `tables` directly, but I was careless: I need to build a new reference and fetch the table again.

Construct a reference to the crimes table

`table_ref = dataset_ref.table("crime")`

API request: fetch the table

`table = client.get_table(table_ref)`

Print information for all columns in the crime table of the Chicago crime dataset

`print(table.schema)`

I didn't see where there was a timestamp
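A small sketch of how one might count them from the schema, relying on the `field_type` attribute that the BigQuery `SchemaField` objects in `table.schema` carry:

```
# List the columns whose type is TIMESTAMP, then count them
timestamp_fields = [field.name for field in table.schema
                    if field.field_type == "TIMESTAMP"]
print(timestamp_fields)
print(len(timestamp_fields))
```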

## Exercise 3: create a crime map

Problem statement: if you want to create a map with a point at the location of each crime, which two fields would you pull out of the crime table to plot the crimes on the map?

This is asking me to check the column names and find the ones that describe location.

Sure enough, `list_rows` does the trick, and the answer is latitude and longitude. I also learned that latitude and longitude are what pin each record to a location on the map.

`client.list_rows(table, max_results=5).to_dataframe()`

It is the client doing the work: `list_rows(table, max_results=...)` shows a limited number of rows, and `.to_dataframe()` presents them as a DataFrame.
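For example, one could keep just the two location columns from that preview before plotting; a sketch (the column names come from the schema printed above):

```
# Preview a few rows, then keep only the two location columns for the map
preview = client.list_rows(table, max_results=5).to_dataframe()
locations = preview[["latitude", "longitude"]]
print(locations)
```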

# SELECT / FROM / WHERE: simple queries

Prerequisite: we use only one dataset, `pet_records`, which contains a single table called `pets`.

The SQL code on this site has a strong Python flavor; I suspect it all runs inside a Python environment.

The tutorial says each user can only scan 5 TB of data per month, which I found hard to picture at first; who could possibly use 5 TB?

Then I thought of a big Internet company like JD.com. If you believe in the power of data, product data on the order of hundreds of millions of records is undoubtedly your dream playground.

5 TB must work out to billions of rows.
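A rough back-of-envelope check of that claim (the ~1 KB per row figure is purely an assumption, just to get an order of magnitude):

```
# How many rows fit in 5 TB if each row is roughly 1 KB?
bytes_in_5tb = 5 * 10**12
approx_bytes_per_row = 10**3                  # assumed average row size
print(bytes_in_5tb // approx_bytes_per_row)   # 5,000,000,000 rows -> billions
```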

This tutorial clearly cares about the compute needed to pull data out. I think the same holds in a production environment: we should keep resource usage under control, or we end up behaving like attackers and wrecking the system.

Check how many bytes a query would process:

```
query = """
SELECT score, title
FROM `bigquery-public-data.hacker_news.full`
WHERE type = "job"
"""

# Set up the query in dry-run mode to estimate how much data it would scan
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))
```

Limit how much data the query is allowed to scan:

```
ONE_MB = 1000 * 1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

safe_query_job = client.query(query, job_config=safe_config)
safe_query_job.to_dataframe()
```

## Exercise 1: identify countries that use ppm

This exercise is about air quality

```
first_query = """
SELECT country
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE unit = "ppm"
"""
```

## Exercise 2: SELECT * to query all fields

```"""
SELECT *
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE value = 0
"""```

# GROUP BY, HAVING & COUNT

## Exercise 1: find authors with more than 10,000 comments

```
SELECT author, COUNT(1) AS NumPosts
-- the FROM line was missing in my notes; the exercise queries the Hacker News comments table
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY author
HAVING COUNT(1) > 10000
```

Count how many times each author appears, store the count under an alias, and group the rows by author; HAVING then keeps only the groups with more than 10,000 comments.

## Exercise 2: query deleted comments

```
deleted_posts_query = """
SELECT COUNT(1) AS num_deleted_posts
-- the FROM line was missing in my notes; same Hacker News comments table as above
FROM `bigquery-public-data.hacker_news.comments`
WHERE deleted = True
"""

query_job = client.query(deleted_posts_query)

deleted_posts = query_job.to_dataframe()

# View results
print(deleted_posts)
```

First, write the query: filter on `deleted` (a boolean column) and store the SQL string in a variable.

Then bring the client in and run the query.

Finally, collect the result into a DataFrame and hand it to the final variable.

# ORDER BY, dates & EXTRACT

This one is about sorting; the dataset this time comes from the World Bank.

## Exercise 1: query government spending on education

```
SELECT country_name, AVG(value) AS avg_ed_spending_pct
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_code = "SE.XPD.TOTL.GD.ZS" AND year BETWEEN 2010 AND 2017
GROUP BY country_name
ORDER BY avg_ed_spending_pct DESC
```

Query the country name and the average value. At first I named the alias avg_values, but that doesn't pass, which shows the alias name matters for the whole exercise.

The feedback on my earlier output said the column name was wrong and did not include the average spending.

So what is the difference between avg_values and avg_ed_spending_pct?

Case solved: the exercise requires using the specified column name.
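A small check makes the difference visible; a sketch (the variable names are illustrative, and the SQL string is simply the query from the block above):

```
# Wrap the SQL from the block above in a string and run it
country_spend_pct_query = """
SELECT country_name, AVG(value) AS avg_ed_spending_pct
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_code = "SE.XPD.TOTL.GD.ZS" AND year BETWEEN 2010 AND 2017
GROUP BY country_name
ORDER BY avg_ed_spending_pct DESC
"""

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
spending_results = client.query(country_spend_pct_query, job_config=safe_config).to_dataframe()

# The alias after AS becomes the DataFrame column name,
# which is why the checker accepts avg_ed_spending_pct but rejects avg_values
print(spending_results.columns)
```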

## Exercise 2: find other interesting indicator codes

```
SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE year = 2016
GROUP BY indicator_name, indicator_code
HAVING COUNT(1) >= 175
ORDER BY COUNT(1) DESC
```

I don't know what to do...

# WITH & AS and common table expressions

CTEs (common table expressions) only exist in the queries that create them and cannot be referenced in future queries. Therefore, any query using CTEs is always divided into two parts: (1) first, we create CTEs, and then (2) we write a query using CTEs.

A CTE returns a temporary table inside a query.

I think it means the whole table is too big, so I break off just the piece I need.
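As a structural sketch of that two-part shape (the `pets` table, its columns, and the `Seniors` name below are illustrative placeholders in the spirit of the tutorial's toy dataset, not a real BigQuery table):

```
# (1) define the CTE, (2) query it - both inside a single SQL string
cte_query = """
WITH Seniors AS
(
    SELECT ID, Name
    FROM `pet_records.pets`
    WHERE Years_old > 5
)
SELECT ID
FROM Seniors
"""
```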

## Exercise 1: how much does traffic slow down as volume increases

Finding the data.

First, the hint says to type `client.` and then press the Tab key; an autocomplete prompt appears. Good to know.

The first step is to look up the table names.

```
tables = list(client.list_tables(dataset))

for table in tables:
    print(table.table_id)
```

This is pure Python code.

## Exercise 2: preview the first five rows

```
table_ref = dataset_ref.table("taxi_trips")

table = client.get_table(table_ref)

client.list_rows(table, max_results=5).to_dataframe()
```

Same old routine: build the table reference and store it in a variable, fetch the table with the client, then list the rows into a DataFrame.

## Exercise 3: count the total number of trips per year

```
SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
       COUNT(1) AS num_trips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY year
ORDER BY year
```

The date is handled here with EXTRACT.

## Exercise 4: count the total number of trips per month in 2017

```
rides_per_month_query = """
SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month,
       COUNT(1) AS num_trips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2017
GROUP BY month
ORDER BY month
"""

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config)

rides_per_month_result = rides_per_month_query_job.to_dataframe()

print(rides_per_month_result)
```

Basically no mistakes this time, hahaha. Here is a review of how the Python side is written:

Write safe_config first (to make sure the query doesn't waste the quota).

Finally, the output is a DataFrame.

## Exercise 5: number of trips and average speed per hour

```
WITH RelevantRides AS
(
    SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
           trip_miles,
           trip_seconds
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE trip_start_timestamp > '2017-01-01' AND
          trip_start_timestamp < '2017-07-01' AND
          trip_seconds > 0 AND
          trip_miles > 0
)
SELECT hour_of_day,
       COUNT(1) AS num_trips,
       3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY hour_of_day
```

First, inside the common table expression, extract the hour and select the trip miles and trip seconds, keeping only rides in the date range with positive miles and seconds.

Then, in the outer query, select the hour of day and the number of trips, and compute the average travel speed.

Finally, sort by hour of day.

# Joining data

The opening section introduces the Stack Overflow website.

Their data is public, and our exercises are built on it.

## Exercise 1: list the tables

```
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
```

First, use the client's powers to list the tables in the dataset.

Then iterate over `tables`, pull out each table's `table_id`, and write them into a new list.

## Exercise 2: preview the table you want to examine

```
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
```

To describe it: ask the client to list the tables in the dataset and store them in a new variable,

then read the table names out into a new list with the inline for loop (the list comprehension).

## Exercise 3: determine the research direction

Preview the questions table and the answers table to decide the next research direction.

```
# Build a reference to the answers table

# API request - fetch the table

# Preview the first five rows of data
```

I am only showing one here, because it is the same method for both tables. This exercise is really good.
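A sketch of what those three comment lines might expand to, assuming the answers table is the `posts_answers` table of the Stack Overflow dataset and that `dataset_ref` points at that dataset (the same pattern used for taxi_trips earlier):

```
# Build a reference to the answers table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five rows of data
client.list_rows(answers_table, max_results=5).to_dataframe()
```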

## Exercise 4: find answers to questions with a JOIN

The task is to return answers related to bigquery.

```
SELECT a.id, a.body, a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
-- the join line was missing in my notes; the answers come from the posts_answers table
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
```

INNER JOIN ... ON q.id = a.parent_id (it feels like VLOOKUP).

## Exercise 5: count the number of answers per user

```
SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
-- the join line was missing in my notes; same posts_answers table as above
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
GROUP BY a.owner_user_id
```

I missed the GROUP BY here myself; worth reviewing.

Any column in the SELECT list that is not wrapped in an aggregate function (COUNT(), SUM(), MAX(), MIN(), AVG(), etc.) must appear in the GROUP BY clause, and grouping is also needed when an aggregate result is used as a filter condition (with HAVING).

## Exercise 6: write a general function to get experts on any topic

```
def expert_finder(topic, client):
    '''
    Returns a DataFrame of users who write answers on the given topic on the site.

    Inputs:
        topic: topic to query
        client: BigQuery client

    Outputs:
        results: a DataFrame with user ids and answer counts.
    '''
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               -- the join line was missing in my notes; same posts_answers table as above
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE '%{topic}%'
               GROUP BY a.owner_user_id
               """.format(topic=topic)  # substitute the topic into the tag filter

    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
    my_query_job = client.query(my_query, job_config=safe_config)

    results = my_query_job.to_dataframe()

    return results
```

That is the whole thing as a function. The structure is similar to before; the difference is that the function can be called at any time without rewriting the query.
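A usage sketch (assuming the `client` object created earlier and "bigquery" as the topic):

```
# Example call: find users who answer bigquery-tagged questions
bigquery_experts = expert_finder("bigquery", client)
print(bigquery_experts.head())
```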

Topics: Database SQL