Catalogue
Start the SQL journey
Exercise 1: count the number of tables in the dataset
Exercise 2: explore the summary of the table
Exercise 3: create a crime map
Select/From/Where simple query
Exercise 1: identify countries that use ppm
Exercise 2: SELECT * to query all fields
Group by & Having & Count
Exercise 1: find the authors with more than 10,000 comments
Exercise 2: query deleted comments
Order by & Dates & Extract
Exercise 1: query government expenditure on education
Exercise 2: find more interesting indicator codes
With & As and common table expressions
Exercise 1: how much does traffic slow down when trip volume goes up
Exercise 2: preview the first five rows
Exercise 3: count the total number of trips per year
Exercise 4: count the total number of trips per month in 2017
Exercise 5: trip count and speed per hour
Joining data
Exercise 1: list the tables
Exercise 2: preview the table you want to look at
Exercise 3: determine the research direction
Exercise 4: find answers to questions with a join
Exercise 5: count the number of answers per user
Exercise 6: write a general function to get experts on any topic
Start the SQL journey
The first test of your new data exploration skills uses data describing crime in Chicago.
1. Create a client object to retrieve information
The tutorial says we will soon see that this client object plays a central role in retrieving information from BigQuery datasets.
I find that fascinating anyway.
from google.cloud import bigquery

client = bigquery.Client()
2. First, use the dataset() method to construct a reference to the dataset.
dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")
3. Next, use the get_dataset() method and the reference we just constructed to fetch the dataset.
dataset = client.get_dataset(dataset_ref)
Exercise 1: count the number of tables in the dataset
tables = list(client.list_tables(dataset))
print(len(tables))
Here list_tables is not a method of the dataset as I expected, but of the client. For now I understand it as "the client does the work", and the result gets wrapped in list(). It's hard to grasp, so I'll just memorize it for now. I printed the length of tables, but I couldn't guess what data format tables is. Presumably it received the list of tables in the dataset, so it should hold the table names. When I print tables itself, this is what appears, and I can't really read it. It's an object.
[<google.cloud.bigquery.table.TableListItem object at 0x7f6d79d8c590>]
The output is 1
Exercise 2: explore the summary of the table
The exercise is about how many rows and columns there are and what information they contain, but the actual question is: how many columns in the crime table hold timestamp data? I didn't know what a timestamp was, so I looked it up on Baidu.
A timestamp is complete, verifiable data that proves a piece of data existed before a specific point in time; it is usually a character sequence that uniquely identifies a particular moment.
It seems to be used to mark when something was last updated.
In this step I thought I could use the tables list directly, but I was careless: I need to build a new table reference and fetch the table again.
# Construct a reference to the crime table
table_ref = dataset_ref.table("crime")
# API request - fetch the table
table = client.get_table(table_ref)
# Print information for all columns in the crime table of the Chicago crime dataset
print(table.schema)
At first I couldn't see where the timestamps were in the schema printout.
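Since the question is how many columns hold timestamp data, here is a small sketch of my own (not from the tutorial) that counts them straight from the schema:

# Each entry in table.schema is a SchemaField with a .name and a .field_type
timestamp_fields = [field.name for field in table.schema if field.field_type == "TIMESTAMP"]
print(timestamp_fields)        # the column names holding timestamps
print(len(timestamp_fields))   # the number the exercise asks for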
Exercise 3: create a crime map
Topic introduction: if you wanted to create a map with a point at the location of each crime, which two fields would you pull from the crime table to plot each crime on the map?
This asks me to check the column names and find the ones that give the location.
Sure enough, list_rows does it; the answer is latitude and longitude. I also learned that latitude and longitude are how location is recorded in this data.
client.list_rows(table, max_results=5).to_dataframe()
The client does the work: list_rows(table, max_results=number of rows) shows the rows, and to_dataframe() displays them as a DataFrame.
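As a follow-up, list_rows can also pull just the location columns instead of whole rows; a sketch of mine, assuming the fields are literally named latitude and longitude:

# Keep only the schema fields we care about, then preview just those columns
location_fields = [field for field in table.schema if field.name in ("latitude", "longitude")]
client.list_rows(table, selected_fields=location_fields, max_results=5).to_dataframe()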
Select/From/Where simple query
Prerequisite: we use a single dataset, pet_records, which contains only one table called pets.
The SQL code on this site smells like Python; I suspect it all runs inside a Python environment.
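To answer my own doubt: the SQL really is just a Python string handed to the client. A minimal sketch using the tutorial's toy pets table (pet_records is only an illustrative dataset in the tutorial, so this exact query may not actually run):

# The query is a plain Python string; BigQuery only sees the SQL inside it
pets_query = """
             SELECT Name
             FROM `bigquery-public-data.pet_records.pets`
             WHERE Animal = 'Cat'
             """
query_job = client.query(pets_query)   # submit the job
cats = query_job.to_dataframe()        # collect results as a pandas DataFrame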
The tutorial says each user can only scan 5 TB of data per month, but I really don't get it: who could ever use 5 TB?
Then I thought of a big internet company like JD. If you believe in the power of data, product data on the order of hundreds of millions of rows is undoubtedly your dream playground.
5 TB must mean billions of rows.
This tutorial cares a lot about the compute needed to pull data, and I think production environments must be the same: we should control resource usage, otherwise we're no better than attackers grinding the system down.
Check how many bytes a query will scan
query = """ SELECT score, title FROM `bigquery-public-data.hacker_news.full` WHERE type = "job" """ dry_run_config = bigquery.QueryJobConfig(dry_run=True) dry_run_query_job = client.query(query, job_config=dry_run_config) print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed)
Limit the size of the queried data
ONE_MB = 1000 * 1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)
safe_query_job = client.query(query, job_config=safe_config)
safe_query_job.to_dataframe()
Exercise 1: identify countries that use ppm
This exercise is about air quality
first_query = """ SELECT country FROM `bigquery-public-data.openaq.global_air_quality` WHERE unit = "ppm" """
Exercise 2: SELECT * to query all fields
""" SELECT * FROM `bigquery-public-data.openaq.global_air_quality` WHERE value = 0 """
Group by & Having & Count
Exercise 1: find the authors with more than 10,000 comments
SELECT author, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY author
HAVING COUNT(1) > 10000
Count how many times each author appears and store it in the alias NumPosts, group by author, then use HAVING to keep only the authors with more than 10,000.
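To actually run it, the same pattern as before applies; a sketch assuming the SQL above is stored in a string called prolific_commenters_query (my name, not given in the text):

# Cap the scanned bytes, submit the query, and collect the result
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)
prolific_commenters = query_job.to_dataframe()
print(prolific_commenters.head())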
Exercise 2: query deleted comments
deleted_posts_query = """
                      SELECT COUNT(1) AS num_deleted_posts
                      FROM `bigquery-public-data.hacker_news.comments`
                      WHERE deleted = True
                      """
query_job = client.query(deleted_posts_query)
deleted_posts = query_job.to_dataframe()
# View results
print(deleted_posts)
First, write the query: filter on deleted = True (a boolean) and store the query string in a variable.
Then hand the query to the client with client.query().
Finally, turn the result into a DataFrame and store it in a new variable.
Order by & Dates & Extract
This one is about sorting; this time the dataset comes from the World Bank.
Exercise 1: query government expenditure on education
SELECT country_name, AVG(value) AS avg_ed_spending_pct
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_code = "SE.XPD.TOTL.GD.ZS" AND year BETWEEN 2010 AND 2017
GROUP BY country_name
ORDER BY avg_ed_spending_pct DESC
Query the country name and the average value. At first I named the alias avg_values, but that didn't pass, which shows the alias matters for the rest of the process.
The earlier output told me the column name was wrong and the average-spending column was missing.
So what is the difference between avg_values and avg_ed_spending_pct?
Case closed: the exercise expects the specified column name, so the alias has to be avg_ed_spending_pct.
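A tiny sketch of why the alias matters: the DataFrame column ends up literally named after the alias, so anything looking for avg_ed_spending_pct won't find avg_values (country_spend_pct_query is my name for the SQL string above):

# Run the query and check what the result columns are actually called
country_spending_results = client.query(country_spend_pct_query).to_dataframe()
print(country_spending_results.columns.tolist())   # expect ['country_name', 'avg_ed_spending_pct']
print(country_spending_results["avg_ed_spending_pct"].head())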
Exercise 2: find more interesting indicator codes
SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE year = 2016
GROUP BY indicator_name, indicator_code
HAVING COUNT(1) >= 175
ORDER BY COUNT(1) DESC
I don't know what to do...
With & As and common table expressions
CTEs (common table expressions) only exist inside the query that creates them and cannot be referenced by later queries. So any query using a CTE is always in two parts: (1) first we create the CTE, then (2) we write a query that uses it.
A CTE returns a temporary table inside a query.
The way I think of it: the whole table is too big, so I carve off the piece I need.
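A minimal sketch of the two-part shape, again on the tutorial's toy pets table (illustrative names only, so it may not actually run):

# (1) WITH ... AS creates the temporary table, (2) the final SELECT uses it
cte_query = """
            WITH Seniors AS
            (
                SELECT ID, Name
                FROM `bigquery-public-data.pet_records.pets`
                WHERE Years_old > 5
            )
            SELECT ID, Name
            FROM Seniors
            """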
Exercise 1: how much does traffic slow down when trip volume goes up
Find data
First, the tutorial told me to type client. and then press the Tab key, and a completion prompt appears. Nice.
The first step is to look up the table name
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
This is pure python code
Exercise 2: preview the first five rows
table_ref = dataset_ref.table("taxi_trips")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()
The same old routine: build the table reference and store it in a variable, fetch the table with the client, then list the rows as a DataFrame.
Exercise 3: count the total number of trips per year
SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year,
       COUNT(1) AS num_trips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY year
ORDER BY year
The date is handled here with EXTRACT.
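For reference, EXTRACT can pull other parts of a timestamp too; a small sketch of my own, not part of the exercise:

# Other parts EXTRACT understands: MONTH, DAY, HOUR, DAYOFWEEK, and so on
extract_examples = """
                   SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month,
                          EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS day_of_week,
                          EXTRACT(HOUR FROM trip_start_timestamp) AS hour
                   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   LIMIT 5
                   """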
Exercise 4: count the total number of trips per month in 2017
rides_per_month_query = """
                        SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month,
                               COUNT(1) AS num_trips
                        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                        WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2017
                        GROUP BY month
                        ORDER BY month
                        """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config)
rides_per_month_result = rides_per_month_query_job.to_dataframe()
print(rides_per_month_result)
Basically no mistakes this time. Hahaha. A quick review of the Python workflow:
Write safe_config first (so the query can't blow the quota),
then call client.query() with the query and that config,
and finally convert the output to a DataFrame.
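Since these three steps keep repeating, a small helper of my own (not from the tutorial) could wrap them:

def run_query(client, query, max_gb_scanned=10):
    # Cap the bytes billed, run the query, and hand back a DataFrame
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=int(max_gb_scanned * 10**9))
    return client.query(query, job_config=safe_config).to_dataframe()

rides_per_month_result = run_query(client, rides_per_month_query)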
Exercise 5: trip count and speed per hour
WITH RelevantRides AS
(
    SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
           trip_miles,
           trip_seconds
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE trip_start_timestamp > '2017-01-01'
      AND trip_start_timestamp < '2017-07-01'
      AND trip_seconds > 0
      AND trip_miles > 0
)
SELECT hour_of_day,
       COUNT(1) AS num_trips,
       3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY hour_of_day
First, the CTE extracts the hour, selects the mileage and duration, and applies the date condition plus the positive mileage and duration conditions.
Then the outer query selects what we need: the hour of day, the number of trips, and the computed average speed (3600 * miles / seconds gives mph).
Finally it sorts by hour of day.
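To actually answer the heading question (does heavier traffic mean slower traffic?), run the CTE query and compare num_trips with avg_mph; a sketch assuming the SQL above is stored in a string called speeds_query (my name):

speeds_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
speeds_result = client.query(speeds_query, job_config=speeds_config).to_dataframe()
# Busiest hours next to their average speed: a quick look answers the question
print(speeds_result.sort_values("num_trips", ascending=False).head(10))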
Joining data
The opening chapter introduces the website Stack Overflow.
Its data is public, and this lesson's work is based on it.
Exercise 1: list the tables
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
First use the client's powers to list the tables in the dataset,
then loop over tables, read each table_id, and collect them into a new list.
Exercise 2: preview the table you want to look at
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
To describe it: ask the client to list the tables in the dataset and store them in a new variable,
then read the table ids out with a for loop (here, a list comprehension).
Exercise 3: determine the research direction
Preview the questions table and the answers table to decide the next research direction.
# Build a reference to the answers table
answers_table_ref = dataset_ref.table("posts_answers")
# API request - fetch the table
answers_table = client.get_table(answers_table_ref)
# Preview the first five rows of data
client.list_rows(answers_table, max_results=5).to_dataframe()
I'm only showing one of the two tables here because the method is exactly the same. This problem is a good one.
Exercise 4: find answers to questions with a join
The task is to return answers related to bigquery.
SELECT a.id, a.body, a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
INNER JOIN ... ON q.id = a.parent_id (it feels like VLOOKUP)
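To make the VLOOKUP analogy concrete, here is the same kind of join on two tiny made-up tables in pandas (an illustration of mine, not BigQuery):

import pandas as pd

questions = pd.DataFrame({"id": [1, 2],
                          "title": ["How do I use BigQuery?", "How do I join tables?"]})
answers = pd.DataFrame({"parent_id": [1, 1, 2],
                        "body": ["Use the client.", "Read the docs.", "Use INNER JOIN."]})

# Same idea as INNER JOIN ... ON q.id = a.parent_id: match each answer to its question
joined = questions.merge(answers, left_on="id", right_on="parent_id", how="inner")
print(joined)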
Exercise 5: count the number of answers per user
SELECT a.owner_user_id AS user_id,
       COUNT(1) AS number_of_answers
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
GROUP BY a.owner_user_id
I missed the GROUP BY here myself. Worth noting:
When a query mixes ordinary table fields with aggregate functions (COUNT(), SUM(), MAX(), MIN(), AVG(), etc.), every selected field that is not inside an aggregate must appear in GROUP BY.
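A tiny illustration of the rule (queries of my own, not from the exercise):

# Fine: author is in GROUP BY and COUNT(1) is aggregated
good_query = """
             SELECT author, COUNT(1) AS num_posts
             FROM `bigquery-public-data.hacker_news.comments`
             GROUP BY author
             """
# Not fine: ranking is neither aggregated nor grouped, so BigQuery rejects the query
# SELECT author, ranking, COUNT(1) FROM `bigquery-public-data.hacker_news.comments` GROUP BY author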
Exercise 6: write a general function to get experts on any topic
def expert_finder(topic, client):
    '''Return users who write answers on the specified topic on the website.

    Inputs:
        topic: tag to query
        client: a bigquery.Client
    Output:
        results: a DataFrame with user ids and their answer counts
    '''
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE '%{topic}%'
               GROUP BY a.owner_user_id
               """.format(topic=topic)  # fill in the topic so the function really is general

    # Cap the bytes billed, run the query, and return the result as a DataFrame
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
    my_query_job = client.query(my_query, job_config=safe_config)
    results = my_query_job.to_dataframe()
    return results
That's the whole function. The structure is the same as before; the difference is that, wrapped in a function, it can be called at any time instead of rewriting the query each time.
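A quick usage sketch (the topic string is just an example):

# Find prolific answerers for any tag, e.g. pandas
pandas_experts = expert_finder("pandas", client)
print(pandas_experts.sort_values("number_of_answers", ascending=False).head())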