KaggleでSQL-PFへの道④練習問題前半(The Road to SQL Portfolios with Kaggle (4): First Half of the Exercises)

コード訳が手間なので直で載せます(やめるの早)。ブログは記録用なので③からの続きで載せていきます。本日のメニューはこちら。テーブルプレビュー後、練習問題にはいります。

  1. Kaggle Notebook内でテーブルプレビューを表示
  2. 練習問題(1)データセットの表を数える

データはbigquery-public-data内「chicago_crime_data」を、個人データセットに連結したものを使用しています。直で公開データにしなかったのは、個人SQLにつなげてみたかったからです。両者の違いは最初の接続と、SQLのFROM部分だけかと思ってる。


1.Kaggle Notebook内でテーブルプレビューを表示

この chicago_crime_data の中身を頭から5行だけプレビューします。

bigquery_client.list_rows(table, max_results=5).to_dataframe()

Pythonの書き方なんですね(わかってない)。

# List all the tables in the "chicago_crime_data" dataset
tables = list(bigquery_client.list_tables('chicago_crime_data'))

# Print names of all tables in the dataset
for table in tables:
    print(table.table_id)

# Preview the first five lines of the "crime" table
bigquery_client.list_rows(table, max_results=5).to_dataframe()

こちらも同じ結果

# Construct a reference to the "crime" table in my personal dataset
dataset_ref = client.dataset("chicago_crime_data", project="angular-radar-411310")
table_ref = dataset_ref.table("crime")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "crime" table
client.list_rows(table, max_results=5).to_dataframe()

結果

横広。イメージ画像なので読めないくらいがいいか。district が「...」で略されてるのは、中身がないから?
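
ちなみに「...」は中身が空という意味ではなく、列数が多いDataFrameをpandasが表示用に省略しているだけのはず。全列を確認したいときの設定例(あくまで仮のスケッチ、前のセルの bigquery_client と table をそのまま使う前提)。

import pandas as pd

# pandas の表示オプションを変更して、列を省略せずにすべて表示する
pd.set_option("display.max_columns", None)

# さっきと同じプレビューを、今度は全列表示で確認する
bigquery_client.list_rows(table, max_results=5).to_dataframe()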


2.練習問題(1)データセットの表を数える

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client(project="angular-radar-411310")

query = """
SELECT COUNT(case_number) AS total_people
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

query_job = client.query(query)  # Make an API request.

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("num_tables={}".format(row[0]))

下線部が私の個人SQLデータセット、「query = """」~「"""」がSQL部分です。

query_job = client.query(query) # Make an API request.

変数一個しか入れられないのかな?複数だと配列にする?そこまでしてPythonで表示せなんかな。結果テーブル見せたほうが早いかもしらん。
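
ちなみに、集計列を複数SELECTしても配列にする必要はなく、1行のRowからフィールド名でもインデックスでも取り出せるはず。crimeテーブルに year 列がある前提の仮スケッチ。

# 仮の例:集計列を複数SELECTした場合(crimeテーブルに year 列がある前提)
query_multi = """
SELECT
    COUNT(case_number) AS total_cases,
    MIN(year) AS first_year,
    MAX(year) AS last_year
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

for row in client.query(query_multi):
    # フィールド名・インデックス・属性のどれでも取り出せる
    print(row["total_cases"], row[1], row.last_year)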

結果

Please ensure you specify a project id when creating the client in order to use your BigQuery account.
The query data:
num_tables=8013073

計算して出たのは下線部です。あれ、これ答えはどこにあるんだろう。

GoogleのSQLテーブル情報が8013073だから、合ってはいる。
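
ちなみに、見出しどおり「データセットの表(テーブル)の数」を数えるだけなら、1.で使った list_tables がそのまま使えるはず。仮のスケッチ。

# 仮のスケッチ:データセット内のテーブル数そのものを数える場合
tables = list(client.list_tables("chicago_crime_data"))
print("num_tables={}".format(len(tables)))  # このデータセットなら crime の1つになるはず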

別法

# API request - run the query, and return a pandas DataFrame
query_job = client.query(query) # Make an API request.
us_cities = query_job.to_dataframe()
us_cities.total_people.value_counts().head()

結果

total_people
8013073    1
Name: count, dtype: int64

この表示の仕方…こっちだとテーブル見せられそう。


ポートフォリオの見せ方

これからはSQL部分とPython部分が混ざってくるので、書き方を工夫しないとです。SQLポートフォリオを表示したいから、できるだけコードがいいのか、できた結果だけがいいのか。経緯がないとスキルが伝わらないだろうか。

表示のためにPythonで包むことで長くなるならSQLでテーブル完成させてNotebookは表示に特化したほうがいいのかな。

SQL長かったりサブクエリ使ったものがPythonでどこまで表示できるのかやってみないとわからない。
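
試しに、SQLを文字列としてまとめておき、Notebook側は「実行して表示するだけ」にする書き方の仮スケッチ。run_query という関数名はここで仮に付けたものです。

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="angular-radar-411310")

def run_query(sql: str) -> pd.DataFrame:
    """SQLを実行して結果をDataFrameで返すだけの薄いラッパー(仮)。"""
    return client.query(sql).to_dataframe()

# SQL本体はここに文字列としてまとめておく
COUNT_CASES_SQL = """
SELECT COUNT(case_number) AS total_cases
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

run_query(COUNT_CASES_SQL)  # セルの最後に置けば、結果がそのままテーブル表示される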



注)このシリーズはKaggle内エクササイズ「Getting Started With SQL & BigQuery」に沿って進めています。しかし記事内で使われている便利ツール「bq_helper」は廃止予定のため、新しい接続コマンドでKaggle内にPython-SQLを表示させています。


<参考>

Kaggle:「Exercise: Getting Started With SQL & BigQuery」

Select, From & Where(The foundational components for all SQL queries)



********


Translating the code is a hassle, so I am posting it as-is (I gave up on that pretty quickly). This blog is for my own records, so this post continues on from (3).

Here is today's menu. After the table preview, we move on to the exercises.

  1. Show table preview in Kaggle Notebook
  2. Exercise (1): Counting the tables in a dataset

The data is "chicago_crime_data" from bigquery-public-data, linked into my personal dataset. I did not use the public data directly because I wanted to try connecting to my own SQL dataset. I think the only differences between the two are the initial connection and the FROM clause in the SQL.


1. Show table preview in Kaggle Notebook

Preview just the first five rows of this chicago_crime_data.

bigquery_client.list_rows(table, max_results=5).to_dataframe()

# List all the tables in the "chicago_crime_data" dataset
tables = list(bigquery_client.list_tables('chicago_crime_data'))

# Print names of all tables in the dataset
for table in tables:
    print(table.table_id)

# Preview the first five lines of the "crime" table
bigquery_client.list_rows(table, max_results=5).to_dataframe()

Same results here.

# Construct a reference to the "crime" table in my personal dataset
dataset_ref = client.dataset("chicago_crime_data", project="angular-radar-411310")
table_ref = dataset_ref.table("crime")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "crime" table
client.list_rows(table, max_results=5).to_dataframe()

2. Exercise (1): Counting the tables in a dataset

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client(project="angular-radar-411310")

query = """
SELECT COUNT(case_number) AS total_people
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

query_job = client.query(query)  # Make an API request.

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("num_tables={}".format(row[0]))

The underlined part is my personal SQL dataset, and everything from query = """ to the closing """ is the SQL part.

query_job = client.query(query) # Make an API request.

Can I only put in one variable? If there is more than one, do I use an array? Do I really have to go that far just to display it with Python? It might be faster to just show the result table.
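
Incidentally, it seems you do not need an array: even if the SELECT returns several aggregate columns, each Row can be read by field name, index, or attribute. A rough sketch, assuming the crime table has a year column.

# Rough sketch: selecting several aggregate columns at once (assumes the crime table has a "year" column)
query_multi = """
SELECT
    COUNT(case_number) AS total_cases,
    MIN(year) AS first_year,
    MAX(year) AS last_year
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

for row in client.query(query_multi):
    # Each value can be read by field name, index, or attribute
    print(row["total_cases"], row[1], row.last_year)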

Results

Please ensure you specify a project id when creating the client in order to use your BigQuery account.
The query data:
num_tables=8013073

What came out of the calculation is the underlined part. Wait, where is the answer supposed to be?

The table information on Google's side also shows 8013073, so the number itself is right.
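
Incidentally, if the goal is literally to count the tables in the dataset, as the heading says, the list_tables call from section 1 should be enough. A rough sketch:

# Rough sketch: count the number of tables in the dataset itself
tables = list(client.list_tables("chicago_crime_data"))
print("num_tables={}".format(len(tables)))  # for this dataset it should just be the "crime" table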


Alternative method

# API request - run the query, and return a pandas DataFrame
query_job = client.query(query) # Make an API request.
us_cities = query_job.to_dataframe()
us_cities.total_people.value_counts().head()

Results

total_people
8013073    1
Name: count, dtype: int64

The way this one is displayed... with this approach, it looks like I could show an actual table.


How to show the portfolio

From here on, the SQL parts and the Python parts will be mixed together, so I need to think about how I write them. Since what I want to show is an SQL portfolio, is it better to show as much of the code as possible, or just the finished results? Without the process, would my skills even come across?

If wrapping it in Python for display makes it longer, I wonder if it would be better to complete the table in SQL and let the notebook focus on display.

I will not know how well long SQL queries or subqueries can be displayed through Python until I try it.
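
As a rough sketch (the run_query name is just something I made up here), one option is to keep the SQL together as strings and let the notebook cells do nothing but run and display it.

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="angular-radar-411310")

def run_query(sql: str) -> pd.DataFrame:
    """Thin wrapper (hypothetical): run the SQL and return the result as a DataFrame."""
    return client.query(sql).to_dataframe()

# Keep the SQL itself together as plain strings
COUNT_CASES_SQL = """
SELECT COUNT(case_number) AS total_cases
FROM `angular-radar-411310.chicago_crime_data.crime`
"""

run_query(COUNT_CASES_SQL)  # As the last line of a cell, the result shows up as a table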


Note: This series follows the Getting Started with SQL & BigQuery exercise in Kaggle. However, the "bq_helper" utility used in the article is deprecated, so a new connection command is used to display Python-SQL inside Kaggle.


<References>

Kaggle:「Exercise: Getting Started With SQL & BigQuery」

Select, From & Where(The foundational components for all SQL queries)
