The Road to SQL Portfolios with Kaggle (5): Second Half of the Exercises

What ends up being a hassle: I don't know where the example solutions to the exercises are, so I can't check my answers and end up verifying them with my own SQL. The SQL itself is copy-and-paste, so that part is fine, but if there were worked solutions I'd want to pick up other knowledge from them, such as NULL handling.

Current sticking points: (1) the demise of bq-helper (this is the biggest headache; I can't draw on my predecessors' wisdom), (2) being a Python beginner (I just want to learn the commands I actually need), (3) being a Kaggle beginner (it feels like renting a corner at the edge of the competition hall to set up my own booth, and even the power outlets are in inconvenient spots). Having to re-run ALL after 40 minutes in the Notebook is also quietly stressful... If I carelessly run only one part I get errors, so I don't want to split things up too much. Running long SQL inside Kaggle might be tough.

Today's menu:

  1. Distinct counts vs. plain counts
  2. What information do you need to make a map? → Judge from the schema


1. Distinct counts vs. plain counts

Practice question from the exercise: (2) How many columns in the crime table have TIMESTAMP data?

I misread this question and did the processing below. Looking at the schema, it's immediately obvious that there are two TIMESTAMP columns, "date" and "updated_on", but am I supposed to write this as a query?

Maybe the point is that when a table has a huge number of columns, you sometimes hunt for types with a query? Any database that huge should be rebuilt relationally.

The bqutil.fn.typeof() function looks like it could handle types, but it doesn't seem usable here: it gives the type of individual values, whereas this question wants type information counted across the columns, independent of the data itself... I suppose it's for things like finding out how many values are STRINGs. (Then what is the schema for?) I do often use pinpoint type conversions, though.
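
That said, the question can be answered with a query. Here is a minimal sketch, assuming the same client object and my copied table path used elsewhere in this post, that counts columns by type via BigQuery's standard INFORMATION_SCHEMA.COLUMNS view:

# Count TIMESTAMP columns from the dataset's column metadata.
query = """
SELECT COUNT(*) AS n_timestamp_columns
FROM `angular-radar-411310.chicago_crime_data.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'crime'
  AND data_type = 'TIMESTAMP'
"""
query_job = client.query(query)  # Make an API request.
for row in query_job:
    print("TIMESTAMP columns:", row["n_timestamp_columns"])

If the schema shown earlier is right, this should print 2 ("date" and "updated_on").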

It's ambiguous, so I'll write up my answer with the misunderstanding intact.

Solution

query = """
SELECT COUNT(DISTINCT date) AS d_timestamp,
COUNT(date) AS timestamp

FROM `angular-radar-411310.chicago_crime_data.crime`

WHERE date IS NOT NULL


"""

query_job = client.query(query) # Make an API request.

print("Answer:")

for row in query_job:

print("individual_count={}, count={}".format(row["d_timestamp"], row["timestamp"]))

NULL handling turned out to be unnecessary (the result is the same without the WHERE clause). That said, real-world data is rarely this clean.
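
If you want to confirm that on the spot, here is one quick check against the same table, relying on the fact that COUNT(column) skips NULLs while COUNT(*) does not:

# The difference between COUNT(*) and COUNT(date) is the number of NULL dates.
query = """
SELECT COUNT(*) - COUNT(date) AS null_dates
FROM `angular-radar-411310.chicago_crime_data.crime`
"""
for row in client.query(query):
    print("NULL dates:", row["null_dates"])

A result of 0 means the WHERE clause above really was redundant.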

Result

Answer:
individual_count=2703714, count=8019976

I can't check my answers on the spot... I wonder where the official solutions are written.

individual_count is the count with duplicates removed; count is the plain count. So the answer comes out the same as in the previous question. I do verify with my own SQL before writing it down, but in the end I'm doing the same thing.

Python's print command is becoming a chore... no, I should at least know the minimum needed to display results in Kaggle, and I'd like to write an R version eventually.
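
For the record, one way to cut down on the print() boilerplate, assuming pandas is available (it is in Kaggle notebooks): QueryJob.to_dataframe() renders the whole result set at once.

# Convert the result set to a pandas DataFrame; a notebook shows it as a table.
df = client.query(query).to_dataframe()
df.head()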


2. What information do you need to make a map? → Judge from the schema

Practice question from the exercise: (3) If you wanted to create a map with a dot at the location of each crime, what are the names of the two fields you would need to pull out of the database to plot the crimes on the map?

First, display the schema again.

# Print information on all the columns in the "crime" table in the "chicago_crime_data" dataset
table.schema

Result (hard to read)

There are a lot of columns. In real work, it would probably be worth the extra step of extracting only the columns you need and rebuilding the table.
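
As a lighter-weight fix for the unreadable output, here is a small loop over the SchemaField objects, whose name and field_type attributes are part of the standard BigQuery client API; it prints one column per line:

# Print just the index, name, and type of each column -- much easier to scan.
for i, field in enumerate(table.schema):
    print(i, field.name, field.field_type)

The index printed here is also what the [15:22] slice below refers to.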

Among all this information, what we need in order to make a map is...

# Preview the first five rows of the schema fields at indexes 15-21 of the "crime" table
client.list_rows(table, selected_fields=table.schema[15:22], max_results=5).to_dataframe()

Result

The location information is around here. This displays the schema fields at indexes 15-21 (Python slicing excludes the endpoint) and the first five rows. The coordinate columns also look like location data, but given the wording of the question, the answer should be "latitude" and "longitude".
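
Incidentally, slicing the schema by index is fragile if the column order ever changes. Selecting the two fields by name should be sturdier, since list_rows() accepts any sequence of SchemaField objects; a sketch:

# Pick schema fields by name instead of by position in the schema list.
geo_fields = [f for f in table.schema if f.name in ("latitude", "longitude")]
client.list_rows(table, selected_fields=geo_fields, max_results=5).to_dataframe()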

There is also "location", but depending on the tool you use to render the map you would need to split it apart first, so I'll stick with the two fields above.
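
To close the loop on the map question, here is a minimal plotting sketch, assuming matplotlib is available and using the latitude and longitude columns identified above. A real map would use a dedicated tool, but even a plain scatter plot shows why these two fields are enough:

import matplotlib.pyplot as plt

# Pull a small sample of coordinates; rows with NULLs would be unplottable.
query = """
SELECT latitude, longitude
FROM `angular-radar-411310.chicago_crime_data.crime`
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
LIMIT 1000
"""
coords = client.query(query).to_dataframe()

# One dot per crime: longitude on the x-axis, latitude on the y-axis.
plt.scatter(coords["longitude"], coords["latitude"], s=2)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()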


