2024.02.29 14:55

KaggleでSQL-PFへの道③スキーマ（The Road to SQL Portfolios with Kaggle(3)schema）

まずはKaggleノートブックから個人のSQLへアクセスしてみます。ノートブックから自分のSQLに接続して、そこにあるひとつのテーブルのスキーマを表示するまで。まだPython慣れない。私の学生時代はC++でした（既に覚えてない）。表示＆テーブル出力は今後の検証対象に。あとKaggleは英語記述だから、ここでの経過は日本語訳で載せます。ブログ後半の英語表記が原本です。手間になったら訳やめよう。

画像だとレスポンシブ表示が不便なので長いとこは省略しつつ、転載部分は引用符で書きます。ノートブックの+CodeボタンでPython記述しているところは、「＞＞＞+Codeでの記述＜＜＜」を追加しています。（実際のコード内コメントアウトは「#」です）

冒頭テンプレ
Google Big Query とつなぐ
pipインストール
衝突エラーについて
自分のデータセットを覗く
テーブルのスキーマを表示させる

１．冒頭テンプレ

ノートブック開くと自動で入ってくるチュートリアル。慣れた方は削るんでしょうね。←訳すまでちゃんと読んでなかった

# このPython 3環境には、多くの有用な分析ライブラリがインストールされています。
# kaggle/python Dockerイメージで定義されています: https://github.com/kaggle/docker-python
# 例えば、以下のような有用なパッケージがロードされています。
import numpy as np # 線形代数
import pandas as pd # データ処理、CSVファイル入出力 (例: pd.read_csv)
--- 中略 ---
# カレントディレクトリ(/kaggle/working/)に最大20GBまで書き込むことができ、"Save & Run All "を使ってバージョンを作成した際に出力として保存されます。
# /kaggle/temp/に一時ファイルを書き込むこともできますが、それらは現在のセッションの外には保存されません。

自動で入ってくるのでありがたく触らずスルー。

２．Google Big Query とつなぐ

個人SQLを呼び出します。#の行はコメントアウトの説明部分です。

＞＞＞+Codeでの記述＜＜＜
# BigQuery
from google.cloud import bigquery
bigquery_client = bigquery.Client(project='●●●●●')

●●●●●が、個人のプロジェクトIDです。GooglecloudでSQLワークスペースを使うとき、最初のプロジェクト名は「My First Project」です。これに固有のIDが付きます。たぶん無料だと作れるプロジェクトは３つまでだったような。配下にデータセットとテーブルがぶらさがっていきます。

Python、出力命令しないと結果出してくれないから、上記のコード書いて実行しても返事ありません。SQLに慣れてると肩透かしな気分。反応が欲しかったら

＞＞＞+Codeでの記述＜＜＜
print("Setup Complete")

とか入れておけば返事が来ます。

３．pipインストール

＞＞＞+Codeでの記述＜＜＜
pip install --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]'

ここは前回②でも書きましたが、Shellでの最新版アップデートとつながってるような気がする。ちなみに、pandas のないPython 用 BigQuery クライアントライブラリのインストールコードは以下です。

＞＞＞+Codeでの記述（pandasいらない人向け）＜＜＜
pip install --upgrade google-cloud-bigquery

私はpandasほしいので最初ので。

＞＞＞結果＜＜＜
既に満たされた要件: /opt/conda/lib/python3.10/site-packages (0.21.0) の pandas-gbq
すでに満たされている要件: /opt/conda/lib/python3.10/site-packages (3.17.2) の google-cloud-bigquery[bqstorage,pandas] (google-cloud-bigquery[bqstorage,pandas])
既に満たされた要件: setuptools in /opt/conda/lib/python3.10/site-packages (from pandas-gbq) (68.1.2)
--- 以下略 ---

ダララっと結果が出ます。反応多いとちょっと自分がすごいことしたみたいな気分になりますが錯覚です。インストール頑張ってくれただけ。

４．衝突エラーについて

Notebook を開いて、最初にコード実行すると、pip部分でエラーがでる。

ERROR: pipの依存関係解決ツールは、現在インストールされているすべてのパッケージを考慮していません。この振る舞いが、以下の依存関係の衝突の原因となっています。
beatrix-jupyterlab 2023.814.150030はjupyter-server~=1.16を必要としますが、jupyter-server 2.12.1があり、互換性がありません。
beatrix-jupyterlab 2023.814.150030にはjupyterlab~=3.4が必要ですが、jupyterlab 4.0.5がインストールされているため互換性がありません。
google-cloud-aiplatform 0.6.0a1はgoogle-api-core[grpc]<2.0.0dev,>=1.22.2を必要としますが、google-api-core 2.11.1がインストールされているため互換性がありません。
google-cloud-aiplatform 0.6.0a1はgoogle-cloud-bigquery<3.0.0dev,>=1.15.0が必要ですが、google-cloud-bigquery 3.17.2があるため互換性がありません。

最新パッケージは過去を補完しないの？と思ったけど、再度Runかけると消えるので、そういうもんだと思うことにした。

５．自分のデータセットを覗く

Kaggle で用意されている初心者用エクササイズを参考に、公開データの「Crime in the city of Chicago」を使っていこうと思います。

あらかじめGoogleのMyProjectに入れておく。これは自分のデータセットにつなげられたけど、権限ないのもあるんだよなー。FromでPublicData指定して結果を持ってくるとかかな。

２の段階で、「bigquery_client」には私のSQLプロジェクトが指定されています。

＞＞＞+Codeでの記述＜＜＜
# データセット "chicago_crime_data" 内の全てのテーブルをリストアップする。
tables = list(bigquery_client.list_tables('chicago_crime_data'))
# データセット内のすべてのテーブルの名前を表示する
for table in tables:
    print(table.table_id)

＞＞＞結果＜＜＜
crime

でた。「chicago_crime_data」の配下には、テーブル「crime」のみがあります。

このPythonの書き方に慣れなきゃ。いったんtablesに入れて for table in tables って、tablesって変数なん？（チュートリアルコピペなので判断が危うい）次は独自名にしよう。print() 行はテンプレのようだ。

６．テーブルのスキーマを表示させる

chicago_crime_data.crime のスキーマを表示させます。

＞＞＞+Codeでの記述＜＜＜
# "chicago_crime_data" データセットへの参照を構築する。
dataset_ref = bigquery_client.dataset("chicago_crime_data", project="●●●●●")
table_ref = dataset_ref.table("crime")
# APIリクエスト - テーブルを取得
table = bigquery_client.get_table(table_ref)
table.schema

dataset_ref にデータセットいれて、 table_ref にデータセットのcrimeテーブル入れて、 table に入れて、スキーマって。まわりくどい。これがテンプレなのか。

＞＞＞結果＜＜＜
[SchemaField('unique_key', 'INTEGER', 'REQUIRED', None, (), None),
SchemaField('case_number', 'STRING', 'NULLABLE', None, (), None),
SchemaField('date', 'TIMESTAMP', 'NULLABLE', None, (), None),
SchemaField('block', 'STRING', 'NULLABLE', None, (), None),
--- 以下略 ---

おお、出た。見た目微妙だけど。crimeテーブルに入っているデータの設定たちです。()内左から('フィールド名','種類','モード','キー','照合','デフォルト値')ですかね。前半三つしか設定されてないので、後半はNoneです。

やたー。ちゃんとテーブル指定してるから共有データではない。これなら自分のSQL結果が見られそう。次はNotebook内のPython記述でSQLを直接動かせるかやってみます。あまり大きいと容量食うのかな。そのときは結果だけポートフォリオに表示すればいいか。

********

First, I'll try accessing my own SQL workspace from a Kaggle notebook. This post covers connecting to it from the notebook, up to displaying the schema of one of its tables.

I'm still getting used to Python. In my school days it was C++ (which I no longer remember). Display and table output are something to look into later. Also, Kaggle is written in English, so I post the progress here in Japanese translation. The English text in the second half of the blog is the original. I'll stop translating if it becomes too much trouble.

Images don't work well with responsive display, so I write the quoted parts as text and omit the long sections. Where I wrote Python using the notebook's +Code button, I've added the label ">>> Description with +Code <<<" (actual comments inside the code use "#").

The opening template
Connecting to Google Big Query
pip installation
About collision errors
A look inside my dataset
Viewing the table schema

1. The opening template

This is the tutorial text that is filled in automatically when you open a notebook. I guess people who are used to it delete it. ← I didn't really read it until I translated it.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
--- omission ---
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

It is inserted automatically, so I gratefully left it untouched and moved on.

2. Connecting to Google Big Query

This calls my personal SQL project. The lines starting with # are explanatory comments.

>>> Description with +Code <<<
# BigQuery
from google.cloud import bigquery
bigquery_client = bigquery.Client(project='●●●●●')

●●●●● is your personal project ID. When you use the SQL workspace in Google Cloud, the first project is named "My First Project", and a unique ID is attached to it. I think you can create up to 3 projects on the free tier, though I'm not sure. Datasets and tables hang under the project.
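
To picture that hierarchy: BigQuery addresses a table with a dotted path of the form project.dataset.table. The sketch below only builds and splits such strings. The helper names are my own, "my-project-id" is a made-up placeholder for the masked ID, and it assumes the project ID itself contains no dots.

```python
def full_table_id(project: str, dataset: str, table: str) -> str:
    """Join the three levels of the hierarchy into one dotted ID."""
    return f"{project}.{dataset}.{table}"


def split_table_id(table_id: str) -> tuple[str, str, str]:
    """Split a dotted ID back into (project, dataset, table)."""
    project, dataset, table = table_id.split(".")
    return project, dataset, table


# "my-project-id" is a hypothetical placeholder, not a real project.
table_id = full_table_id("my-project-id", "chicago_crime_data", "crime")
print(table_id)
print(split_table_id(table_id))
```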

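Python doesn't show anything unless you tell it to output something, so running the code above gives no response. If you are used to SQL, this feels like an anticlimax.

If you want some feedback, add something like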

>>> Description with +Code <<<
print("Setup Complete")

and you will get a reply.

3. pip installation

>>> Description with +Code <<<
pip install --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]'

As I wrote in the previous post (2), I suspect this is connected to the latest-version update I ran in the Shell. By the way, here is the installation command for the BigQuery client library for Python without pandas.

>>> Description with +Code(For people who don't need pandas) <<<
pip install --upgrade google-cloud-bigquery

I want pandas, so I went with the first one.

>>> result <<<
Requirement already satisfied: pandas-gbq in /opt/conda/lib/python3.10/site-packages (0.21.0)
Requirement already satisfied: google-cloud-bigquery[bqstorage,pandas] in /opt/conda/lib/python3.10/site-packages (3.17.2)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from pandas-gbq) (68.1.2)
--- omission ---

The output comes scrolling out. Getting that much response might make you feel like you've done something great, but that's an illusion. The installer just did its job.
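
If you would rather check the versions without scrolling through pip's output, here is a minimal sketch using the standard library's importlib.metadata. The helper name installed_version is my own, and I'm assuming the two package names match what pip installed above.

```python
from importlib import metadata


def installed_version(package_name: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints each package name with its version (or None if it is not installed).
for name in ("pandas-gbq", "google-cloud-bigquery"):
    print(name, installed_version(name))
```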

4. About collision errors

When I open the notebook and run the code for the first time, the pip cell reports an error.

>>> result <<<
--- omission ---
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
beatrix-jupyterlab 2023.814.150030 requires jupyter-server~=1.16, but you have jupyter-server 2.12.1 which is incompatible.
beatrix-jupyterlab 2023.814.150030 requires jupyterlab~=3.4, but you have jupyterlab 4.0.5 which is incompatible.
google-cloud-aiplatform 0.6.0a1 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.11.1 which is incompatible.
google-cloud-aiplatform 0.6.0a1 requires google-cloud-bigquery<3.0.0dev,>=1.15.0, but you have google-cloud-bigquery 3.17.2 which is incompatible.

Aren't the latest packages supposed to cover what the older ones need? That's what I wondered, but the error disappears when I run the cell again, so I decided to accept that this is just how it is.
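
For what it's worth, the "~=" in those messages is pip's "compatible release" operator: ~=1.16 means at least 1.16 but still inside the 1.x series, so a newer 2.12.1 counts as incompatible. Below is a rough sketch of that rule for plain numeric versions only. The function names are my own, and real tools handle many more cases (pre-releases, the mixed ranges on the google-cloud-aiplatform lines, and so on).

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.12.1' into (2, 12, 1). Only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies_compatible_release(installed: str, minimum: str) -> bool:
    """Check `installed` against the requirement `~=minimum`."""
    inst = parse_version(installed)
    mini = parse_version(minimum)
    if len(mini) < 2:
        raise ValueError("~= needs at least two version components")
    # Pad with zeros so that 1.16 and 1.16.0 compare as equal.
    width = max(len(inst), len(mini))
    inst_padded = inst + (0,) * (width - len(inst))
    mini_padded = mini + (0,) * (width - len(mini))
    # Everything except the last given component must stay the same.
    prefix = mini[:-1]
    return inst_padded >= mini_padded and inst_padded[: len(prefix)] == prefix


print(satisfies_compatible_release("2.12.1", "1.16"))  # False (jupyter-server)
print(satisfies_compatible_release("4.0.5", "3.4"))    # False (jupyterlab)
print(satisfies_compatible_release("1.24.0", "1.16"))  # True
```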

5. A look inside my dataset

Following the beginner exercises provided by Kaggle, I'm going to use the public dataset "Crime in the city of Chicago".

I put it into my own Google project (My First Project) beforehand. I was able to connect to it because it is my own dataset, but there are some datasets I don't have permission for. In that case I guess I could specify the public data in FROM and pull the results from there.

In step 2, my SQL project was specified in "bigquery_client".

>>> Description with +Code <<<
# List all the tables in the "chicago_crime_data" dataset
tables = list(bigquery_client.list_tables('chicago_crime_data'))
# Print names of all tables in the dataset
for table in tables:
    print(table.table_id)

>>> result <<<
crime

There it is. Under "chicago_crime_data" there is only the one table, "crime".

I have to get used to this Python way of writing. You first put the result into tables and then write "for table in tables", so is "tables" a variable? (I copied and pasted from the tutorial, so my judgment is shaky.) Next time I'll try my own names. The print() line looks like boilerplate.
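
As far as I understand it, tables is just an ordinary variable holding a list, and table in the for line is another variable that takes each item in turn, so both names are free choices. Here is a small sketch with my own names. To keep it runnable without BigQuery credentials it uses stand-in objects instead of the real list_tables() result, and "another_table" is made up.

```python
from types import SimpleNamespace

# Stand-ins for the items that list_tables() returns. Like the real ones
# in the code above, each exposes a table_id attribute.
crime_dataset_tables = [
    SimpleNamespace(table_id="crime"),
    SimpleNamespace(table_id="another_table"),  # hypothetical second table
]

# The loop works the same with any variable names.
for one_table in crime_dataset_tables:
    print(one_table.table_id)

# The names can also be collected into a plain list of strings.
table_names = [one_table.table_id for one_table in crime_dataset_tables]
```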

6. Viewing the table schema

Display the schema for chicago_crime_data.crime.

>>> Description with +Code <<<
# Construct a reference to the "chicago_crime_data" dataset
dataset_ref = bigquery_client.dataset("chicago_crime_data", project="●●●●●")
table_ref = dataset_ref.table("crime")
# API request - fetch the table
table = bigquery_client.get_table(table_ref)
table.schema

dataset_ref holds a reference to the dataset, table_ref holds a reference to the crime table inside it, table holds the fetched table, and only then do I ask for the schema. That feels roundabout. Is this the standard pattern?
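
A possibly shorter route: as far as I know, get_table() in google-cloud-bigquery also accepts the table ID as a string such as "dataset.table" or "project.dataset.table", which skips the two reference steps. The sketch below wraps that in a helper of my own (fetch_schema) and runs it against a stub client so it works without credentials. The stub is purely hypothetical; in the notebook you would pass bigquery_client instead.

```python
from types import SimpleNamespace


def fetch_schema(client, table_path: str):
    """Fetch a table by its dotted path and return its schema."""
    table = client.get_table(table_path)  # one API request with a real client
    return table.schema


class StubClient:
    """Hypothetical stand-in for bigquery.Client, for illustration only."""

    def get_table(self, table_path):
        # A real client returns a Table object; this just echoes the path.
        return SimpleNamespace(schema=[f"schema of {table_path}"])


print(fetch_schema(StubClient(), "chicago_crime_data.crime"))
```

In the notebook this would be fetch_schema(bigquery_client, "chicago_crime_data.crime"), assuming the client was created with the project ID as in step 2.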

>>> result <<<
[SchemaField('unique_key', 'INTEGER', 'REQUIRED', None, (), None),
SchemaField('case_number', 'STRING', 'NULLABLE', None, (), None),
SchemaField('date', 'TIMESTAMP', 'NULLABLE', None, (), None),
SchemaField('block', 'STRING', 'NULLABLE', None, (), None),
--- omission ---

Oh, there it is. It doesn't look great, though.

These are the settings for the columns in the crime table. Reading each bracket from left to right, the first three values are the field name, type, and mode. My guess is that the remaining ones are key, collation, and default value, but I haven't checked that against the documentation. Only the first three are set, so the rest are None or empty.
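
To avoid guessing what each position means, the schema fields can be read by attribute name instead. SchemaField objects have name, field_type, mode, and description attributes. The sketch below uses a stand-in class of my own (FieldStandIn) with the same four attributes so it runs without BigQuery; in the notebook you would pass table.schema to schema_rows instead of sample_schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldStandIn:
    """Stand-in with the four attributes this sketch reads from a schema field."""
    name: str
    field_type: str
    mode: str
    description: Optional[str] = None


def schema_rows(schema):
    """Return (name, type, mode, description) tuples, one per field."""
    return [(f.name, f.field_type, f.mode, f.description or "") for f in schema]


# Values copied from the first lines of the result above.
sample_schema = [
    FieldStandIn("unique_key", "INTEGER", "REQUIRED"),
    FieldStandIn("case_number", "STRING", "NULLABLE"),
    FieldStandIn("date", "TIMESTAMP", "NULLABLE"),
    FieldStandIn("block", "STRING", "NULLABLE"),
]

# Print one aligned line per column.
for name, field_type, mode, description in schema_rows(sample_schema):
    print(f"{name:<12} {field_type:<10} {mode:<9} {description}")
```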

Yay! Since I'm specifying the table explicitly, this isn't the shared data. It looks like I'll be able to view my own SQL results this way. Next, I'll see whether I can run SQL directly from Python code in the notebook. If the results are too large, they might take up a lot of space. In that case I'll just display the results in the portfolio.

DATA idm8

Aim for a comprehensive analysis. Data-informed decision making. データ分析／著作権・知的財産マネジメント