【SQL】日付、犯罪タイプ、位置情報を取得(Get date, crime type and location information)

位置情報を取り出します。可視化のために情報を制限します。10年ぶん、犯罪上位5タイプ、NULL除外です。

  • 使用データ:[ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • 期間:2010-01-01 ~ 2020-12-31(10年間)
  • 犯罪件数トップ5(THEFT(窃盗), BATTERY(暴行), CRIMINAL DAMAGE(器物損壊), NARCOTICS(麻薬), ASSAULT(脅迫))
  • 位置情報(緯度/経度、それぞれ)

上記条件で2,075,883件、うち緯度経度NULLは11,353件。

緯度経度なくてもblock(ストリート)情報はあったりする。(という確認もかねてblock取り出し)



【SQL】

SELECT
year, date, primary_type, latitude, longitude, block
FROM `angular-radar-411310.chicago_crime_data.crime`

WHERE

 date BETWEEN "2010-01-01" AND "2020-12-31"

 AND date IS NOT NULL

 AND primary_type IN ('THEFT','BATTERY','NARCOTICS','CRIMINAL DAMAGE','BURGLARY')

 AND latitude IS NOT NULL

ORDER BY date


【結果】データ数:2,064,530件

今さらなんですけど、データに「year」ありました。以前の記事で「FORMAT_DATETIME("%Y", date) AS year」 とかやってましたけど、いらんかった。うーん、しかし手打ち生データだからな…練習だったと思おう。

上位5位取り出し、3位と迷うところ。。容量と可視化のごちゃ感と…多いのは省けるけど足りないとまた取り出さないといけないので、いったんこれで進めます。



********


Extract location information. Restrict information for visualisation: decade, top 5 crime types, NULL exclusion.

  • Data used: [ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • Period: 2010-01-01 - 2020-12-31 (10 years)
  • Top 5 crimes (THEFT, BATTERY, VIRUSES, NARCOTICS, ASSAULT)
  • Location (latitude/longitude)

2,075,883 cases under the above conditions, of which 11,353 were latitude/longitude NULL.

There may be block (street) information without latitude/longitude. (Block retrieved to confirm).


【SQL】

SELECT
year, date, primary_type, latitude, longitude, block
FROM `angular-radar-411310.chicago_crime_data.crime`
WHERE
 date BETWEEN "2010-01-01" AND "2020-12-31"
 AND date IS NOT NULL
 AND primary_type IN ('THEFT','BATTERY','NARCOTICS','CRIMINAL DAMAGE','BURGLARY')
 AND latitude IS NOT NULL
ORDER BY date


[Results] Number of data: 2,064,530

I have just noticed that there is a 'year' column in the data.

I did 'FORMAT_DATETIME('%Y', date) AS year' in a previous article, but I didn't need it. Hmmm, but this is hand-typed data that gets updated from time to time... let's just say it was an exercise.

I'm on the fence. if I should take the top five or the top three. Will it look messy in terms of capacity and visualisation? If it's too much, I can leave it out, but if it's not enough, I'll have to take it out again. So once I've done. I'll go ahead with it.