【SQL】CTE:全体件数とタイプ別割合(Total number of cases and ratio by type)

ここからは、公開データ「chicago_crime_data」を使って、データの加工と可視化をしていきます。 NULLのないきれいなデータです。

  1. 全体の件数を出して、タイプ別の割合を表示(サブクエリ)
  2. データの開始-終了日時、全体の件数

1.全体の件数を出して、タイプ別の割合を表示(サブクエリ)

【SQL】

WITH  crime_sum AS (
SELECT
 COUNT(unique_key) AS total_crime
FROM `angular-radar-411310.chicago_crime_data.crime` )

SELECT
 primary_type,
 COUNT(primary_type) AS category_num ,
 ROUND(COUNT(primary_type) / MAX(total_crime), 2) AS ratio
FROM `angular-radar-411310.chicago_crime_data.crime`, crime_sum

WHERE date IS NOT NULL

GROUP BY primary_type

ORDER BY category_num DESC

LIMIT 20

結果

  • primary_type:(元データ)逮捕容疑
  • category_num:タイプ別のの合計
  • ratio:全体からみた比率

全体で36種類ありました。17位以降は0%なので、省略。


THEFT(窃盗), BATTERY(暴行), CRIMINAL DAMAGE(器物損壊), NARCOTICS(麻薬), ASSAULT(脅迫)と続きます。

assault と battery の違い 
《法律》脅迫と暴行(罪)◆assaultが暴行を加える可能性があることを知らせて恐怖心を持たせること、batteryが違法な肉体的接触を意味し、それらが関連して行われること。assaultは肉体的接触を伴わないので、言葉による脅しや相手に拳銃を向ける行為などが含まれる。(英辞郎 on the WEB)


WITH句のこと、某教材で「サブクエリ」と覚えてたんですが、CTE(Common Table Expression)のほうが一般的のようです。別教材で知ったときは混乱しました。


2.データの開始-終了日時、全体の件数

3万件で0%になるって全体が何件なの、というのはスキーマでわかるんですが、期間と年あたりの平均も知りたいので出してみます。今後の可視化で使えそう。

【SQL】

SELECT
  MIN(date) AS first_day,
  MAX(date) AS last_day,
  MAX(EXTRACT(YEAR FROM date)) - MIN(EXTRACT(YEAR FROM date)) AS period,
  COUNT(unique_key) AS total_crime,
  ROUND(COUNT(unique_key) / (MAX(EXTRACT(YEAR FROM date)) - MIN(EXTRACT(YEAR FROM date))), 1) AS year_avg
FROM   `angular-radar-411310.chicago_crime_data.crime`

結果

  • first_day, last_day:最初の記録日と最後の記録日
  • period:データの期間
  • total_crime, year_avg:総件数と年平均

シカゴの犯罪件数、年間平均34万件もある。しかも更新される生データだった…あ、となるとWHEREで期間を指定しないと、年平均じゃないですね。今年は3か月分しか入ってない。

データセットの補足情報に、以下の注意書きがありました。

このデータには、警察署に提供された未検証の報告書が含まれています。予備的な犯罪分類は、追加の調査に基づいて後日変更される可能性があり、機械的または人的エラーの可能性が常にあります。したがって、シカゴ警察は、情報の正確性、完全性、適時性、または正しい順序を (明示的または黙示的に) 保証しません。この情報は、長期にわたる比較の目的で使用しないでください。

ということで、次からは期間を区切っていこうと思います。


<参考>

使用データ:[ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)

英辞郎 on the WEB:assault and batteryとは


********


From here, the public data 'chicago_crime_data' is used to process and visualise the data. The data is clean and has no nulls.

  1. The total number of cases and the percentage of each type
  2. Start - End date and time of data, total number of cases


1. The total number of cases and the percentage of each type

[SQL]

WITH crime_sum AS (
SELECT
 COUNT(unique_key) AS total_crime
FROM `angular-radar-411310.chicago_crime_data.crime` )
SELECT

 primary_type,

 COUNT(primary_type) AS category_num ,

 ROUND(COUNT(primary_type) / MAX(total_crime), 2) AS ratio

FROM `angular-radar-411310.chicago_crime_data.crime`, crime_sum

WHERE date IS NOT NULL

GROUP BY primary_type

ORDER BY category_num DESC

LIMIT 20

[Result]

primary_type: (original data) arrest charges

category_num: total by type

ratio: ratio of total.

There were 36 types in total; after 17th place there was 0%, so they were omitted.

THEFT, BATTERY, CRIMINAL DAMAGE, NARCOTICS and ASSAULT follow.

Difference between ASSAULT and BATTERY 
◆assault means to put someone in fear by informing them of a possible assault; battery means unlawful physical contact, and they are related.
Assault does not involve physical contact and includes verbal threats and pointing a gun at the person. (Eijiro on the WEB)

I remembered WITH clause as a 'subquery' in some material, but CTE (Common Table Expression) seems to be more common. I was confused when I found it in another material.


2. Start - End date and time of data, total number of cases

How many cases are there overall? I can see that in the schema, but I also want to know the average per period and per year, so I'll put it out there. This could be used in future visualisations.

【SQL】

SELECT
 MIN(date) AS first_day, MAX(date) AS last_day,
 MAX(EXTRACT(YEAR FROM date)) - MIN(EXTRACT(YEAR FROM date)) AS period,
 COUNT(unique_key) AS total_crime,
 ROUND(COUNT(unique_key) / (MAX(EXTRACT(YEAR FROM date)) - MIN(EXTRACT(YEAR FROM date))), 1) AS year_avg
FROM  `angular-radar-411310.chicago_crime_data.crime`

[Result]

  • first_day, last_day: date of first and last data set
  • period: duration of data
  • total_crime, year_avg: total number and yearly average

The number of crimes in Chicago, an average of 340,000 per year. And it was raw data that was updated... Oh, and if I don't specify the period in WHERE, it's not an annual average. This year only three months are included.

Additional information in the dataset included the following notes.

This data includes unverified reports made to police stations. Preliminary crime classifications are subject to change at a later date based on additional investigation, and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department makes no warranty (express or implied) as to the accuracy, completeness, timeliness, or correct sequencing of the information. This information should not be used for long-term comparisons.

Therefore, I will separate the time periods in the following sections.


<Reference

Data used: [ Google Cloud Public Datasets] chicago_crime_data (Chicago police crime data from 2001 to present).

Eijiro on the WEB: What is assault and battery?

DATA idm8

Aim for a comprehensive analysis. Data-informed decision making. データ分析/著作権・知的財産マネジメント