【SQL】CTE:年別総数とタイプ別割合(Total number by year and ratio by type)

データが随時更新だった、という前回の反省を踏まえて、今後はあらかじめ期間を指定してから作業します。

  • 使用データ:[ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • 期間:2010-01-01 ~ 2020-12-31(10年間)

年別の犯罪タイプ件数と、年間総数における割合(サブクエリ)

【SQL】

WITH total_count AS(
 SELECT
 FORMAT_DATETIME("%Y", date) AS year,
 COUNT(primary_type) AS sum_year,
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"
AND date IS NOT NULL
 GROUP BY FORMAT_DATETIME("%Y", date)
), type_year AS(
 SELECT

 FORMAT_DATETIME("%Y", date) AS year, primary_type,

 COUNT(primary_type) AS primary_count

 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"

 AND date IS NOT NULL

 GROUP BY FORMAT_DATETIME("%Y", date), primary_type

)


SELECT

 year, primary_type, primary_count,

 ROUND(((primary_count / sum_year) * 100 ),2 ) AS type_percent,

 sum_year

FROM total_count JOIN type_year USING(year)

ORDER BY year, primary_count DESC

【結果】

  • year:年
  • primary_type:(元データ)逮捕容疑
  • primary_count:各年それぞれ、タイプ別の合計
  • type_percent:各年それぞれ、全体からみた比率
  • sum_year:年ごとの合計件数


>>流れ>>

  1. サブクエリ「total_count」:sum_yearを出す
  2. サブクエリ「type_year」:年ごとの犯罪タイプ件数
  3. メーン:(FROM)サブクエリをyearキーでINNNER JOIN
  4. (SELECT)primary_count を sum_year で割って%計算
  5. (ORDER BY)年と件数で並び替え


元データがDATETIME型なので、「FORMAT_DATETIME("%Y",year)」を使って年に統一します。WHEREとFORMAT_DATETIME、IS NOT NULLあたりうまく合体できないかと思いつつ、視点が違うので別サブクエリに。

きっとうまく統合してスタイリッシュに書けたところで、解説を書いておかないと後で忘れる。つまり、下手な遠回りだったり冗長的なコードでも、今の自分のレベルに合った書き方をしないと自分ですら読み解けなくなる。

個別カウントと総数を同じ列に並べて表示させたことないな…インポートしたExcel表で「合計」行を除外することは多々。可視化だと積み上げ棒で合計値が分かるけど、分析で使う表と一般的な表とで、たまにズレを感じる。


type_peacent はツールによっては自動で出てくるので、データ提出しないなら計算しなくていいかも。このデータだけである程度可視化できますね。


*******


Based on the previous consideration that the data were updated at any time, in the future work will be done after specifying the period in advance.

  • Data used: [ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • Period: 2010-01-01 - 2020-12-31 (10 years)


Number of crime types by year and percentage of total number of crimes per year (CTE)

【SQL】

WITH total_count AS(
 SELECT
 FORMAT_DATETIME("%Y", date) AS year,
 COUNT(primary_type) AS sum_year,
 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"

AND date IS NOT NULL

 GROUP BY FORMAT_DATETIME("%Y", date)

), type_year AS(

 SELECT

 FORMAT_DATETIME("%Y", date) AS year, primary_type,

 COUNT(primary_type) AS primary_count

 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"

 AND date IS NOT NULL

 GROUP BY FORMAT_DATETIME("%Y", date), primary_type

)

SELECT

 year, primary_type, primary_count,

 ROUND(((primary_count / sum_year) * 100 ),2 ) AS type_percent,

 sum_year

FROM total_count JOIN type_year USING(year)

ORDER BY year, primary_count DESC

【Result】

  • YEAR: Year
  • primary_type: (original data) arrest charges
  • primary_count: for each year, total for each type
  • type_percent: for each year, % of total
  • sum_year: total number of cases per year


>> Workflow >> 

  1. Sub-query 'total_count': returns sum_year
  2. Sub-query 'type_year': number of types of crime per year
  3. Main: (FROM) INNER JOIN sub-query with year key
  4. (SELECT) primary_count divided by sum_year to calculate %.
  5. (ORDER BY) order by year and number of cases


Since the original data is of type DATETIME, I used "FORMAT_DATETIME("%Y",year)" to unify it with "year". I thought it would be possible to merge WHERE, FORMAT_DATETIME and IS NOT NULL, but the perspective is different, so I made another sub-query.

I'm sure that if I successfully merge them and write them stylishly, I'll forget them later if I don't write an explanation.

In other words, even if it's a bad structure or redundant code, if you don't write it in a way that suits your current level, even you won't be able to read and understand it.

I've never had individual counts and totals in the same column... although I often exclude the 'totals' row in imported Excel tables.

In the visualisation I can see the totals in the stack bars, but sometimes I feel a discrepancy between the tables used in the analysis and the general tables.


The 'type_peacent' = 'percentage by crime' is automatically calculated in some tools. I am the type to calculate it for double checking.

We can visualise it to some extent with just this data.