【SQL】ウィンドウ関数:累積相対度数(window function:Cumulative relative degree)

WITH句を使わないFROM入れ子形式、ウィンドウ関数SUMを使って累積相対度数を出します。ツールによっては自動計算してくれそうだが。年ごとに最後は1になるので、ダブルチェック兼ねて。

  • 使用データ:[ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • 期間:2010-01-01 ~ 2020-12-31(10年間)


年別の犯罪タイプ上位からの足し上げと累積相対度数(ウィンドウ関数)

【SQL】

SELECT year, primary_type, sum_year,
SUM(sum_year) OVER (PARTITION BY year ORDER BY sum_year DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS sum_of_year,
ROUND((SUM(sum_year) OVER (PARTITION BY year ORDER BY sum_year DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) / type_of_year), 6) AS crf


FROM (

 SELECT

 FORMAT_DATETIME("%Y", date) AS year, primary_type,

 COUNT(primary_type) AS sum_year,

 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"

 AND date IS NOT NULL

 GROUP BY year, primary_type

 ORDER BY 1, 3 DESC )

JOIN

( SELECT

 FORMAT_DATETIME("%Y", date) AS year,

 COUNT(primary_type) AS type_of_year

 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"

 AND date IS NOT NULL

 GROUP BY FORMAT_DATETIME("%Y", date)

) USING(year)


ORDER BY year, sum_year DESC

  • year:年
  • primary_type:(元データ)逮捕容疑
  • sum_year:各年それぞれ、タイプ別の合計
  • type_of_year:各年のトータル件数
  • sum_of_year:各年ごとの犯罪タイプ足し上げ件数
  • crf:足し上げ「sum_of_year」を年総数「type_of_year」で割った比率


>>流れ>>

  1. (FROM) 「年ごとの犯罪タイプ別件数」と「年ごとの総件数」を、year キーでJOIN
  2. (SELECT) ウィンドウ関数SUM : yearでパーティション、犯罪タイプ別件数が多い順に並び替えます。フレームはパーティションの上から下まで
  3. (SELECT) 2.で出した「sum_of_year」(年別の犯罪足し上げ数)を年の総数「type_of_year」で割る=年におけるその順位までの割合を出す
  4. (ORDER BY) year, sum_year DESC


【結果】

画像は2010年のみですが、実際は2020年まで繰り返されています。累積相対度数のいいところって、例えば画像で言うと、上位3位で半分(50%)占めてることがすぐわかる点ですね。

ウィンドウ関数に並び替えが含まれるから、一番外でORDERしたりFROM内でORDERしてるのは、重複感があります。ただ、私は記述中に部分で動きを確認するので(こういうときGoogle Big Query のUIはとても便利)そのままにしてます。


Rで累積相対度数

このデータを基に、Rでも表現しようかと思いましたが、わかりにくいかもしれない。

まず年をひとつ選んで、X軸に犯罪タイプを入れるとしても、連続値ではないから1位から順に数値を割り当てる。これは腹落ちしにくい。単純に年の連続値にするなら「年×年の犯罪件数」にしたほうが分かりやすいけど、それを累積相対度数にする意味がない。

データ入力と相対度数を出すコマンドの練習にはなるけど…うーん、カテゴリでするのはなんか違う。


********


FROM nested format without WITH clause, using the SUM window function to get the cumulative relative frequencies. Some tools may do this automatically. Each year will end in 1, so double-check as well.

  • Data used: [ Google Cloud Public Datasets] chicago_crime_data (Chicago Police Department crime data from 2001 to present)
  • Period: 2010-01-01 - 2020-12-31 (10 years)


Footnotes and cumulative relative frequencies from top crime types by year (window function)

【SQL】

SELECT year, primary_type, sum_year,
SUM(sum_year) OVER (PARTITION BY year ORDER BY sum_year DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS sum_of_year,
ROUND((SUM(sum_year) OVER (PARTITION BY year ORDER BY sum_year DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) / type_of_year), 6) AS crf
FROM (
 SELECT
 FORMAT_DATETIME("%Y", date) AS year, primary_type,
 COUNT(primary_type) AS sum_year,
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"
 AND date IS NOT NULL
 GROUP BY year, primary_type
 ORDER BY 1, 3 DESC )
JOIN
( SELECT
 FORMAT_DATETIME("%Y", date) AS year,
 COUNT(primary_type) AS type_of_year
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE date BETWEEN "2010-01-01" AND "2020-12-31"
 AND date IS NOT NULL
 GROUP BY FORMAT_DATETIME("%Y", date)
) USING(year)

ORDER BY year, sum_year DESC

  • year : year
  • primary_type : (original data) arrest charges
  • sum_year : each year respectively, total by type
  • type_of_year : total number of cases for each year
  • sum_of_year : number of crime types added up for each year
  • crf : ratio of addition 'sum_of_year' divided by total number of years 'type_of_year'


>> Workflow >>

  1. (FROM) JOIN 'Number of crimes by type per year' and 'Total number of crimes per year' with year key
  2. (SELECT) SUM window function : sort by partition, number of crimes by type of crime by year in ascending order. Frame from top to bottom of partition.
  3. (SELECT) "sum_of_year" (total number of crimes by year) from 2. divided by total number of years "type_of_year" = percentage for this rank in the year
  4. (ORDER BY) year, sum_of_years DESC

【Result】

The image only shows the year 2010, but in reality it is repeated until 2020. The nice thing about the cumulative relative frequencies is that, for example, in the image, you can easily see that the top three positions account for half (50%) of the total.

The window function contains a sorting 'ORDER'. And yet there is a sense of duplication in "ORDER" on the outside and "ORDER" in the FROM. However, I check the movement in parts during the cording (Google Big Query's UI is very useful in such cases), so I leave it as it is.


Cumulative relative frequencies in R

Based on this data, I thought about expressing it in R as well, but it might be difficult to understand.

Even if I chose a year first and then put the type of crime on the x-axis, it would not be a continuous value, so hypothetically I would have to assign values in order from first to last. This is difficult for the viewer to understand. If you simply want to make it a continuous value for the year, it would be easier to understand if you made it "number of crimes in year x year", but there is no point in making it a cumulative relative frequency.

It's good practice for data entry and commands to produce relative frequencies, but... well, not better this time.

DATA idm8

Aim for a comprehensive analysis. Data-informed decision making. データ分析/著作権・知的財産マネジメント