【SQL】犯罪多発ブロックの期間増減(Periodic increase/decrease in certain high crime blocks)

実際住んでたら、ブロック名はわかりますよね。「あのへん危ない」がわかれば、送り迎えで通らないとか、繁華街とか、住人だからなんとなく聞いた噂とか。言い換えれば、それが周知だからこそ、地元警察も見回って声掛けして、結果、近辺の検挙率が上がるともいえる。

2010-2019の前半後半で、ブロック別の売春検挙率はどう変わったのか?明らかに全体量は半減。たいてい下がっていそうだが。

【SQL】

WITH count201014 AS(
 SELECT
 block, COUNT(*) AS count1014
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE
 date BETWEEN "2010-01-01" AND "2014-12-31"
 AND primary_type IN ('PROSTITUTION')
 AND date IS NOT NULL
 AND block IS NOT NULL
 GROUP BY block
 ORDER BY COUNT(*) DESC),
count201519 AS(
 SELECT
 block, COUNT(*) AS count1519
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE
 date BETWEEN "2015-01-01" AND "2019-12-31"
 AND primary_type IN ('PROSTITUTION')
 AND date IS NOT NULL
 AND block IS NOT NULL
 GROUP BY block
 ORDER BY COUNT(*) DESC)


SELECT block, count1014, count1519,
 CASE
 WHEN count1014 > count1519 THEN "↓"
 WHEN count1014 < count1519 THEN "↑▲"

 WHEN count1014 = count1519 THEN "="

 ELSE NULL END AS fluctuation

FROM count201014 JOIN count201519 USING(block)

 ORDER BY count1014 DESC


【流れ】

  1. WITH:前半(2010-2014)後半(2015-2019)それぞれでブロック別ランキング
  2. FROM:blockキーでJOIN
  3. 前半基準で並び替え
  4. 増加なら↑▲、減少なら↓、同じなら=を表示

ELSEでNULL入れてますが、条件で「block IS NOT NULL」入れてるから排除済み


【結果】

期間前半の件数でソートして、多い方から上位20位です。上がってるとこあるー。

全体の減少率と各ブロックの増減を判定する検定はどれだろう。対応あり量的データのノンパラ、ウィルコクソンの符号付き順位検定?材料(ブロック)は484種あるけど、手打ちなだけにダブってそうだ。ここは別途見たほうがいいか…。シカゴ警察のデータ入力方法が、プルダウンやツールで簡便化されてるとか知っていれば、システム変更以外のタイプミスを疑わなくてもいいんだけどな。

全体量が半減してる中、一定数増えてるのは明らか有意だと思う。が、これで減少率ベースに各ブロックを検定したとして、何が得られるのか。

統計検定を使う有無も判断せねば。つまみ程度の知識で統計を扱ってもいけない。顧客に説明するときに、ざっくり言うのか、統計検定の根拠が欲しいのか、増減が大事なのか、基準が欲しいのか、こういうとき相談できる人がいたら心強い。


全体量半減にもかかわらず5倍以上増えているブロック

メーンクエリに

WHERE ( count1519 / count1014 ) >= 5

を追加します。

【結果】

17箇所ありました。3-5行「W FIFTH AVE」の増え方やばない。周辺環境でも変わったんだろうか。10行目の「008XX S KOLMAR AVE」も90倍。国が違うからブロック表記の見方もピンとこない!落とし込むかな~。。



********


If you actually live there, you know the names of the blocks in the area. If you know that "this area is dangerous", you won't pass it on your way to and from school, you won't go near it, and you'll hear rumours about it from residents. In other words, because it's known, the local police will look around the area and talk to people, resulting in a higher arrest rate in the area.

How did the arrest rate for prostitution per block change in the first half of 2010-2019? It is clear that the total number has halved. So basically we would expect to see a decrease in all locations.


【SQL】

WITH count201014 AS(
 SELECT
 block, COUNT(*) AS count1014
 FROM `angular-radar-411310.chicago_crime_data.crime`
 WHERE
 date BETWEEN "2010-01-01" AND "2014-12-31"
 AND primary_type IN ('PROSTITUTION')
 AND date IS NOT NULL
 AND block IS NOT NULL
 GROUP BY block
 ORDER BY COUNT(*) DESC),

count201519 AS(

 SELECT

 block, COUNT(*) AS count1519

 FROM `angular-radar-411310.chicago_crime_data.crime`

 WHERE

 date BETWEEN "2015-01-01" AND "2019-12-31"

 AND primary_type IN ('PROSTITUTION')

 AND date IS NOT NULL

 AND block IS NOT NULL

 GROUP BY block

 ORDER BY COUNT(*) DESC)

SELECT block, count1014, count1519,

 CASE

 WHEN count1014 > count1519 THEN "↓"

 WHEN count1014 < count1519 THEN "↑▲"

 WHEN count1014 = count1519 THEN "="

 ELSE NULL END AS fluctuation

FROM count201014 JOIN count201519 USING(block)

 ORDER BY count1014 DESC


【Workflow】

  1. WITH: Ranking by block in the first half (2010-2014) and second half (2015-2019) respectively
  2. FROM: JOIN by block key
  3. Sort by first half criteria
  4. Display ↑▲ if increase, ↓ if decrease, = if same

NULLs are inserted in ELSE, but have already been eliminated because "block IS NOT NULL" is inserted in the condition.


【Result】

The top 20 are sorted by the number of cases in the first half of the period, from highest to lowest. Some places are up.

Which test determines the overall decrease and the increase or decrease in each block? Non-parametric Wilcoxon signed rank test for quantitative data with paired values? There are 484 different materials (blocks) but I am concerned about duplication as they are hand input.

If I knew, for example, that the Chicago Police Department's data entry method was simplified with pull-downs and tools, I wouldn't have to suspect typos other than a system change.

It's clearly significant difference that there are a certain number of increases while the total volume has halved. But what would it tell us if we tested each block based on the rate of decrease?

We also have to decide whether or not to use statistical tests. I don't think I should do a statistical test carelessly with beginner's knowledge.

When explaining to customers, do you want to say in broad terms, do you want the basis for a statistical test, is it important to increase or decrease, or want a standard?

It would be reassuring to have someone to talk to in these situations.


Blocks increasing more than fivefold despite halving overall volume

In the main query

WHERE ( count1519 / count1014 ) >= 5

is added to the main query.


【Result】

There were 17 locations, and the increase in lines 3-5, 'W FIFTH AVE', is amazing. Has the environment changed? Line 10, '008XX S KOLMAR AVE' has also increased 90 times.

I don't know how to see the block notation because I'm from another country. I wonder if I should put it in the map :)