Here I present, in very simple terms, the data and the situation in Italy.
Italy was the first European country to be hit by the coronavirus
pandemic, so other Western countries can learn a lot from our experience.
So, what would be interesting to look at?
First off, where can we get reliable and up-to-date data about Italy? There are various sources, but the best one is the official one: https://github.com/pcm-dpc.
This data comes directly from the government and is updated every day around 6 PM (Rome time). If we clone the repo we can see something like this:
git clone https://github.com/pcm-dpc/COVID-19.git
tree COVID-19
COVID-19
├── aree  # Areas - here there are older shapes about the quarantined areas
│   ├── geojson
│   │   └── dpc-covid19-ita-aree.geojson
│   └── shp
│       ├── dpc-covid19-ita-aree.dbf
│       ├── dpc-covid19-ita-aree.prj
│       ├── dpc-covid19-ita-aree.shp
│       └── dpc-covid19-ita-aree.shx
├── CHANGELOG_EN.md
├── CHANGELOG.md
├── CODE_OF_CONDUCT_EN.md
├── CODE_OF_CONDUCT.md
├── dati-andamento-nazionale  # National trend data
│   ├── dpc-covid19-ita-andamento-nazionale-20200224.csv  # these are rows
...
│   ├── dpc-covid19-ita-andamento-nazionale-20200318.csv
│   └── dpc-covid19-ita-andamento-nazionale.csv  # this is the main data file
├── dati-json  # All data in JSON
│   ├── dpc-covid19-ita-andamento-nazionale.json  # national
│   ├── dpc-covid19-ita-province.json  # provinces
│   └── dpc-covid19-ita-regioni.json  # regions
├── dati-province  # Provinces trend
│   ├── dpc-covid19-ita-province-20200224.csv
...
│   ├── dpc-covid19-ita-province-20200318.csv
│   └── dpc-covid19-ita-province.csv
├── dati-regioni  # Regions trend
│   ├── dpc-covid19-ita-regioni-20200224.csv
...
│   ├── dpc-covid19-ita-regioni-20200318.csv
│   └── dpc-covid19-ita-regioni.csv
├── LICENSE
├── LICENSE.xmp
├── metadata  # This is for a dashboard created by the government
│   ├── covid-19-aree.xml
│   ├── covid-19-monitoraggio.xml
│   └── dpc-covid19-ita-andamento-nazionale.xml
├── README.md
└── schede-riepilogative  # These are just PDF summaries
    ├── province
    │   ├── dpc-covid19-ita-scheda-province-20200303.pdf
...
    │   └── dpc-covid19-ita-scheda-province-20200318.pdf
    └── regioni
        ├── dpc-covid19-ita-scheda-regioni-20200302.pdf
...
        └── dpc-covid19-ita-scheda-regioni-20200318.pdf
11 directories, 127 files
OK, first things first: Italy is a member state of the EU and has a somewhat complicated system of territorial government layers.
The largest colored "regions" are, well, Regions. So Lombardia is a region, and Varese is a province of Lombardia. Be aware that province != city: each province contains many cities and towns, but our data stops at the province level.
The next step is to take a look at the data to see what's in there. To follow along you can check my introductory post for the setup.
(require '[panthera.panthera :as pt])

(defonce provinces-url
  "https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv")

(defonce regions-url
  "https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")

(defonce provinces
  (pt/read-csv provinces-url))

(defonce regions
  (pt/read-csv regions-url))
Let's start with provinces:
user=> (pt/names provinces)
Index(['data', 'stato', 'codice_regione', 'denominazione_regione',
'codice_provincia', 'denominazione_provincia', 'sigla_provincia', 'lat',
'long', 'totale_casi'],
dtype='object')
We can define a mapping to make the column names more English- and Clojure-friendly:
;; Italian column names -> English, kebab-case keywords
(def mapper-provinces
  {:data                    :date
   :stato                   :state
   :codice_regione          :region-code
   :denominazione_regione   :region-name
   :codice_provincia        :province-code
   :denominazione_provincia :province-name
   :sigla_provincia         :province-abbreviation
   :lat                     :lat
   :long                    :lon
   :totale_casi             :cases})

(pt/rename provinces {:columns mapper-provinces})

;; Small helper so the same rename can be reused on other data frames
(defn renamer
  [df mapper]
  (pt/rename df {:columns mapper}))

(def provinces-renamed
  (renamer provinces mapper-provinces))
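To make the effect of the rename concrete without loading the CSV, here is a minimal plain-Clojure sketch that applies the same kind of mapping to a single row represented as a map. It uses `clojure.set/rename-keys` instead of panthera, only a subset of the columns, and a sample row whose values are made up for illustration.

```clojure
(require '[clojure.set :as set])

;; A subset of the Italian -> English column mapping used above.
(def mapper
  {:data                    :date
   :denominazione_provincia :province-name
   :totale_casi             :cases})

;; Hypothetical row; the values are invented for illustration only.
(def sample-row
  {:data                    "2020-03-18 17:00:00"
   :denominazione_provincia "Varese"
   :totale_casi             10})

;; rename-keys swaps every key found in the mapping and leaves the rest alone.
(def renamed-row (set/rename-keys sample-row mapper))

(println renamed-row)
```

The data frame version does the same thing column by column instead of key by key.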
We just changed the column names for provinces, which makes everything that follows easier. But let's see whether there are issues in the data.
user=> (-> (pt/subset-cols provinces-renamed :province-name) ; get cols by name
           pt/n-unique) ; get the count of unique values in that col
108
Well, that's wrong! There are 107 provinces in Italy! Let's take a look at the values in there.
user=> (pt/unique ; return unique values in a series
(pt/subset-cols provinces-renamed :province-name))
['Chieti' "L'Aquila" 'Pescara' 'Teramo'
'In fase di definizione/aggiornamento' 'Matera' 'Potenza' 'Bolzano'
'Catanzaro' 'Cosenza' 'Crotone' 'Reggio di Calabria' 'Vibo Valentia'
'Avellino' 'Benevento' 'Caserta' 'Napoli' 'Salerno' 'Bologna' 'Ferrara'
'Forlì-Cesena' 'Modena' 'Parma' 'Piacenza' 'Ravenna' "Reggio nell'Emilia"
'Rimini' 'Gorizia' 'Pordenone' 'Trieste' 'Udine' 'Frosinone' 'Latina'
'Rieti' 'Roma' 'Viterbo' 'Genova' 'Imperia' 'La Spezia' 'Savona'
'Bergamo' 'Brescia' 'Como' 'Cremona' 'Lecco' 'Lodi' 'Mantova' 'Milano'
'Monza e della Brianza' 'Pavia' 'Sondrio' 'Varese' 'Ancona'
'Ascoli Piceno' 'Fermo' 'Macerata' 'Pesaro e Urbino' 'Campobasso'
'Isernia' 'Alessandria' 'Asti' 'Biella' 'Cuneo' 'Novara' 'Torino'
'Verbano-Cusio-Ossola' 'Vercelli' 'Bari' 'Barletta-Andria-Trani'
'Brindisi' 'Foggia' 'Lecce' 'Taranto' 'Cagliari' 'Nuoro' 'Oristano'
'Sassari' 'Sud Sardegna' 'Agrigento' 'Caltanissetta' 'Catania' 'Enna'
'Messina' 'Palermo' 'Ragusa' 'Siracusa' 'Trapani' 'Arezzo' 'Firenze'
'Grosseto' 'Livorno' 'Lucca' 'Massa Carrara' 'Pisa' 'Pistoia' 'Prato'
'Siena' 'Trento' 'Perugia' 'Terni' 'Aosta' 'Belluno' 'Padova' 'Rovigo'
'Treviso' 'Venezia' 'Verona' 'Vicenza']
I can tell you that 'In fase di definizione/aggiornamento' (Italian for "being defined/updated") is not a province: it marks cases that, on that date, had not yet been attributed to a province.
(pt/filter-rows provinces-renamed
                (pt/eq (pt/subset-cols provinces-renamed :province-name)
                       "In fase di definizione/aggiornamento"))
With the code above we get all the rows with that value. Every day there's at least one such row. Most of the time its :cases count is 0, but sometimes it's not. I can tell you that the count gets added to the right province the day after, so we can just remove these rows.
(defn clean-data
  [df mapper]
  (as-> (renamer df mapper) d
    (pt/filter-rows d
                    (pt/ne (pt/subset-cols d :province-name)
                           "In fase di definizione/aggiornamento"))))

(def provinces-clean (clean-data provinces mapper-provinces))
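The cleaning step above can also be sketched in plain Clojure on a handful of rows, which makes it easy to see why the unique count drops by one once the placeholder is removed. This does not use panthera, the helper name `distinct-provinces` is hypothetical, and the rows and counts are invented for illustration.

```clojure
;; Placeholder label the dataset uses for cases not yet assigned to a province.
(def placeholder "In fase di definizione/aggiornamento")

;; Hypothetical rows; the counts are invented for illustration only.
(def rows
  [{:province-name "Varese"    :cases 10}
   {:province-name "Como"      :cases 4}
   {:province-name placeholder :cases 0}
   {:province-name "Varese"    :cases 12}])

(defn distinct-provinces
  "Counts the distinct :province-name values in a sequence of rows."
  [rs]
  (count (distinct (map :province-name rs))))

;; Keep only the rows whose province name is not the placeholder.
(def cleaned
  (remove #(= placeholder (:province-name %)) rows))

(println (distinct-provinces rows))    ; 3
(println (distinct-provinces cleaned)) ; 2
```

On the real data the same drop takes the count from 108 to the expected 107.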
Let's do the same with regions! I'll go faster here, since things are very similar to what we did above.
(def mapper-regions
  (merge
    mapper-provinces
    {:ricoverati_con_sintomi      :hospitalized
     :terapia_intensiva           :icu
     :totale_ospedalizzati        :tot-hospitalized
     :isolamento_domiciliare      :quarantined
     :totale_attualmente_positivi :tot-positives
     :nuovi_attualmente_positivi  :new-positives
     :dimessi_guariti             :recovered
     :deceduti                    :dead
     :tamponi                     :tests}))

(def regions-clean (renamer regions mapper-regions))
Now we can start analyzing data! For example:
(def regions-tests
  (-> (pt/subset-cols regions-clean :date :tests :new-positives)
      (pt/groupby :date)
      pt/sum))

(defn daily-tests []
  (-> regions-tests
      (pt/subset-cols :tests)
      pt/diff
      (pt/fill-na
        (first
          (pt/values
            (pt/subset-cols
              (pt/subset-rows regions-tests 1)
              :tests))))))

(as-> regions-tests r
  (pt/assign r {:daily-tests (daily-tests)})
  (pt/assign r {:new-by-test (pt/div (pt/subset-cols r :new-positives)
                                     (pt/subset-cols r :daily-tests))}))
In this way we get, for the whole country, the ratio of new positives to tests performed each day. The result is a fraction, so multiply by 100 for a percentage.
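The arithmetic behind that last step is easier to follow on a tiny example. The sketch below is plain Clojure, not panthera. It assumes, as the code above appears to do, that the first day's missing difference is filled with the first cumulative test count. The helper name `daily-from-cumulative` is hypothetical and the numbers are invented for illustration.

```clojure
;; Hypothetical national totals; the numbers are invented for illustration.
(def cumulative-tests [100 250 450])
(def new-positives    [10 30 50])

(defn daily-from-cumulative
  "Turns a running total into per-day values. The first day has no
  predecessor, so its value is the first cumulative figure itself."
  [totals]
  (map - totals (cons 0 totals)))

(def daily-tests (daily-from-cumulative cumulative-tests))

;; Fraction of each day's tests that correspond to new positives.
(def new-by-test (map / new-positives daily-tests))

(println daily-tests) ; (100 150 200)
(println new-by-test) ; (1/10 1/5 1/4)
```

Clojure keeps the results as exact ratios here, whereas the data frame version produces floating-point values. A day with zero recorded tests would need special handling in either version.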