This is a post to help you get started working with COVID-19 data using panthera.
The first thing we have to do is set up our environment correctly. To make this work we need Java, Clojure, and conda.
I'll assume that Java and Clojure are already working as expected, so the next step is to get conda. If you're on Windows the only way to make this work is with Docker.
After installing conda you can create a new environment (note that conda is a generic package manager: it works for everything, not only Python):
conda create -n covid python=3.6 numpy pandas
source activate covid
Now you can add dependencies to your Clojure project (I'll assume deps.edn, but it works the same with Leiningen):
{:deps
{org.clojure/clojure {:mvn/version "1.10.1"}
panthera {:mvn/version "0.1-alpha.18"}}}
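For Leiningen, the same two dependencies would go in project.clj. This is only a minimal sketch: the project name covid-demo and its version string are placeholders I made up, and only the two dependency coordinates come from the deps.edn above.

```clojure
;; project.clj
;; covid-demo and "0.1.0-SNAPSHOT" are hypothetical placeholders;
;; the dependency versions mirror the deps.edn map above.
(defproject covid-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [panthera "0.1-alpha.18"]])
```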
Everything's ready to run! Launch a REPL and require panthera:
user=> (require '[panthera.panthera :as pt])
user=> (def covid-url "https://bit.ly/2TYyDtg")
user/covid-url
user=> (def covid
(pt/read-csv covid-url))
user/covid
user=> (pt/names covid)
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
'1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
'1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
'2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
'2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
'2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
'2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
'3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20',
'3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
'3/15/20', '3/16/20', '3/17/20'],
dtype='object')
user=> (pt/melt covid {:id-vars ["Province/State" "Country/Region" "Lat" "Long"]
:var-name :date})
Province/State Country/Region Lat Long date value
0 NaN Thailand 15.0000 101.0000 1/22/20 2
1 NaN Japan 36.0000 138.0000 1/22/20 2
2 NaN Singapore 1.2833 103.8333 1/22/20 0
3 NaN Nepal 28.1667 84.2500 1/22/20 0
4 NaN Malaysia 2.5000 112.5000 1/22/20 0
... ... ... ... ... ... ...
25755 Cayman Islands United Kingdom 19.3133 -81.2546 3/17/20 1
25756 Reunion France -21.1351 55.2471 3/17/20 9
25757 NaN Barbados 13.1939 -59.5432 3/17/20 2
25758 NaN Montenegro 42.5000 19.3000 3/17/20 2
25759 NaN The Gambia 13.4667 -16.6000 3/17/20 1
[25760 rows x 6 columns]
And now you're basically ready to go! Please note that the shortened link points to this repo, where you can find more datasets about the disease, all updated at least daily.
If you're wondering what exactly happened in the REPL above, keep reading. What we did was read a CSV file into a data-frame, which is a tabular data representation. The first thing we checked was the column names with (pt/names covid), and from that alone we can see that the data has one column for every day.
This isn't optimal, because it's easier to work with dates as categories in a single column than as separate columns. So we melt the data-frame, reshaping it from a wide format to a long format. Now you can go about grouping, adding, reducing, etc. however you want!
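To make the wide-to-long idea concrete, here is a small sketch in plain Clojure, with no panthera or Python involved. It is not how panthera or pandas implement melt; the helper names melt-rows and total-by and the tiny two-country table are made up for illustration, and the row order differs from what pandas returns.

```clojure
;; Wide format: one map per place, one key per day.
;; The countries and counts are invented for this example.
(def wide
  [{"Country/Region" "A" "1/22/20" 2 "1/23/20" 3}
   {"Country/Region" "B" "1/22/20" 0 "1/23/20" 1}])

(defn melt-rows
  "Hypothetical helper: turns every non-id column of each row
  into its own row with :date and :value keys."
  [id-vars rows]
  (for [row rows
        [k v] (apply dissoc row id-vars)]
    (assoc (select-keys row id-vars) :date k :value v)))

;; Long format: one map per (place, day) pair, four rows in total.
(def long-rows (melt-rows ["Country/Region"] wide))

(defn total-by
  "Hypothetical helper: sums :value for each distinct value of key k."
  [k rows]
  (reduce (fn [acc row]
            (update acc (get row k) (fnil + 0) (:value row)))
          {}
          rows))

(total-by "Country/Region" long-rows)
;; => {"A" 5, "B" 1}

(total-by :date long-rows)
;; => {"1/22/20" 2, "1/23/20" 4}
```

Once the dates are values instead of column names, grouping by country or by day is the same one-line call with a different key, which is what makes the long format convenient.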
Don't hesitate to get in touch with any questions! The easiest way is through the GitHub repo.
You can find a working example on Nextjournal.