March 18, 2020

Getting started with panthera and COVID-19

This is a post to help you get started working with COVID-19 data using panthera.

The first thing we have to do is set up our environment correctly. To make this work we need:

Java >= 1.8
Clojure (leiningen or deps)
Python >= 3.6
conda (or Docker)
numpy and pandas

I'll assume that Java and Clojure are already working as expected, so the next step is to get conda. If you're on Windows the only way to make this work is with Docker.

After you have installed conda, you can create a new environment (note that conda is a general-purpose package manager, so it isn't limited to Python packages):

conda create -n covid python=3.6 numpy pandas
conda activate covid
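
Before moving on, it can be worth confirming that the active environment really has what panthera needs. The following is only a minimal sketch using the Python standard library; the function name check_environment and the report layout are my own invention for illustration and are not part of panthera or conda:

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 6)                # minimum version from the list above
REQUIRED_MODULES = ("numpy", "pandas")  # packages installed with conda


def check_environment(min_version=REQUIRED_PYTHON, modules=REQUIRED_MODULES):
    """Return a dict saying whether the interpreter and each module are usable."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_version)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print("{}: {}".format(item, "ok" if ok else "MISSING"))
```

Run it with the Python interpreter of the covid environment; anything reported as MISSING needs to be installed before starting the REPL.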

Now you can add the dependencies to your Clojure project (I'll assume deps.edn, but it works the same way with leiningen):

{:deps
 {org.clojure/clojure {:mvn/version "1.10.1"}
  panthera            {:mvn/version "0.1-alpha.18"}}}

Everything's ready to run! Launch a REPL and require panthera:

user=> (require '[panthera.panthera :as pt])
user=> (def covid-url "https://bit.ly/2TYyDtg")
#'user/covid-url
user=> (def covid
         (pt/read-csv covid-url))
#'user/covid
user=> (pt/names covid)
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
       '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
       '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
       '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
       '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
       '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20',
       '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20',
       '3/15/20', '3/16/20', '3/17/20'],
      dtype='object')
user=> (pt/melt covid {:id-vars ["Province/State" "Country/Region" "Lat" "Long"]
                       :var-name :date})
       Province/State  Country/Region      Lat      Long     date  value
0                 NaN        Thailand  15.0000  101.0000  1/22/20      2
1                 NaN           Japan  36.0000  138.0000  1/22/20      2
2                 NaN       Singapore   1.2833  103.8333  1/22/20      0
3                 NaN           Nepal  28.1667   84.2500  1/22/20      0
4                 NaN        Malaysia   2.5000  112.5000  1/22/20      0
...               ...             ...      ...       ...      ...    ...
25755  Cayman Islands  United Kingdom  19.3133  -81.2546  3/17/20      1
25756         Reunion          France -21.1351   55.2471  3/17/20      9
25757             NaN        Barbados  13.1939  -59.5432  3/17/20      2
25758             NaN      Montenegro  42.5000   19.3000  3/17/20      2
25759             NaN      The Gambia  13.4667  -16.6000  3/17/20      1

[25760 rows x 6 columns]

And now you're basically ready to go! Please note that the shortened link points to this repo where you can find more datasets about the disease, all updated at least daily.

If you're wondering what exactly happened in the REPL above, keep reading. We read a CSV file into a data-frame, which is a tabular data representation. The first thing we checked was the column names, with (pt/names covid), and that alone shows that the data is stored with one column per day.
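
To make the "one column per day" layout concrete, here is a small standard-library Python sketch that separates the identifier columns from the per-day columns and parses the day labels into real dates. The sample CSV is hypothetical: it only mimics the header layout shown above, and the place names and counts are invented. The helper names split_columns and parse_day are mine too:

```python
import csv
import io
from datetime import datetime

# Hypothetical sample with the same header layout; places and counts are invented
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Country A,10.0,20.0,0,1,3
Region X,Country B,-5.5,30.25,2,2,4
"""

ID_COLS = ("Province/State", "Country/Region", "Lat", "Long")


def split_columns(header, id_cols=ID_COLS):
    """Separate the identifier columns from the per-day columns."""
    ids = [c for c in header if c in id_cols]
    days = [c for c in header if c not in id_cols]
    return ids, days


def parse_day(label):
    """Turn a header label such as '1/22/20' into a date object."""
    return datetime.strptime(label, "%m/%d/%y").date()


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
ids, days = split_columns(reader.fieldnames)
print(ids)
print([parse_day(d).isoformat() for d in days])
```

The point is simply that the dates live in the header rather than in the rows, so they are not available as data until the table is reshaped.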

This isn't optimal, because it's easier to work with dates as values in a single column than as separate columns. So we melt the data-frame, reshaping it from a wide format to a long format. Now you can go about grouping, adding, reducing, and so on, however you want!
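
If the wide-to-long idea is new to you, this plain Python sketch shows roughly what a melt does and why the long format makes grouping easy. It is not how pandas or panthera implement it, just an illustration with standard-library code; the functions melt and total_by are hypothetical helpers, and the two sample rows use invented places and counts:

```python
from collections import defaultdict

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]


def melt(rows, id_cols, var_name="date", value_name="value"):
    """Reshape wide rows (one key per day) into long rows (one row per day)."""
    if not rows:
        return []
    # Every key that is not an identifier is treated as a day column.
    # This relies on dicts keeping insertion order.
    value_cols = [c for c in rows[0] if c not in id_cols]
    long_rows = []
    # Go column by column so the result is ordered by date first
    for col in value_cols:
        for row in rows:
            record = {c: row[c] for c in id_cols}
            record[var_name] = col
            record[value_name] = row[col]
            long_rows.append(record)
    return long_rows


def total_by(long_rows, key="date", value_name="value"):
    """Sum the values of all long rows that share the same key."""
    totals = defaultdict(int)
    for record in long_rows:
        totals[record[key]] += record[value_name]
    return dict(totals)


# Invented sample data, only the column layout matches the real file
wide = [
    {"Province/State": None, "Country/Region": "Country A",
     "Lat": 10.0, "Long": 20.0, "1/22/20": 2, "1/23/20": 3},
    {"Province/State": "Region X", "Country/Region": "Country B",
     "Lat": -5.5, "Long": 30.25, "1/22/20": 0, "1/23/20": 1},
]

long_rows = melt(wide, ID_COLS)
print(len(long_rows))       # 2 places x 2 days = 4 rows
print(total_by(long_rows))  # {'1/22/20': 2, '1/23/20': 4}
```

Once every observation is its own row, a total per day is just a group-and-sum over the date column, which is exactly the kind of operation the long format is meant for.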

Don't hesitate to get in touch with any questions! The easiest way is through the GitHub repo.

You can find a working example on Nextjournal.

Tags: coronavirus panthera
