Pandas - with Python

I have been using, or rather trying to use, R for my statistical work - but because I spend 90% of my time in other languages I find the R way of thinking somewhat confusing.

I do, however, spend 60% of my time using Python, which I think is one of the nicest languages I have used to date (and the list is quite long).

Getting Started with Pandas

I think a beginner would be best served by using iPython and its excellent notebook facility. It makes things much easier.

I would also add that I am doing this with iPython3, which has a nicer notebook interface than iPython (2.7).

My final recommendation (and we have not got to Pandas yet!) is to install virtualenv ... oh, and git.

This walk-through, however, does not assume that you have these tools installed.

Install Pandas

There are several ways to do this: using your OS package manager, using pip, or installing into a virtualenv.

This is a rough guide to all of those ways.

sudo apt-get install python3-pandas
   OR
sudo pip install pandas
   OR
mkdir pandas_test
cd pandas_test
virtualenv test1
source test1/bin/activate
pip install pandas
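
Whichever route you take, it is worth checking that the interpreter you are about to use can actually see pandas. This is only a minimal sketch; the helper name pandas_available is my own and not part of pandas.

```python
import importlib.util


def pandas_available():
    """Return True if pandas can be imported by this interpreter."""
    return importlib.util.find_spec("pandas") is not None


if pandas_available():
    import pandas as pd
    print("pandas version:", pd.__version__)
else:
    print("pandas is not importable from this interpreter")
```

If this reports that pandas is missing inside a virtualenv, the usual cause is that the environment was not activated before running pip.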

CSV Time

Pandas helps you crunch data, so let's create some data.

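#!/usr/bin/python3
# makedata.py - write hourly sample counts for three words as CSV to stdout
import random
import datetime

secs = 1400000000                  # start time as a Unix timestamp (May 2014)
words = ['bill', 'tom', 'frank']

print("when,word,count")           # header row
for n in range(1, 365):            # day number
    for h in range(0, 24):         # hour of the day
        # one timestamp per hour, starting at secs
        reftime = datetime.datetime.fromtimestamp(secs + 3600 * (24 * (n - 1) + h))
        for w in words:
            # random whole-number count between 0 and 19, no stray spaces
            line = "{0},{1},{2}".format(
                reftime.strftime("%Y-%m-%d %H:%M"), w, int(random.random() * 20))
            print(line)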

To run this, save it as makedata.py and simply type

python3 makedata.py > timeline.csv
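
Before handing the file to pandas, a quick sanity check with the standard csv module can save some head-scratching. This is a rough sketch; summarize_csv is a name I made up. With the loops as written (364 days x 24 hours x 3 words) the file should hold 26208 data rows plus the header.

```python
import csv
import io
import os


def summarize_csv(fh):
    """Return (header, number_of_data_rows) for an open CSV file."""
    reader = csv.reader(fh)
    header = next(reader)               # first line is the header
    n_rows = sum(1 for _ in reader)     # count the remaining lines
    return header, n_rows


# Tiny in-memory demo so the sketch runs without the real file
demo = io.StringIO("when,word,count\n2014-05-13 20:53,bill,0\n")
print(summarize_csv(demo))

# Only touch timeline.csv if it has actually been generated
if os.path.exists("timeline.csv"):
    with open("timeline.csv", newline="") as fh:
        print(summarize_csv(fh))
```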

Get the CSV data into pandas

From now on I am assuming you are working in an interactive session such as iPython or IDLE.

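import pandas as pd
# note: recent pandas releases deprecate the dict form of parse_dates and keep_date_col
recs=pd.read_csv('timeline.csv',
header=0,                           # the file's first line is a header; names replaces it
names=['when','word','cnt'],
parse_dates={'datetime':['when']},
keep_date_col = True,
index_col='datetime')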

And we should have read in the data from the csv file.
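
If your pandas version complains about the date-parsing arguments, the same result can be reached in two plainer steps: read the file, then build the index yourself. The sketch below is one possible way, not the only one; load_timeline and SAMPLE_CSV are names I invented, and the sample rows stand in for timeline.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# A few rows in the same layout as timeline.csv (illustrative sample only)
SAMPLE_CSV = """when,word,count
2014-05-13 20:53,bill,0
2014-05-13 20:53,tom,18
2014-05-13 20:53,frank,19
2014-05-13 21:53,bill,6
"""


def load_timeline(source):
    """Read the CSV and index it by a parsed 'datetime' column."""
    # header=0 skips the header line in the file; names supplies our own labels
    recs = pd.read_csv(source, header=0, names=["when", "word", "cnt"])
    # keep 'when' as text and add a parsed copy to use as the index
    recs["datetime"] = pd.to_datetime(recs["when"].str.strip(),
                                      format="%Y-%m-%d %H:%M")
    return recs.set_index("datetime")


recs = load_timeline(io.StringIO(SAMPLE_CSV))
print(recs)
```

For the real data, pass 'timeline.csv' instead of the StringIO object.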

Check what columns we have

recs.columns.values

and something like this should be displayed

array(['when', 'word', 'cnt'], dtype=object)

We can then look at the dataframe (recs) by just typing recs

recs
                        when    word    cnt
datetime
2014-05-13 20:53        2014-05-13 20:53        bill    0
2014-05-13 20:53        2014-05-13 20:53        tom     18
2014-05-13 20:53        2014-05-13 20:53        frank   19
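
With thousands of rows, typing recs prints a lot. A gentler first look is to ask for the shape, the column types and just the first few rows. A small sketch, using a three-row stand-in for the real file (the sample rows are taken from the output above):

```python
import io

import pandas as pd

# Stand-in for timeline.csv so the snippet is self-contained
sample = io.StringIO(
    "when,word,cnt\n"
    "2014-05-13 20:53,bill,0\n"
    "2014-05-13 20:53,tom,18\n"
    "2014-05-13 20:53,frank,19\n"
)
recs = pd.read_csv(sample)
recs["datetime"] = pd.to_datetime(recs["when"])
recs = recs.set_index("datetime")

print(recs.shape)     # (number of rows, number of columns)
print(recs.dtypes)    # the type pandas chose for each column
print(recs.head(2))   # only the first two rows
```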

We have our data ... now what ?

This is often the point that the Programmer/Analyst starts to give up - and the Data Scientist starts to get excited.

  • We have data
  • It appears to all have been imported
  • What can we learn from this ?

Filter all the records by word

Instead of seeing all the records, I want to see only Bill's records.

Logically, you can express this as a comparison:

recs['word']=='bill'

And this will show something like

2014-05-13 20:53      True
2014-05-13 20:53     False
2014-05-13 20:53     False
2014-05-13 21:53      True
2014-05-13 21:53     False
2014-05-13 21:53     False
2014-05-13 22:53      True
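
That column of True/False values is an ordinary pandas Series, often called a mask. As a small sketch (the sample rows again stand in for the real file, and mask is just my variable name), you can inspect it and count the matches:

```python
import io

import pandas as pd

sample = io.StringIO(
    "when,word,cnt\n"
    "2014-05-13 20:53,bill,0\n"
    "2014-05-13 20:53,tom,18\n"
    "2014-05-13 20:53,frank,19\n"
    "2014-05-13 21:53,bill,6\n"
)
recs = pd.read_csv(sample)

mask = recs["word"] == "bill"   # boolean Series, one entry per row
print(mask.dtype)               # bool
print(mask.sum())               # True counts as 1, so this is the number of 'bill' rows
```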

But you are probably not interested in a column of True/False values - you want to see only Bill's records.

recs[recs['word'].isin(['bill'])]
                        when    word    cnt
datetime
2014-05-13 20:53        2014-05-13 20:53        bill    0
2014-05-13 21:53        2014-05-13 21:53        bill    6
2014-05-13 22:53        2014-05-13 22:53        bill    19
2014-05-13 23:53        2014-05-13 23:53        bill    16
2014-05-14 00:53        2014-05-14 00:53        bill    17
2014-05-14 01:53        2014-05-14 01:53        bill    13

This can be refined again to display just the 'cnt' column:

recs[recs['word'].isin(['bill'])]['cnt']
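
The chained form above works for reading. Another way to write it, which I find easier to follow, is to pick the rows and the column in one step with .loc. A minimal sketch on a few sample rows (not the real file):

```python
import io

import pandas as pd

sample = io.StringIO(
    "when,word,cnt\n"
    "2014-05-13 20:53,bill,0\n"
    "2014-05-13 20:53,tom,18\n"
    "2014-05-13 20:53,frank,19\n"
    "2014-05-13 21:53,bill,6\n"
)
recs = pd.read_csv(sample)

mask = recs["word"].isin(["bill"])   # same filter as before
chained = recs[mask]["cnt"]          # filter rows, then pick the column
direct = recs.loc[mask, "cnt"]       # rows and column selected together

print(direct.tolist())
print(chained.equals(direct))        # both forms select the same values
```

The .loc form also matters once you start assigning values, where the chained form can end up modifying a temporary copy.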