data frames

Here are some good Dataframe examples It is using the Movie-Lens data set

import pandas as pd
import numpy as np

# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('u.user', sep='|', names=u_cols)
#2nd Data Set
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('u.item2', sep='|', names=m_cols, usecols=range(5))
#Show 2 Columns
users[['age', 'zip_code']].head()
age zip_code
0 24 85711
1 53 94043
2 23 32067
3 24 43537
4 33 15213
#can also store in a variable to use later
columns_you_want = ['occupation', 'sex'] 
users[columns_you_want].head()
occupation sex
0 technician M
1 other F
2 writer M
3 technician M
4 other F
# users younger than 30 OR female
users[(users.sex == 'F') | (users.age < 30)].head(3)
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
# users younger than 30 AND female
users[(users.sex == 'F') & (users.age < 30)].head(3)
user_id age sex occupation zip_code
11 12 28 F other 06405
23 24 21 F artist 94533
31 32 28 F student 78741
users.set_index('user_id').head()
age sex occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
#Set and Index and return new Dataframe
with_new_index = users.set_index('user_id')
#Inplace
#Note this looses the pandas index
users.set_index('user_id', inplace=True)
users.head()
age sex occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
#Access using the user-id which is the index....
users.ix[[1, 50, 300]]
age sex occupation zip_code
user_id
1 24 M technician 85711
50 21 M writer 52245
300 26 F programmer 55106
#Reset the index back
users.reset_index(inplace=True)
users.head()
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})

INNER JOIN

pd.merge(left_frame, right_frame, on='key', how='inner')
key left_value right_value
0 2 c f
1 3 d g
2 4 e h

left outer join

#Add Values from Right_Frame if they exists
pd.merge(left_frame, right_frame, on='key', how='left')
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h

right outer join

#Merge using the right-frame as the master... if nothing in left then left is NaN
pd.merge(left_frame, right_frame, on='key', how='right')
key left_value right_value
0 2 c f
1 3 d g
2 4 e h
3 5 NaN i
4 6 NaN j

full outer join

#kept everything from both frames
pd.merge(left_frame, right_frame, on='key', how='outer')
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c f
3 3 d g
4 4 e h
5 5 NaN i
6 6 NaN j

Combining

#Combine the two data frames
pd.concat([left_frame, right_frame])
key left_value right_value
0 0 a NaN
1 1 b NaN
2 2 c NaN
3 3 d NaN
4 4 e NaN
0 2 NaN f
1 3 NaN g
2 4 NaN h
3 5 NaN i
4 6 NaN j

Change the axis

pd.concat([left_frame, right_frame], axis=1)
key left_value key right_value
0 0 a 2 f
1 1 b 3 g
2 2 c 4 h
3 3 d 5 i
4 4 e 6 j

Joining 2 Data Frame Selections

I want to make a new dataframe (I could do df[df.col==X]) but it gets unweidly at times

#5000 Records of each to start with
ot=validated[validated.ID='AA'][:5000]
oo=validated[validated.ID='ZZ'][:5000]
frames=[ot,oo]
new_df = pd.concat(frames)

Query

I can select just people called 'Tim' like this

 just_tim=val[val.NAME='Tim']

Or use the query method

 just_tim=val.query("Name"==["Tim"])
 #or
 just_tim=val.query("Name"==["Tim,"Timothy"])

With Numbers

Initially I tried

import numexpr 
df.query("Found" > 1)

And that fails badly

After some reading it can be done using

import numexpr df.query('Found >1')

Create Dataframe from 3 lists

import pandas as pd

d=pd.DataFrame({'Name': ["tim","bill","bob"],
     'Bill': [10,20,30],
     'Salary':[10,15,20] 
    })
 ```

#group by

```python
#i want average age so remove all the age==0
d2=data[data.age>0]
#now pick only 3 column
# age pclass and sex_code
d3=d2[['age','pclass','sex_code']]
#Now group by and apply mean function
d3.groupby(by=['pclass','sex_code']).mean()


#in 1 line it is ....
data[data.age>0][['age','pclass','sex_code']].groupby(by=['pclass','sex_code']).mean()

Complex Updating of data

I want to set the value of a group of 3 values (age,pclass and sex_code) to a specific value.

Alas the expression

data[(data.pclass==1)&(data.sex_code==0)&(data.age==0)]['Age']=37

Is not acceptable....

Get the Index of the data

We need to just have the index of the matching data

idx=data[(data.pclass==1)&(data.sex_code==0)&(data.age==0)].index

You can view the index using

idx

Setting 1 item at a time (not a good idea)

for i in idx:
    print("Setting Index {}".format(i))
    data.loc[i,'age']=37

Setting all at the same time

    data.loc[idx,'age']=37

Example

Untitled

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_HTML"></script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [ ['$','$'], ["\\(","\\)"] ],
        displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
        processEscapes: true,
        processEnvironments: true
    },
    // Center justify equations in code and markdown cells. Elsewhere
    // we use CSS to left justify single line equations in code cells.
    displayAlign: 'center',
    "HTML-CSS": {
        styles: {'.MathJax_Display': {"margin": 0}},
        linebreaks: { automatic: true }
    }
});
</script>
<!-- End of mathjax configuration --></head>

In [4]:
import pandas as pd
In [27]:
d={'A':['aaa','bbb','ccc'],'B':[1,2,3]}
dfc = pd.DataFrame(d)
dfc
Out[27]:
A B
0 aaa 1
1 bbb 2
2 ccc 3
In [20]:
dfc[(dfc.A=='aaa')&(dfc.B==1)]['B']=4
/Users/timothyhseed/pe36/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [31]:
dfc.loc[0,'A']=44
In [32]:
dfc
Out[32]:
A B
0 44 1.0
1 bbb 2.0
2 ccc 3.0
0 44 NaN
In [ ]:
 
</div>

Data From from a list

data=[[1,2,3,4,5,6,7,8,9],[1,1,1,1,2,2,2,2],['A','B','C','D','B','C','D','E']]

import pandas as pd
import numpy as np

df=pd.DataFrame(data).transpose()
    0   1   2
0   1   1   A
1   2   1   B
2   3   1   C
3   4   1   D
4   5   2   B
5   6   2   C
6   7   2   D
7   8   2   E
8   9   NaN NaN