Pandas tricks - Aide Memoir
This is a small resource I have used to record functions that I commonly use and I also commonly forget in Pandas.
Dividing Values in the same row by one Another (pandas.DataFrame.apply)
For example if I am given the first two columns of the table below, and I want to calculate the amount spent per day (third column), dividing the total_amount by the number of days.
total_amount | days | amount_per_day |
---|---|---|
10 | 4 | 2.50 |
8 | 4 | 2 |
12 | 3 | 4 |
This is achieved using the apply method. You pass the apply method the function you wish to manipulate each row with. Lambda functions are particularly useful if the operation is simplistic, alternatively you can pass a function defined elsewhere as a parameter. The columns in both cases can be referred to by using the columns name as a series index to call the value for that column on a given row.
I have shown two examples below one using a lambda function to calculate the number of days for a transaction by dividing the total amount by the amount claimed per day. This is nice and short so a lambda function works very well. Though as the second example shows it is possible to also pass a function to apply to separate the row operations.
In both operations we are wanting to treat each row individually when applying the function therefore, axis=1 is set, the default is axis=0, which applies the function to each column.
df['amount_per_day '] = df.apply(lambda x : round(x['total_amount']/x['days']),axis=1)
Second example, same result just achieved by passing apply a function:
def average_expense(x):
av_exp = x['total_amount']/x['days']
return av_exp
df['amount_per_day'] = df.apply(average_expense, axis=1)
Filtering out non-numeric values from columns
If you have done much recuperation of data from old excel sheets you will find that they are often filled with mixed types. omething I come across commonly is people putting text into predominantly numerical columns for example in the order_id column of a spreadsheet a user may have put someones name ‘Joe Bloggs’ / ‘Error’ / ‘stock purchase’. Unfortunately types are not well controlled compared to databases. But excels especially in older companies contain lots of useful data. This means when you try to look in your SQL database or join tables or just try to insert your excel data into sql you will have lots of errors. Anyway this was frustrating me in Pandas and I used the below code snipet to solve my problems.
It requires passing a regex expression (in my case it checks the first and last character are numeric) and then the match method to return a bool depending on if the entry returns values from the regex expression.
index | order_id | amount |
---|---|---|
1 | 4000 | 2.50 |
2 | 4001 | 2.00 |
3 | missing | 4.00 |
num_filt = re.compile('^[0-9]*[0-9]$') # define regex filter
df_test['order_id'].str.match(reg) # This returns an expression with boolean as to whether regex statement is satisfied.
index | order_id —: | —: 1 | True 2 | True 3 | False
Therefore, to return only rows which contain valid order_id, you can filter the data table based on the above bool.
df_test[df_test['col1'].str.match(reg) == True] # Then the bool response is used to filter table rows.
index | order_id | amount |
---|---|---|
1 | 4000 | 2.50 |
2 | 4001 | 2.00 |
Converting data post groupby into bools for count
df.astype(bool).astype(int)