Fixing a Couple Small Issues in Pandas

This week I managed to fix a couple of small issues in Pandas. One of them arose from behaviour observed in my previous Pull Request, where I ensured that the ordering of OrderedDicts, and of dicts on Python >= 3.6, was respected. The other was a simple fix I made while working on a separate issue, which didn't seem trivial at first but turned out to be a small mistake in the code. The fix itself was easy; finding where the problem originated took most of the time. I will describe both fixes in more detail below.

Replacing Dicts with OrderedDicts in Aggregation Functions of Groupby:

While working on a previous Pull Request, which ensured that the ordering of dicts and OrderedDicts was respected across different versions of Python, we came across an existing test that failed because of my changes. The failure was caused by inconsistent column ordering in one of groupby's aggregation functions, which used a regular dict together with a concat call. With my changes, the order of that dict differs by Python version: dict keys were effectively sorted on Python 3.5 and below, while insertion order is maintained on Python 3.6 and above. As a result, the expected output of the test depended on the Python version, and my first attempt at a fix was a check in the test that switched the expected column order accordingly.
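The version dependence is easy to see in plain Python. The snippet below is a small illustration of my own, not pandas code; it shows the insertion-order guarantee that newer interpreters provide and older ones did not:

```python
from collections import OrderedDict

# On Python 3.7+ (and CPython 3.6), a plain dict preserves insertion
# order, so the keys come back in the order they were added. On Python
# 3.5 and below that order was arbitrary, which is why pandas results
# could differ between versions.
plain = {'b': 1, 'a': 2}
ordered = OrderedDict([('b', 1), ('a', 2)])

print(list(plain))    # ['b', 'a'] on Python 3.6+
print(list(ordered))  # ['b', 'a'] on every Python version
```

Using OrderedDict sidesteps the question entirely, which is why it was the tool of choice for making output consistent across versions.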

I went back and forth with the maintainers trying to find a way to fix the problem without checking the Python version in the test, so that the output would be consistent across all versions. After some failed ideas, we concluded that we should simply change the dict to an OrderedDict in the aggregation function the test was using, which guarantees a consistent column order everywhere. Once this was fixed, the maintainers decided that dicts should be changed to OrderedDicts in all aggregation functions of groupby and opened a separate issue for it.

Since I had already been working on this, I decided to continue with the new issue. For it, I changed the dicts to OrderedDicts in the aggregation functions of groupby and wrote a test to ensure that the column order is consistent across all versions.
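As a sketch of the kind of change involved (with made-up column names, not the actual pandas internals), passing an OrderedDict of functions to agg pins the column order of the result on every Python version:

```python
from collections import OrderedDict

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3], 'C': [4, 5, 6]})

# With an OrderedDict, the result columns follow the dict's insertion
# order ('C' before 'B') on every Python version; a plain dict only
# guarantees this on Python 3.6 and above.
funcs = OrderedDict([('C', 'max'), ('B', 'min')])
result = df.groupby('A').agg(funcs)

print(list(result.columns))  # ['C', 'B']
```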


Although these changes seem trivial, it took some time for us to conclude that they were the right approach. Below is the test that checks the consistency of the column order:

def test_order_aggregate_multiple_funcs():
    # GH 25692
    df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
    res = df.groupby('A').agg(['sum', 'max', 'mean', 'ohlc', 'min'])
    result = res.columns.levels[1]
    expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min'])
    tm.assert_index_equal(result, expected)


Respecting the Dropna Argument When Set to False in Pivot_Table:

The creator of the issue noticed a bug when creating a pivot table with the dropna argument set to False. The pivot_table function in Pandas creates a spreadsheet-like pivot table and stores it as a DataFrame. A pivot table summarizes the data in a more flexible form, where you can specify how to aggregate it, for example with sums or averages. The dropna argument specifies whether to drop columns whose entries are all NaN. By default dropna is True, and such columns are dropped. When a user sets the argument to False, the pivot table should include the columns with NaN values; however, the creator of the issue found that they were still being dropped.
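A minimal illustration of the intended behaviour, using made-up data: an aggregation that produces only NaN for one column should survive when dropna=False and disappear when dropna=True.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'peach', 'apple'],
                   'size': [1.0, None, 2.0]})

# 'peach' has only a missing size, so its aggregated column is all-NaN.
kept = pd.pivot_table(df, columns='fruit', values='size',
                      aggfunc='mean', dropna=False)
dropped = pd.pivot_table(df, columns='fruit', values='size',
                         aggfunc='mean', dropna=True)

print(list(kept.columns))     # ['apple', 'peach']
print(list(dropped.columns))  # ['apple']
```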

This issue is related to one of my previous Pull Requests, where I worked on the dropna function, which drops specified rows or columns based on NaN values. Because of that connection, I decided I could apply the knowledge I had gained there. At first this seemed like an issue that would require a substantial amount of work, because one of the functionalities of pivot_table did not work at all. In the end, however, the fix was a lot simpler than I thought.

As always, I first wrote a test to ensure that the dropna argument is respected in pivot_table. The following test checks that when dropna is set to False, the returned DataFrame contains the columns with NaN values.

def test_pivot_table_aggfunc_dropna(self):
    # GH 22159
    df = pd.DataFrame({'fruit': ['apple', 'peach', 'apple'],
                       'size': [1, 1, 2],
                       'taste': [7, 6, 6]})

    def ret_one(x):
        return 1

    def ret_sum(x):
        return sum(x)

    def ret_none(x):
        return None

    df2 = pd.pivot_table(df, columns='fruit',
                         aggfunc=[ret_sum, ret_none, ret_one],
                         dropna=False)

    data = [[3, 1, None, None, 1, 1], [13, 6, None, None, 1, 1]]
    col = pd.MultiIndex.from_product([['ret_sum', 'ret_none', 'ret_one'],
                                      ['apple', 'peach']],
                                     names=[None, 'fruit'])
    df3 = pd.DataFrame(data, index=['size', 'taste'], columns=col)

    tm.assert_frame_equal(df2, df3)


The most time-consuming part of the test was recreating the expected DataFrame. It is much easier to produce a DataFrame using the built-in functionality of Pandas, like pivot_table, than to construct one by hand in the fewest possible steps and lines of code. But doing so gave me more experience with DataFrames, and I now understand how to build complex DataFrames manually, which will be useful in the future.

Once I had the test, I used it to track down where the problem originated. It took some time and investigation, but when I found the culprit, I couldn't believe that such a simple mistake was causing the issue. The test above passes a list to pivot_table's aggfunc argument so that multiple functions can summarize the data. To handle this, pivot_table loops through the functions and calls itself for each one. In that recursive call, the developer had simply forgotten to pass the dropna argument along, so it always fell back to its default of True. The fix was to add the argument to the function call.
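The bug pattern is easy to reproduce in miniature. The toy function below is my own illustration, not the actual pandas source: it recurses over a list of functions, and the comment marks where the buggy version forgot to forward dropna, silently resetting it to the default.

```python
def summarize(values, funcs, dropna=True):
    """Toy stand-in for pivot_table: apply each func, optionally dropping None."""
    if isinstance(funcs, list):
        # The buggy version was analogous to calling summarize(values, f)
        # here, without dropna=dropna -- so inner calls always used the
        # default of True, no matter what the caller asked for.
        return [r for f in funcs
                for r in summarize(values, f, dropna=dropna)]
    result = funcs(values)
    if dropna and result is None:
        return []
    return [result]

print(summarize([1, 2], [sum, lambda v: None], dropna=False))  # [3, None]
print(summarize([1, 2], [sum, lambda v: None], dropna=True))   # [3]
```

Forgetting to forward a keyword argument in a recursive call is an easy mistake to make, and because the default value is valid, nothing crashes; the option is just silently ignored.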


This reminded me of the assignments I do for my Software Development degree. Sometimes I make a simple mistake like this that takes a while to track down. It is surprising that a project as big and popular as Pandas would contain such a simple mistake, but it was also reassuring: it showed me that we are all human, and making a simple mistake now and then doesn't make us bad software developers.






