Another Dropna Bug in Pandas
This year I fixed a few bugs that dealt with NaN values, and more specifically with the dropna function in Pandas. Each of these bugs made me more interested in fixing behaviour that is inconsistent across the code base, specifically how functions deal with missing values. Some functions have a dropna argument. When it is set to True, the function should drop columns/rows that contain missing values; when it is set to False, the function should keep them.
Currently, some functions deal well with NaN values. Others either behave incorrectly when they meet NaN or don't handle it the same way as the rest of the code base. This week I found another bug like this, and I will be working on it throughout next week.
Issue
The creator of the issue noticed inconsistent behaviour when using the dropna argument of the crosstab function. He shows how the value_counts function deals with NaN values based on the dropna argument, and states that crosstab should follow the same principle to make the code base more consistent.
I will use the following two Series objects to demonstrate the behaviour of value_counts and crosstab, and how crosstab should behave when the dropna argument is set to False:
import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])
The function value_counts returns the counts of unique values. By default, it does not count NaN. However, when the dropna argument is set to False, value_counts also counts NaN values:
x.value_counts(dropna=False)
The result shows that the x series contains three a, one b, and one NaN. This can be verified by looking at the original x series. Setting the dropna argument to False allows the NaN values to be counted, so the function returns the count of NaN as well.
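To make the comparison concrete, here is a minimal sketch that runs value_counts on the same x series with both settings of dropna. The variable names counts_default and counts_with_nan are my own, chosen only for illustration.

```python
import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])

# Default (dropna=True): the missing value is left out of the counts.
counts_default = x.value_counts()

# dropna=False: the missing value gets its own entry in the result.
counts_with_nan = x.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

The second result has one more entry than the first, and that extra entry is the count of the missing value.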
The function crosstab should exhibit the same behaviour because it can be seen as a two-dimensional value_counts. Crosstab returns a DataFrame containing a cross-tabulation of two or more factors, which shows how frequently each combination of values appears. In our case, it returns a cross-tabulation of x and y, which shows how often a and b occur together with c and d.
Here is the current crosstab call with the dropna argument set to False:
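As a quick illustration of what such a table looks like, the sketch below cross-tabulates the same two series with the default dropna setting. Pairs where either value is missing are left out, so only the three complete pairs are counted. The name table is just for illustration.

```python
import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])

# Rows come from x, columns from y; each cell counts how often
# that pair of values appears at the same position in both series.
table = pd.crosstab(x, y)
print(table)

# Individual cells can be read by label, e.g. how often 'a' meets 'd'.
print(table.loc['a', 'd'])
```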
pd.crosstab(x, y, dropna=False)
The resulting cross-tabulation table does not include any NaN values and only counts the frequency of a-c, a-d, b-c, and b-d. This is inconsistent with value_counts, which adds an entry for NaN when dropna is set to False. The expected result should follow the same format and include a NaN row and a NaN column: a-NaN and NaN-c should each have a count of 1, and every other NaN cell should be 0.
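Until crosstab handles this itself, one way to see the expected counts is to replace the missing values with a placeholder label before cross-tabulating. This is only a workaround sketch, not the fix for the bug: the 'NaN' used here is an ordinary string I chose as a label, not a real missing value, and the names placeholder and expected_counts are hypothetical.

```python
import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])

# Assumption: the string 'NaN' does not already occur in the data,
# so it can safely stand in for the missing values.
placeholder = 'NaN'

# With the gaps filled, every pair is complete and gets counted.
expected_counts = pd.crosstab(x.fillna(placeholder), y.fillna(placeholder))
print(expected_counts)
```

Because the placeholder is sorted like any other string label, the row and column order of this table may differ from the order used elsewhere in this post. The counts themselves are the same.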
Progress
As always, I started working on this bug by checking whether the behaviour still exists in the current release of Pandas. After verifying that it does, I experimented with and researched the functions involved in this bug to understand the problem better and make it easier to produce a solution.
After the research and exploration, I wrote a test for the expected functionality of the crosstab function with the dropna argument set to False. The test can be seen below:
def test_crosstab_dropna_NaN(self):
    # GH 10772
    x = pd.Series(['a', 'b', 'a', None, 'a'])
    y = pd.Series(['c', 'd', 'd', 'c', None])
    result = pd.crosstab(x, y, dropna=False)
    expected = pd.DataFrame([[1, 1, 1], [0, 1, 0], [1, 0, 0]],
                            index=pd.Index(['a', 'b', 'NaN'], name='row_0'),
                            columns=pd.Index(['c', 'd', 'NaN'], name='col_0'))
    tm.assert_frame_equal(result, expected)
Next week, I will use this knowledge and this test to help me come up with a solution to this bug and improve the consistency of the Pandas code base.
This is still a bug FYI :)