Working with Incomplete MultiIndex keys in Pandas

Incomplete MultiIndex keys in Dropna


This week I produced a Pull Request for the bug in Pandas that I discussed last week. Although I think the solution that I provided can be further improved, it does fix the immediate issue of incomplete MultiIndex keys in the subset parameter of the "dropna()" function. I submitted a Pull Request to gather some feedback on my solution and improve it based on the moderator's discretion.

Overview of the Issue

The creator of the issue brought to attention a bug that does not allow MultiIndex keys to be incomplete in the subset parameter of the "dropna()" function. 

DataFrame.dropna(...) function removes missing values.
If you pass in a "subset=..." parameter into it, it will remove rows based on the specified list of columns that are listed in the subset.

I created some sample code, following the examples that the moderator gave and here are my observations:

Sample Code:
_____________________________________________


Running this produces an error:
_____________________________________________



Expected Output:
_____________________________________________


Solution


In my last blog I stated that I started my attempts to fix this bug by writing a unit test that will ensure that incomplete MultiIndex keys will be able to be passed into the subset parameter of the "dropna()" function. This week I improved this test and added parameterization to it to check for different test cases to ensure better test results.

Updated Test:

@pytest.mark.parametrize('input_vals', [
('other', '', ''), ('I', 'a', 'x')
])
def test_dropna_subset(self, input_vals):
# GH 17737
col = pd.MultiIndex.from_product([['I', 'II'],
['a', 'b'],
["x", "y"]])
df = pd.DataFrame(index=range(3), columns=col)
df['other'] = (1, 3, np.nan)

result = df.dropna(subset=[input_vals[0]])
expected = df.dropna(subset=[input_vals])

tm.assert_frame_equal(result, expected)


In addition to adding parameterization to the test, I also expanded the levels of the MultiIndex to ensure that an incomplete key will be functional with a MultiIndex that contains 2 or more levels. Previously, "dropna()" function was only able to remove missing values if a user specified a complete MultiIndex key, for example subset=[('I', 'a', 'x')]. However now, a user can just specify subset=['I'] and missing values will be removed based on all other MultiIndex levels that are below 'I'.

Expanded MultiIndex (3 levels):


    I                                     II                              other
    a                 b                  a                 b           
    x        y       x        y       x        y        x        y      
0  1  NaN  NaN  NaN  NaN  NaN  NaN  NaN   1.0
1  3  NaN  NaN  NaN  NaN  NaN  NaN  NaN   3.0
2  4  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN


With an improved test, I was able to produce a working solution that enabled incomplete MultiIndex keys to be passed into the "dropna()" function. I achieved this through adding the following code to the "dropna()" function of the DataFrame:

if isinstance(ax, MultiIndex):
if (len(subset[0]) < ax.nlevels and isinstance(subset[0], tuple)):
for _ in range(ax.nlevels - len(subset[0])):
ax = ax.droplevel(ax.nlevels - 1)
elif ax.nlevels > 1 and isinstance(subset[0], str):
for _ in range(ax.nlevels - 1):
ax = ax.droplevel(ax.nlevels - 1)


In this code, I check if the DataFrame has MultiIndex keys. I then check if the length of the subset element that is passed into the "dropna()" function has a shorter length than the levels of the MultiIndex or if it consists a string, which means one level. I then drop the levels that don't exist in the subset, which will allow the "dropna()" function to go over every column with the incomplete MultiIndex key.

For example, If we take the DataFrame created in the test above and perform the following:


df.dropna(subset=['I'])


The MultiIndex of the DataFrame will be transformed into the following:



Index(['I', 'I', 'I', 'I', 'II', 'II', 'II', 'II', 'other'], dtype='object')


Now, the "dropna()" function will be able to go through all the columns with "I" and remove rows based on missing values in these columns. My changes don't affect the original DataFrame, so it is safe to use and will not affect the MultiIndex of the original DataFrame.





Comments

Popular posts from this blog

First Enhancement in Pandas

Progress in Open Source