Solving the Next Bug in Pandas

February 15, 2019

This week I went on another search for an issue in Pandas. I used the same tactic that I started using this year: finding and solving older bugs that nobody has fixed yet. There are over 2,800 issues in Pandas, and it takes a while to find a bug that you're interested in working on, but when you fix it, you get a big feeling of satisfaction and accomplishment. Before we dive into the new bug, here is a quick update on the bug I was working on last week.

Update on Last Week's Bug

Issue: https://github.com/pandas-dev/pandas/issues/21510
Pull Request: https://github.com/pandas-dev/pandas/pull/25224

The moderator feedback I had previously read on the issue, about the ordering of regular dictionaries in Python 3.6 and 3.7, turned out to be outdated. Previously, I stated that the ordering of regular dictionaries in Python 3.6 is an implementation detail of CPython that users should not rely on, and that it only becomes a language feature in Python 3.7. Although this statement is still correct, Pandas ensures that users can rely on the ordering of regular dictionaries in Python 3.6 as well.
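The version-dependent ordering described above can be sketched in a few lines. This is a simplified illustration, not the actual Pandas code: the helper name ordered_keys is hypothetical, and the fallback to sorted keys on older Python versions is an assumption based on the tests that previously used sorted.

```python
import sys
from collections import OrderedDict


def ordered_keys(mapping):
    """Return the keys of a mapping in a predictable order.

    Hypothetical helper for illustration only.
    """
    # On Python 3.6 and above, or for an OrderedDict on any version,
    # insertion order is kept as-is.
    if sys.version_info >= (3, 6) or isinstance(mapping, OrderedDict):
        return list(mapping.keys())
    # Assumed fallback for older versions: sort the keys so the result
    # does not depend on arbitrary dictionary ordering.
    return sorted(mapping)


keys = ordered_keys({"b": 1, "a": 2})
ordered_dict_keys = ordered_keys(OrderedDict([("z", 0), ("y", 1)]))
```

On Python 3.6 and above, both calls return the keys in the order they were inserted.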

I received this update on my Pull Request, along with a request to add a line of documentation to the change log to reflect my fix. I updated my code and ensured that the ordering of dictionaries and OrderedDicts is preserved in Python 3.6 and above. The following are my changes:

Changed the dict_keys_to_ordered_list function to use >= Python 3.6:

Ensured that the tests that previously used sorted now use the new function:

Added a line of documentation to the change log:

Included a check for the Python version and adjusted the expected result in one of the tests:

In addition to ensuring that everything is functioning with Python 3.6 and above, I also had to work on one of the tests that was failing due to my change. In this test, one of the functions uses concat, meaning that the expected result will be different based on the version of Python. However, there is only one expected result generated. I had to manually reverse the order of the columns in the test and ensure that the test produces the correct expected result based on the version of Python.
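A version-dependent expected result of this kind could look roughly like the sketch below. The data and names are made up for illustration and this is not the actual test from the Pull Request. It also assumes a Pandas version in which concat keeps the insertion order of a dictionary on Python 3.6 and above.

```python
import sys

import pandas as pd

# Made-up input: the keys are deliberately not in alphabetical order.
frames = {
    "b": pd.DataFrame({"x": [1, 2]}),
    "a": pd.DataFrame({"x": [3, 4]}),
}

# With axis=1 the dictionary keys become the top level of the columns.
result = pd.concat(frames, axis=1)
result_keys = list(result.columns.get_level_values(0))

# Pick the expected order based on the running Python version
# (assumption: sorted keys before 3.6, insertion order from 3.6 on).
if sys.version_info >= (3, 6):
    expected_keys = ["b", "a"]
else:
    expected_keys = ["a", "b"]

assert result_keys == expected_keys
```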

New Issue

https://github.com/pandas-dev/pandas/issues/17737

I found this issue by looking through older Pandas issues and got interested in solving it. It was posted on October 2, 2017, and nobody has worked on it since. I started working on it this week and will continue into next week.

The creator of the issue reported a bug: the subset parameter of the "dropna()" function does not accept incomplete MultiIndex keys.

The DataFrame.dropna(...) function removes missing values.

If you pass it a "subset=..." parameter, it only looks at the columns listed in the subset when deciding which rows to remove.
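As a quick illustration of the subset parameter on a frame with ordinary (non-MultiIndex) columns, here is a minimal sketch. The column names and values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "score": [1.0, np.nan, 3.0],
        "note": [np.nan, "x", "y"],
    }
)

# Without subset, any row containing a missing value is removed.
dropped_any = df.dropna()

# With subset, only the "score" column is checked, so the row with a
# missing "note" is kept.
dropped_by_score = df.dropna(subset=["score"])
```

Here dropped_any keeps only the last row, while dropped_by_score keeps the first and last rows.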

I created some sample code, following the examples that the moderator gave. Here are my observations:

Sample Code:

_____________________________________________

Running this produces an error:

_____________________________________________

Expected Output:

_____________________________________________
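To make the terms concrete, here is a rough sketch with made-up data (it is not the sample from the issue). It shows a complete MultiIndex key, which names every level of a column, being used in subset. It then expands the incomplete key "a" by hand into all the complete keys under it, which is one way to picture the behavior the issue asks for. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Columns with two levels: ("a", "x"), ("a", "y") and ("b", "x").
columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")])
df = pd.DataFrame(
    [
        [1.0, 2.0, 3.0],
        [np.nan, 5.0, 6.0],
        [7.0, np.nan, 9.0],
        [10.0, 11.0, np.nan],
    ],
    columns=columns,
)

# A complete key gives a value for every level of the column.
by_full_key = df.dropna(subset=[("a", "x")])

# An incomplete key such as "a" names only the top level. Expanding it
# by hand yields every complete key that starts with "a".
a_columns = [col for col in df.columns if col[0] == "a"]
by_expanded_key = df.dropna(subset=a_columns)
```

The first call only checks the ("a", "x") column, while the second checks both columns under "a" and ignores the missing value under "b".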

Progress

Apart from creating sample code to explore the issue further, I also wrote a test that reproduces the bug. For now it is a very basic test, but I will parametrize it to include more test cases next week.

Now that I have explored the issue, understand what the bug is, and have a test to check my work against, it is time to actually fix the bug. Next week I will be jumping into the code base and attempting to produce a Pull Request that solves this issue.
