Using Unit Tests to Fix a Bug in Pandas

February 01, 2019

The issue I decided to work on this week is one I mentioned two weeks ago while planning my work for the next couple of months. This time, I'm building on what I learned about unit tests in Pandas last week and using them to fix an outstanding bug.

Overview of the Issue

https://github.com/pandas-dev/pandas/issues/21510

The issue was posted on June 16, 2018. It's a fairly old issue, and one person has already attempted to solve it. However, the moderators requested changes to that pull request, and the author never committed anything that addressed their requests. After a while, one of the moderators marked the pull request as stale and closed it.

On December 2, 2018, after confirming that the issue was still unresolved, one of the moderators gave it the "Contributions Welcome" milestone. Since the moderators are still interested in a fix, I thought it would be a good issue for me to work on.

At first, I was a little intimidated by the "Difficulty Intermediate" label, but after looking further into the issue, I realized that the existing feedback would help me get started without asking unnecessary questions.

The creator of the issue drew attention to a bug that appears when the concat function is given an OrderedDict. An OrderedDict is an ordered dictionary, meaning it remembers the order in which items were added. However, when such a dictionary is passed to concat, that order is lost. Here is the observed behaviour:

In [1]: import pandas as pd

In [2]: from collections import OrderedDict

In [3]: pd.concat(OrderedDict([('First', pd.Series(range(3))),
   ...:                        ('Another', pd.Series(range(4)))]))
   ...:                        
Out[3]: 
Another  0    0
         1    1
         2    2
         3    3
First    0    0
         1    1
         2    2
dtype: int64

However, this should be the output:

First    0    0
         1    1
         2    2
Another  0    0
         1    1
         2    2
         3    3
dtype: int64
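The swapped order in Out[3] is what you would see if the dictionary keys were sorted alphabetically before concatenation, since "Another" sorts before "First". That is an assumption about the cause, not something the issue text above confirms. The sketch below uses only the standard library to show the difference; `keys_sorted` and `keys_in_insertion_order` are illustrative names, not pandas functions.

```python
from collections import OrderedDict

# Plain lists stand in for the two Series from the issue.
objs = OrderedDict([("First", [0, 1, 2]), ("Another", [0, 1, 2, 3])])


def keys_sorted(mapping):
    """Return the keys in alphabetical order, discarding insertion order."""
    return sorted(mapping)


def keys_in_insertion_order(mapping):
    """Return the keys in the order they were added to the mapping."""
    return list(mapping.keys())


sorted_keys = keys_sorted(objs)
ordered_keys = keys_in_insertion_order(objs)

print(sorted_keys)   # ['Another', 'First']
print(ordered_keys)  # ['First', 'Another']
```

The first list matches the buggy output and the second matches the expected output.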

Writing the Unit Test

My first thought before attempting to fix this issue was to write a unit test that would test the above functionality. This would allow me to easily test the changes that I make to the code. With the experience that I acquired from my last pull request, where I wrote a unit test for Pandas, it was a lot easier to formulate one that would be good for this issue.

I followed the same principle as before: produce the result from the buggy code path and compare it to the expected result, which I also had to construct. You can observe the code of my test below.

When I first ran the test, I got the following:

As you can see, the error is expected. The left input is the result produced by the buggy behaviour, and the right input is the expected behaviour. The test captures the concern raised in the issue and will help me produce a fix.

If I manage to fix the issue, the unit test will be a valuable part of the fix because it ensures the bug doesn't come back. Whenever something is added to the project, it is worth writing a test for it as well, so that later pull requests don't break the functionality that was just added.
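For illustration, a test of this shape might look like the sketch below. It is not the exact test from the pull request: the test name is made up, and building the expected result with `keys=` is just one way of doing it. It relies on `pandas.testing.assert_series_equal`, which raises an AssertionError when the two Series differ. On a pandas version that still has the bug the test should fail, and on a fixed version it should pass.

```python
from collections import OrderedDict

import pandas as pd
from pandas.testing import assert_series_equal


def test_concat_preserves_ordereddict_key_order():
    # Hypothetical test name; the inputs mirror the example from the issue.
    first = pd.Series(range(3))
    another = pd.Series(range(4))

    # Result under test: concat receives an OrderedDict.
    result = pd.concat(OrderedDict([("First", first), ("Another", another)]))

    # Expected result: the same Series concatenated with explicit keys,
    # which fixes the outer index order to First, then Another.
    expected = pd.concat([first, another], keys=["First", "Another"])

    assert_series_equal(result, expected)
```

Running it with pytest makes it easy to re-check the behaviour after every change to the code.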

Progress on the Bug

Now that the unit test is up and running, I started exploring ways to fix the actual bug. I looked at the earlier pull request to see what the previous contributor had achieved. His change produced the expected behaviour and passed the test that I created. However, the moderators requested changes to it, and those requests were never addressed. Next week I will attempt to address them and get further feedback from the moderators once I create my own pull request.

The following code was previously implemented to solve the bug:

The summary of change requests that followed:

In Python 3.7, regular dicts are also ordered, so they should be tested as well.
However, the implementation should distinguish between 3.6 and 3.7. There are PY36 and PY37 constants that can be used for this check.
pandas.core.common._dict_keys_to_ordered_list should be refactored to help with this check.

Next week I will use all of this feedback to come up with a solution.
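One way to read the change requests above is sketched below. This is not the pandas implementation: the PY36 and PY37 flags are recreated here from `sys.version_info` as stand-ins for the constants the moderators mentioned, and `dict_keys_to_ordered_list` is a hypothetical helper that only borrows its name from the function in the feedback. The sketch assumes that plain dicts are trusted to be ordered only from 3.7 onward, where ordering is a language guarantee, while an OrderedDict is trusted on any version.

```python
import sys
from collections import OrderedDict

# Assumption: stand-ins for the PY36/PY37 version flags named in the review.
PY36 = sys.version_info >= (3, 6)
PY37 = sys.version_info >= (3, 7)


def dict_keys_to_ordered_list(mapping):
    """Hypothetical helper: return the keys in an order safe to rely on.

    Insertion order is kept for an OrderedDict on any Python version and
    for plain dicts on 3.7+. Otherwise the keys are sorted so the result
    is at least deterministic.
    """
    if PY37 or isinstance(mapping, OrderedDict):
        return list(mapping.keys())
    return sorted(mapping)


ordered = OrderedDict([("First", 1), ("Another", 2)])
plain = {"First": 1, "Another": 2}

print(dict_keys_to_ordered_list(ordered))  # ['First', 'Another']
print(dict_keys_to_ordered_list(plain))    # depends on the Python version
```

Keeping the version check inside a single helper means concat would not need its own version-specific branches.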
