Planning Pandas Contributions for the Next Month

Outstanding Pandas Issues


Pandas currently has over 2,700 open issues. Most of them are already claimed or actively being worked on. However, with such a big project and so many issues, some contributors don't manage to finish what they started, or simply lose interest in an issue they asked to work on. Once the moderators notice that an issue is no longer being worked on, they close the stale pull requests referencing it and update the issue to let people know that it still needs to be solved.

This week I went on a hunt for abandoned issues from 2017 and 2018. I found three issues that I would like to work on and attempt to solve within the next month. I also found some extra issues to work on in case I'm unsuccessful with any of the top three.

Issue 1 - Casting Data Types

This issue was posted on June 17, 2018. Since then, nobody has commented on it or offered to take it on. Since there is no competition for the issue, I think it's a great place for me to start exploring possible solutions. In addition, there is an "Effort Medium" tag on the issue, which would be a good step up from the previous issues that I've solved for Pandas.

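The creator of the issue raises a concern about inconsistencies in how data types are cast when a value is assigned into a Series. A float or int value assigned into a bool Series is cast to bool and the Series keeps its dtype, but the reverse does not hold: assigning a bool value into a float or int Series changes the dtype of the Series. The following are his observations: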

In [1]: import pandas as pd

In [2]: def cast_or_not(dtype1, dtype2):
   ...:     s = pd.Series(0, index=range(3), dtype=dtype1)
   ...:     s[0] = dtype2(1)
   ...:     return s.dtype == dtype1

In [3]: cast_or_not(bool, float)
Out[3]: True

In [4]: cast_or_not(float, bool)
Out[4]: False

In [5]: cast_or_not(bool, int)
Out[5]: True

In [6]: cast_or_not(int, bool)
Out[6]: False

In [7]: cast_or_not(float, int)
Out[7]: True

In [8]: cast_or_not(int, float)
Out[8]: True

As you can see:
  • float can be cast into bool, however bool cannot be cast into float
  • int can be cast into bool, however bool cannot be cast into int
  • float can be cast into int, and int can be cast into float

In order to solve the issue, all the above tests will need to output "True". I will be looking through the Pandas code that converts the types in the above example and will attempt to solve this issue.
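To keep track of progress on this, the check above can be turned into a small script that tries every pair of types at once. This is only a sketch: the names assignment_outcome and outcome_table are my own and are not part of Pandas, and the results depend on the installed Pandas version (some versions may warn or refuse the assignment instead of silently changing the dtype), so no expected output is shown.

```python
import warnings

import pandas as pd


def assignment_outcome(series_dtype, value_type):
    """Assign value_type(1) into a Series of series_dtype and report what happened."""
    s = pd.Series(0, index=range(3), dtype=series_dtype)
    try:
        with warnings.catch_warnings():
            # Some pandas versions warn before changing the dtype.
            warnings.simplefilter("ignore")
            s[0] = value_type(1)
    except Exception:
        # Some pandas versions may reject an incompatible value outright.
        return "raised"
    return "kept" if s.dtype == series_dtype else "changed"


def outcome_table():
    """Collect the outcome for every ordered pair of bool, int and float."""
    types = (bool, int, float)
    return {
        (target.__name__, value.__name__): assignment_outcome(target, value)
        for target in types
        for value in types
        if target is not value
    }


if __name__ == "__main__":
    for (target, value), outcome in outcome_table().items():
        print(f"{value} value into {target} Series: {outcome}")
```

Under my reading of the issue, a fix would make every pair report "kept".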

Issue 2 - Concat with OrderedDict

This issue was posted on June 16, 2018. Since then, there have been some attempts to fix it, but the pull request addressing the issue went stale and was closed by a moderator, meaning that the issue is open once again. In addition, this issue has been added to the milestone for the upcoming 0.24.0 release.

This issue has a "Difficulty Intermediate" tag, which is intimidating but would be a good challenge for me. After all, I said that I would be tackling more difficult issues in the next few months.

The creator of the issue has brought attention to a bug that occurs when the concat function is given an OrderedDict. An OrderedDict is a dictionary that remembers the order in which its items were added. However, when it is passed to concat, that order gets lost. Here are the observations:

In [1]: import pandas as pd

In [2]: from collections import OrderedDict

In [3]: pd.concat(OrderedDict([('First', pd.Series(range(3))),
   ...:                        ('Another', pd.Series(range(4)))]))
   ...:                        
Out[3]: 
Another  0    0
         1    1
         2    2
         3    3
First    0    0
         1    1
         2    2
dtype: int64

However, this should be the output:

First    0    0
         1    1
         2    2
Another  0    0
         1    1
         2    2
         3    3
dtype: int64

In order to fix this issue, I will research in more depth how the concat function handles an OrderedDict, to understand what causes the problem and how I can solve it.
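Until the bug is fixed, one possible workaround is to avoid handing the dictionary to concat at all. The sketch below is my own suggestion, not something proposed in the issue: it passes the values as a list and spells out the keys explicitly through the keys parameter of concat, so the order no longer depends on how concat treats a mapping.

```python
from collections import OrderedDict

import pandas as pd

pieces = OrderedDict([
    ("First", pd.Series(range(3))),
    ("Another", pd.Series(range(4))),
])

# Pass the values as a list and spell out the keys in insertion order,
# rather than relying on how concat orders a mapping's keys.
result = pd.concat(list(pieces.values()), keys=list(pieces.keys()))

# Outer index labels in the order they appear in the result.
outer_labels = list(result.index.get_level_values(0).unique())
print(outer_labels)  # ['First', 'Another']
```

This does not solve the issue itself, but it gives me a known-good result to compare against while I look into the dictionary code path.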

Issue 3 - Returns Wrong Dtype

This issue was posted on May 25, 2018. Although it looks like nobody has been working on this issue since then, the bug it describes has been fixed on the master branch. However, 4 days ago, one of the moderators commented that some tests for this issue would be appreciated.

Unlike the other two issues, this one seems a little easier because it requires writing tests rather than fixing actual functionality. So it might be a great place to start this year and explore tests in Pandas before moving on to the more difficult fixes.

The creator of the issue noticed unexpected behaviour when calling the Series.dt.to_period function on data containing NaTs. NaT represents a missing value for datetime objects. According to the moderator, this issue has already been resolved, but he is requesting tests to ensure that the to_period function behaves as expected.
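A test for this could look roughly like the sketch below. The test name and the sample data are my own (hypothetical), and the dtype check assumes a recent Pandas release in which the result of to_period has a period dtype; the actual tests will need to follow whatever the moderators ask for in the issue.

```python
import pandas as pd


def test_to_period_with_nat():
    # Hypothetical test name; one missing and one valid timestamp.
    s = pd.Series([pd.NaT, pd.Timestamp("2018-05-25")])
    result = s.dt.to_period("D")

    # Assumption: a recent pandas release, where the result has a period dtype.
    assert isinstance(result.dtype, pd.PeriodDtype)
    # The missing value should stay missing after conversion.
    assert result.isna().tolist() == [True, False]
    # The valid timestamp should map to the matching daily period.
    assert result[1] == pd.Period("2018-05-25", freq="D")


test_to_period_with_nat()
```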

List of Backup Issues
