Topics in Open Source Development

Posts

Showing posts from 2019

First Enhancement in Pandas

April 15, 2019

After spending the last few months fixing bugs in Pandas and writing tests, I decided to try something a little different. I went looking for enhancement issues to see whether I could add another type of contribution to Pandas. Many of the enhancement requests I found were hard and highly technical, so I stayed away from them for now. However, I found one that did not require entirely new functionality, only an extra calculation in an existing function, and I proceeded to work on it. Issue: https://github.com/pandas-dev/pandas/issues/21689 The issue deals with the pd.DataFrame.describe() function in Pandas. Currently, this function computes 8 summary statistics for the DataFrame it is called on: count, mean, std, min, 25%, 50%, 75%, and max. All of these summary statistics are useful in data analysis and provide a summary of the central t...
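As a small illustration of the eight statistics listed above, here is a minimal sketch on a made-up single-column DataFrame. The column name and values are hypothetical and are not taken from the issue.

```python
import pandas as pd

# Hypothetical data, chosen only so the statistics are easy to check by hand.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0, 4.0]})

# For numeric columns, describe() returns one row per summary statistic.
summary = df.describe()
stat_names = list(summary.index)

print(stat_names)
print(summary.loc["mean", "height"])
```

The row labels of the result are the statistic names, so an individual value can be looked up with summary.loc[statistic, column].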

Progress in Open Source

April 05, 2019

In this post I will talk about my journey and progress in open source development so far. I have really enjoyed working on real-world projects that many people use every day. It is challenging, because every Pull Request required research to understand the issue and solve it, but it is also rewarding, because I gain much-needed hands-on experience and contribute to an open source community that I have grown very fond of. Open Source Timeline Summary: I started exploring open source development in September 2018. I had to learn many of the useful tools that Git and GitHub offer, which now come in handy almost every day. I now know how to create meaningful issues that follow a project's specifications, and how to create and structure my Pull Requests to the standards of the project I am working on. Furthermore learning how to properly manipulate version history using ...

DataFrame from Dict to Follow Insertion Order

March 29, 2019

This week I ran into a problem with the issue I started working on last week, so I will talk about the challenges I faced. In addition, because my fix for that issue was not successful, I found another issue to work on. I will also describe that issue and how I used my previous knowledge of Pandas to make a Pull Request with a possible fix. Crosstab Dropna Issue: After discovering the issue last week and creating a test to help me solve it, I started analyzing the cause of the issue and possible solutions. I discovered that the issue originates in an aggregation function called inside the pivot_table function, which is used when creating a crosstab DataFrame. The aggregation functions do not treat missing values as column or row names; they aggregate only over existing values. In this case the aggregation function performs a calculation needed for crosstab a...

Another Dropna Bug in Pandas

March 22, 2019

This year I fixed a few bugs that dealt with NaN values, and more specifically with dropna behaviour in Pandas. Every bug like this made me more interested in fixing behaviour that is inconsistent across the code base, specifically how functions deal with missing values. Some functions have a dropna argument. When it is set to True, the function should drop columns/rows that contain missing values, and when it is set to False, the function should keep them. Currently, some functions deal well with NaN values, but others either behave incorrectly with NaN or do not handle it consistently with the rest of the code base. This week I found another bug like this, and I will be working on it throughout next week. Issue: https://github.com/pandas-dev/pandas/issues/10772 The creator of the issue noticed inconsistent behaviour when using the dropna argument inside the crosstab function...
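To make the dropna argument concrete, here is a minimal sketch using Series.value_counts, which is one Pandas function that accepts dropna. This is an illustration on hypothetical data only; it is not the crosstab case from the issue, whose exact behaviour is the subject of the bug.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series(["a", "b", np.nan, "a"])

# dropna=True (the default) leaves the missing value out of the counts.
counts_without_nan = s.value_counts(dropna=True)

# dropna=False keeps the missing value as its own entry.
counts_with_nan = s.value_counts(dropna=False)

print(counts_without_nan.sum(), counts_with_nan.sum())
```

The two totals differ by exactly the number of missing values, which is a quick way to check whether a function honoured the argument.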

Fixing a Couple Small Issues in Pandas

March 15, 2019

This week I managed to fix a couple of small issues in Pandas. One of them was created because of behaviour observed in my previous Pull Request, where I ensured that the ordering of OrderedDicts and dicts in Python >= 3.6 was respected. I also made another simple fix while working on a different issue, which did not seem trivial at first but turned out to be a simple mistake in the code. The problem was easy to fix; however, it took some time to find where it originated. I will talk about both of these small fixes in more detail below. Replacing Dicts with OrderedDicts in Aggregation Functions of Groupby: Issue: https://github.com/pandas-dev/pandas/issues/25692 Pull Request: https://github.com/pandas-dev/pandas/pull/25693 While working on a previous Pull Request, which ensured that the ordering of dicts and OrderedDicts was respected across different versions of Python, we came acr...

Working with Incomplete MultiIndex keys in Pandas

February 22, 2019

Incomplete MultiIndex keys in Dropna. This week I produced a Pull Request for the bug in Pandas that I discussed last week. Although I think my solution can be improved further, it does fix the immediate issue of incomplete MultiIndex keys in the subset parameter of the "dropna()" function. I submitted the Pull Request to gather feedback on my solution and improve it at the moderators' discretion. Overview of the Issue: https://github.com/pandas-dev/pandas/issues/17737 The creator of the issue drew attention to a bug that prevents MultiIndex keys from being incomplete in the subset parameter of the "dropna()" function. The DataFrame.dropna(...) function removes missing values. If you pass it a "subset=..." parameter, it removes rows based only on the columns listed in the subset. I created some sample code, following the examples that the moderator gave and here a...
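The effect of the subset parameter described above can be sketched as follows. This example uses ordinary (flat) column labels and hypothetical data; it does not reproduce the MultiIndex case from the issue.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" is missing in row 1, column "b" in row 0.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

# Without subset, a row is dropped if any column has a missing value.
dropped_any = df.dropna()

# With subset, only the listed columns are checked for missing values.
dropped_on_a = df.dropna(subset=["a"])

print(list(dropped_any.index))
print(list(dropped_on_a.index))
```

Row 0 survives the second call even though "b" is missing there, because only column "a" is inspected.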

Solving the Next Bug in Pandas

February 15, 2019

This week I went on another search for an issue in Pandas. I used the same tactic that I started using this year: finding and solving older bugs that nobody has fixed yet. There are over 2,800 issues in Pandas, and it takes a while to find a bug that you are interested in working on, but when you fix the bug, you get a big feeling of satisfaction and accomplishment. Before we dive into the new bug, here is a quick update on the bug I was working on last week. Update on Last Week's Bug: Issue: https://github.com/pandas-dev/pandas/issues/21510 Pull Request: https://github.com/pandas-dev/pandas/pull/25224 The feedback from the moderators that I previously reported, regarding the ordering of regular dictionaries in Python 3.6 and 3.7, was outdated. Previously, I stated that the ordering of regular dictionaries in Python 3.6 is an implementation detail of CPython, and users should not rely on it. Howev...
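For context on the dictionary ordering discussed above: since Python 3.7, the language guarantees that a regular dict preserves insertion order, while in CPython 3.6 the same behaviour was an implementation detail. A minimal sketch with hypothetical keys, assuming Python 3.7 or later:

```python
from collections import OrderedDict

# Keys are inserted in an order that differs from alphabetical order.
plain = {}
plain["zebra"] = 1
plain["apple"] = 2

ordered = OrderedDict([("zebra", 1), ("apple", 2)])

# Both containers iterate in insertion order on Python 3.7+.
plain_keys = list(plain)
ordered_keys = list(ordered)

print(plain_keys, ordered_keys)
```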

Fixing Order of OrderedDict in Pandas

February 08, 2019

This week I have been working on the same issue as in last week's post, producing a solution that takes the moderators' requests into account. Although I made a pull request and think I am on the right track, I am waiting for the moderators' feedback before I proceed any further. Overview of the Issue: https://github.com/pandas-dev/pandas/issues/21510 The creator of the issue drew attention to a bug that occurs when combining the concat function with an OrderedDict. An OrderedDict is an ordered dictionary, which means that it should remember the order in which items were added to it. However, when such a dictionary is passed to concat, the order gets lost. Here are the observations: In [2]: from collections import OrderedDict In [3]: pd.concat(OrderedDict([('First', pd.Series(range(3))), ...: ...
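One way to check whether concat keeps the order of an OrderedDict is to inspect the outer index level of the result. The sketch below uses hypothetical keys and data rather than the example from the issue. On recent Pandas versions the labels should come back in insertion order; on the older versions the issue describes, they may not.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical keys, inserted in non-alphabetical order on purpose.
pieces = OrderedDict([
    ("second", pd.Series(range(3))),
    ("first", pd.Series(range(3, 6))),
])

# Passing a mapping to concat uses its keys as the outer index level.
result = pd.concat(pieces)

# Collect the outer-level labels in the order they appear in the result.
outer_labels = list(result.index.get_level_values(0).unique())
print(outer_labels)
```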

Using Unit Tests to Fix a Bug in Pandas

February 01, 2019

The issue that I decided to work on this week is one of the issues I mentioned two weeks ago when I was planning my work for the next couple of months. This time, I am building on my knowledge of unit tests in Pandas from last week and using them to my advantage to fix an outstanding bug. Overview of the Issue: https://github.com/pandas-dev/pandas/issues/21510 The issue was posted on June 16, 2018. It is a fairly old issue, and one person has attempted to solve it. However, the moderators requested changes from that contributor, who never committed anything that satisfied their requests. After a while, one of the moderators stated that the pull request was stale and closed it. On December 2, 2018, after confirming that the issue was not yet resolved, one of the moderators updated the issue to have a "Contributions Welcome" milestone. Seeing that the moderators are still inter...