Posts

First Enhancement in Pandas

After spending the last few months fixing bugs in Pandas and writing tests, I decided to try something a little different. I went hunting for enhancement issues to see whether I could add another type of contribution to Pandas. Many of the enhancement requests I found were hard and technical, so I stayed away from them for now. However, I found one that didn't require adding whole new functionality, only an extra calculation in an existing function, and I proceeded to work on it.

Issue: https://github.com/pandas-dev/pandas/issues/21689

The issue deals with the pd.DataFrame.describe() function in Pandas. Currently, this function provides 8 summary statistics computed on the DataFrame that calls it: count, mean, std, min, 25%, 50%, 75%, max. All of these summary statistics are useful in data analysis and provide a summary of the central t
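
To make the eight statistics listed in this excerpt concrete, here is a minimal sketch of what describe() returns for a small numeric DataFrame. The sample data and variable names are made up for illustration and are not taken from the issue.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})

# describe() returns a DataFrame whose index holds the statistic names
# and whose columns match the numeric columns of the original frame.
summary = df.describe()

print(summary.index.tolist())
print(summary)
```

Each row of the result can be read with label-based indexing, for example summary.loc["mean", "a"].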

Progress in Open Source

In this blog I will talk about my journey and progress in open source development so far. I've really been enjoying working on real-world projects that many people use every day. It is challenging, because every Pull Request required research to understand the issue and solve it, but it is also rewarding, because I gain much-needed hands-on experience and contribute to the open source community that I have grown very fond of.

Open Source Timeline Summary

I started exploring open source development in September 2018. I had to learn a lot of useful tools that Git and GitHub offer, which now come in handy almost every day. I now know how to create meaningful issues that follow the specifications of the project, as well as how to create and structure my Pull Requests to the standards of the project I'm working on. Furthermore, learning how to properly manipulate version history using

DataFrame from Dict to Follow Insertion Order

This week I ran into a problem with the issue I started working on last week, so I will talk about the challenges I faced. In addition, because my fix for that issue was not successful, I found another issue to work on. I will also talk about this second issue and how I used my previous knowledge of Pandas to make a Pull Request with a possible fix.

Crosstab Dropna Issue

After discovering the issue last week and creating a test to help me solve it, I started analyzing its cause and possible solutions. I discovered that the issue originates in an aggregation function called inside the pivot_table function, which is used when creating a crosstab DataFrame. The aggregation functions don't treat missing values as column or row names, and they aggregate only over existing values. In this case the aggregation function performs a calculation needed for crosstab a

Another Dropna Bug in Pandas

This year I fixed a few bugs that dealt with NaN values, and more specifically with the dropna argument in Pandas. Every bug like this made me more interested in fixing behaviour that is not consistent across the code base, specifically how functions deal with missing values. Some functions have a dropna argument. When it is set to True, the function should drop columns/rows that contain missing values; when it is set to False, the function should keep them. Currently, some functions deal well with NaN values, but others either produce buggy behaviour with NaN or don't handle it consistently with the rest of the code base. This week I found another bug like this, and I will be working on it throughout next week.

Issue: https://github.com/pandas-dev/pandas/issues/10772

The creator of the issue noticed inconsistent behaviour when using the dropna argument inside the crosstab function
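
For readers who have not used the dropna argument of crosstab mentioned in this excerpt, here is a minimal sketch that calls it both ways on data containing a missing value. The Series and their values are made up for illustration and do not reproduce the example from the issue. Because the handling of NaN here has varied between Pandas versions (which is what the issue is about), the sketch prints both results instead of stating an expected output.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row label is missing.
a = pd.Series(["x", "x", np.nan, "y"], name="a")
b = pd.Series(["p", "q", "p", "p"], name="b")

# Default behaviour (dropna=True).
dropped = pd.crosstab(a, b, dropna=True)

# Ask crosstab to keep missing values where it supports doing so.
kept = pd.crosstab(a, b, dropna=False)

print(dropped)
print(kept)
```

Comparing the two printed tables on your own Pandas version shows how that version treats the missing label.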

Fixing a Couple Small Issues in Pandas

This week I managed to fix a couple of small issues in Pandas. One of them arose from behaviour observed in my previous Pull Request, where I ensured that the ordering of OrderedDicts and dicts in Python >= 3.6 was respected. In addition, I made another simple fix while working on a different issue, which didn't seem trivial at first but in the end was just a simple mistake in the code. The problem was easy to fix; however, it took some time to find where it originated. I will talk about fixing both of these small issues in more detail below.

Replacing Dicts with OrderedDicts in Aggregation Functions of Groupby

Issue: https://github.com/pandas-dev/pandas/issues/25692
Pull Request: https://github.com/pandas-dev/pandas/pull/25693

While working on a previous Pull Request, which ensured that ordering of dicts and OrderedDicts was respected across different versions of Python, we came across o

Working with Incomplete MultiIndex keys in Pandas

Incomplete MultiIndex keys in Dropna

This week I produced a Pull Request for the bug in Pandas that I discussed last week. Although I think my solution can be improved further, it does fix the immediate issue of incomplete MultiIndex keys in the subset parameter of the dropna() function. I submitted the Pull Request to gather feedback on my solution and improve it at the moderators' discretion.

Overview of the Issue

https://github.com/pandas-dev/pandas/issues/17737

The creator of the issue brought attention to a bug that does not allow MultiIndex keys to be incomplete in the subset parameter of the dropna() function. The DataFrame.dropna(...) function removes missing values. If you pass it a "subset=..." parameter, it removes rows based only on the columns listed in the subset. I created some sample code, following the examples that the moderator gave and here a
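
As a minimal sketch of the subset parameter described in this excerpt, the example below uses ordinary (non-MultiIndex) columns, so it shows only the basic behaviour and not the bug itself. The data and names are made up for illustration and are not the sample code from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in columns "a" and "b".
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, 5.0, 6.0],
        "c": [7.0, 8.0, 9.0],
    }
)

# Without subset, any row containing a missing value is removed.
all_columns = df.dropna()

# With subset, only missing values in the listed columns matter,
# so row 0 survives even though its "b" value is missing.
only_a = df.dropna(subset=["a"])

print(all_columns)
print(only_a)
```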

Solving the Next Bug in Pandas

This week I went on another search for an issue in Pandas. I used the same tactic that I started using this year: finding and solving older bugs that nobody has fixed yet. There are over 2,800 issues in Pandas, and it takes a while to find a bug that you're interested in working on, but when you fix the bug, you get a big feeling of satisfaction and accomplishment. Before we dive into the new bug, here is a quick update on the bug I was working on last week.

Update on Last Week's Bug

Issue: https://github.com/pandas-dev/pandas/issues/21510
Pull Request: https://github.com/pandas-dev/pandas/pull/25224

The moderators' feedback that I previously observed on the issue, relating to the ordering of regular dictionaries in Python 3.6 and 3.7, was outdated. Previously, I stated that the ordering of regular dictionaries in Python 3.6 is an implementation detail of CPython, and users should not rely on it. Howev
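
To illustrate the dictionary ordering this excerpt refers to, here is a minimal sketch that builds a DataFrame from a regular dict whose keys are deliberately not in alphabetical order. It assumes Python 3.7 or later, where insertion order is guaranteed by the language, and a reasonably recent Pandas release. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the keys are intentionally not sorted.
data = {"b": [1, 2], "c": [3, 4], "a": [5, 6]}

# A regular dict iterates over its keys in insertion order.
key_order = list(data)

# The DataFrame columns follow the same order as the dict keys.
df = pd.DataFrame(data)

print(key_order)
print(list(df.columns))
```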