First Enhancement in Pandas



After spending the last few months fixing bugs in Pandas and writing tests, I decided to try something a little different. I went hunting for enhancement issues to see if I could add another type of contribution to Pandas. Many of the enhancement requests I found were hard and quite technical, so I stayed away from them for now. However, I found one that didn't require adding a whole new piece of functionality, only an extra calculation in an existing function, and I proceeded to work on it.

Issue

The issue deals with the pd.DataFrame.describe() function in Pandas. Currently, for numeric columns, this function returns eight summary statistics computed on the DataFrame it is called on: count, mean, std, min, 25%, 50%, 75%, and max. All of these are useful in data analysis and summarize the central tendency, dispersion, and shape of a dataset's distribution.
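To make the current behaviour concrete, here is a minimal sketch that calls describe() on a small DataFrame. The column names and values are only illustrative (they match the ones I use in my test further down), not something taken from the issue itself.

```python
import numpy as np
import pandas as pd

# Illustrative data: col1 contains one missing value, col2 has none.
df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})

summary = df.describe()

# For numeric columns the rows are:
# count, mean, std, min, 25%, 50%, 75%, max
print(summary.index.tolist())

# "count" only counts non-missing values, so col1 reports 1 and col2 reports 2.
print(summary.loc["count"])
```

Notice that nothing in this output tells you directly how many values are missing. You have to compare "count" with the number of rows yourself.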




However, the creator of the issue pointed out that there is one more summary statistic that would be very useful for data analysis and could be added to pd.DataFrame.describe(). He requested that a "missing" summary statistic be added to the function, which would return the number of missing values in each column of the DataFrame. The existing "count" statistic returns the number of values excluding missing ones, so "missing" would do the opposite and return the number of missing values.


Once I spotted this issue, it immediately drew my attention because just a couple of weeks ago, when I was performing data analysis in the machine learning research project that I'm currently working on, one of the statistics that we needed to calculate for our evaluation phase was the number of missing values. So from my own experience in data analytics, I could tell that this was a nice enhancement to the describe() function.

Pull Request


As always, I started by writing a test that would guide me in adding this enhancement and ensure that it works as expected in the pd.DataFrame.describe() function.

def test_missing_describe(self):
    # np, pd and tm (pandas' testing helpers) are imported at the top of
    # the test module. describe() should gain a 'missing' row holding the
    # number of NaN values per column.
    df = pd.DataFrame(data={'col1': [1, np.nan],
                            'col2': [3, 4]})
    result = df.describe()
    expected = pd.DataFrame({'col1': [1, 1, np.nan, 1, 1, 1, 1, 1, 1],
                             'col2': [2, 3.5, 0.707107, 3, 3.25, 3.5,
                                      3.75, 4, 0]},
                            index=['count', 'mean', 'std', 'min', '25%',
                                   '50%', '75%', 'max', 'missing'])
    tm.assert_frame_equal(result, expected)

I then explored the describe() function to figure out how it works and where it computes each of the summary statistics. There I added the "missing" summary statistic, which calculates the number of missing values with series.isna().sum().
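The idea can be sketched from the outside, without touching the Pandas source. The helper below is hypothetical (describe_with_missing is a name I made up for illustration, and this is not the code from my pull request). It simply appends a "missing" row, computed with isna().sum(), to the regular describe() output.

```python
import numpy as np
import pandas as pd


def describe_with_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: describe() output plus a 'missing' row."""
    summary = df.describe()
    # describe() only keeps numeric columns by default, so count NaN values
    # in exactly the columns that made it into the summary.
    missing = df[summary.columns].isna().sum().rename("missing")
    # Turn the Series into a one-row frame and append it below the summary.
    return pd.concat([summary, missing.to_frame().T])


df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})
result = describe_with_missing(df)
print(result)
```

With this data, the last row of the result is labelled "missing" and holds 1 for col1 and 0 for col2, which is what my test above expects from the modified describe().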


This allowed the describe() function to also include a summary of missing values and produced the DataFrame that the creator of the issue expected. I then had to update a few existing tests to account for the new row. However, one of the moderators commented on the issue, suggesting that a "length" statistic should be added instead of "missing". The "length" statistic would show the total number of values, including missing ones, so to get the number of missing values a user would have to subtract the "count" statistic from the "length" statistic. I think that having a "missing" statistic would be better, and I am waiting for feedback on my Pull Request.
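For comparison, here is a rough sketch of what a user would have to do under the "length" proposal. Since "length" is only a suggestion at this point, I compute it by hand with len(df). The variable names are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan], "col2": [3, 4]})

# "count": non-missing values per column (what describe() already reports).
count = df.count()

# "length": total number of values per column, missing ones included.
length = pd.Series(len(df), index=df.columns)

# Under the "length" proposal, the user derives the missing count themselves.
missing = length - count
print(missing)
```

It is only one subtraction, but it is an extra step for something that a "missing" row would show directly.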

Once we agree on either "missing" or "length", I will have to update the documentation of the function to reflect the change and fix the rest of the tests that are currently failing because of the new row.

Overall, this was a great experience, and I'm glad that I chose to work on an enhancement this time, because it gives me experience with a kind of contribution other than bug fixes. In addition, I will get some experience updating the Pandas documentation to match newly added functionality.



