DataFrame from Dict to Follow Insertion Order

March 29, 2019

This week I hit a roadblock on the issue I started last week, so I will describe the challenges I faced. Because my fix for that issue did not work out, I found another issue to work on. I will also cover that second issue and how I used my existing knowledge of Pandas to open a Pull Request with a possible fix.

Crosstab Dropna Issue

After discovering the issue last week and writing a test to help me solve it, I started analyzing its cause and possible solutions. The issue originates in an aggregation function called inside pivot_table, which crosstab uses to build its DataFrame. The aggregation ignores missing values in the row and column labels and aggregates only over the existing values.

In this case the aggregation function performs the calculation crosstab needs and never outputs NaN as a column or row name. I realized that changing the output of the aggregation function would affect everything else that uses it, which is undesirable. I found one possible solution, but it is too much of a "bandaid" fix, and it already breaks one of the tests that deals with a more complicated DataFrame.

    if dropna is False:
        df = df.fillna('NaN')

The bandaid fix uses the fillna function to replace all missing values with the placeholder string 'NaN' so that the aggregation function takes them into account. Just as dropna drops missing values, fillna fills them with a specified value. With this fix the test I created passes, and the resulting DataFrame includes 'NaN' among its columns and rows when the dropna argument is False.

Although this fix produces a DataFrame that includes NaN fields, it has a couple of problems. First, the position of the NaN label is not consistent: it follows alphabetical order, so it may show up at the beginning, middle, or end. Restructuring the DataFrame afterwards to ensure it is at the end would add a lot of extra code. Second, the fix breaks more complicated DataFrames that use margins.

Short of writing a separate function or my own code for this situation, I haven't found another good approach. In my opinion the "bandaid" fix is not a viable solution. So for now I've moved on to look for another issue, and I found a recent one that reminded me of an issue I worked on earlier this year.
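To make the ordering problem concrete, here is a minimal sketch of the placeholder approach applied from the outside, rather than inside crosstab itself. The frame and its column names (`row`, `col`) are made up for illustration and are not taken from the issue or from the Pandas test suite.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df = pd.DataFrame({
    "row": ["A", "Z", np.nan, "A"],
    "col": ["x", "y", "x", np.nan],
})

# Same idea as the bandaid fix: swap missing values for a string placeholder
# so that the grouping step treats them as an ordinary label.
filled = df.fillna("NaN")
table = pd.crosstab(filled["row"], filled["col"])

# The placeholder is sorted like any other string, so where it lands
# depends on the other labels.
row_labels = list(table.index)    # 'NaN' falls between 'A' and 'Z'
col_labels = list(table.columns)  # 'NaN' comes before 'x' and 'y'
```

With these labels the placeholder ends up in the middle of the rows but at the start of the columns, which is exactly the inconsistency described above.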

DataFrame Insertion Order Issue

As soon as I realized I would not be able to fix the crosstab dropna issue cleanly for now, I went through Pandas' open issues to see whether there was another recent one I could work on in the meantime. I found one that reminded me of an issue I had worked on involving the ordering of dictionaries in concat.

The creator of the issue pointed to a piece of documentation stating that a Series created from a dictionary should follow insertion order. With this in mind, he provided an example of a DataFrame whose rows do not follow insertion order and are sorted instead.

By using the following data to create a DataFrame:

The following DataFrame is created:

As you can see, the DataFrame does not follow insertion order and returns the rows sorted. The creator of the issue requested that insertion order be followed. I thought this was right up my alley and dove into the code to find where the problem was. As always, I started with a test for this functionality.

    def test_constructor_dict_order(self):
        # Rows should follow the key order of the inner dicts (b, c, a)
        # and columns the key order of the outer dict (B, A).
        data = {'B': {'b': 1, 'c': 2, 'a': 3},
                'A': {'b': 3, 'c': 2, 'a': 1}}
        result = pd.DataFrame(data)
        expected = pd.DataFrame([[1, 3], [2, 2], [3, 1]],
                                index=['b', 'c', 'a'],
                                columns=['B', 'A'])
        tm.assert_frame_equal(result, expected)
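Outside the test suite, the same nested dict can be used to get insertion-ordered rows without relying on the constructor. This is only a sketch of a user-side workaround, not the change made in the Pull Request. The row order that `pd.DataFrame(data)` returns on its own may differ between Pandas versions, so the example does not assume it.

```python
import pandas as pd

# Same nested dict as in the test above.
data = {"B": {"b": 1, "c": 2, "a": 3},
        "A": {"b": 3, "c": 2, "a": 1}}

frame = pd.DataFrame(data)

# Take the row order from the first inner dict and apply it explicitly,
# whatever order the constructor happened to choose.
insertion_order = list(data["B"])
ordered = frame.reindex(insertion_order)
```

Because `reindex` only rearranges existing labels here, the values stay attached to their rows and only the order changes.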

I then traced the DataFrame constructor to the origin of the problem. Surprisingly, the knowledge I had already built up in Pandas made this easy: after jumping through a few files and functions, I found exactly where the problem originated. One of the functions along the path was called with its "sort" parameter set to True. Setting it to False fixed the problem and allowed the DataFrame to be constructed following the dict's insertion order.
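The effect of such a flag can be sketched in plain Python. The helper below, `union_of_keys`, is hypothetical and is not the Pandas function involved. It only shows the difference between combining the keys of several dicts in sorted order and combining them in the order they are first seen.

```python
def union_of_keys(dicts, sort):
    """Hypothetical helper: merge the keys of several dicts into one list."""
    seen = {}  # a dict keeps insertion order, so it works as an ordered set
    for d in dicts:
        for key in d:
            seen.setdefault(key, None)
    keys = list(seen)
    return sorted(keys) if sort else keys


columns = [{"b": 1, "c": 2, "a": 3}, {"b": 3, "c": 2, "a": 1}]
sorted_rows = union_of_keys(columns, sort=True)      # alphabetical
insertion_rows = union_of_keys(columns, sort=False)  # order of first appearance

# When the dicts list their keys in different orders, the unsorted result
# is driven by whichever dict comes first.
mixed = [{"b": 1, "c": 2}, {"a": 1, "c": 2, "b": 3}]
mixed_rows = union_of_keys(mixed, sort=False)
```

With sorting, the result is the same no matter how each dict orders its keys. Without it, the result depends on the order of the inputs.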

However, this fix caused some tests to fail because the ordering of the DataFrames had changed. I fixed a few of these failures and submitted a Pull Request. After I submitted it, one of the moderators asked what the order would be if the dictionaries did not all follow the same order. My answer was that the result would follow the order of the first dictionary. He thought this might not be very consistent, and I agreed. With this in mind, we started talking about leaving DataFrame construction sorted, because that makes the behaviour consistent across all cases.

Now we need some input from other moderators, but it looks like I will have my first "Closed" pull request in Pandas instead of a "Merged" one, which is a bit upsetting. Still, I enjoyed solving the problem and having a good discussion about how it would impact Pandas. After weighing the pros and cons, it was decided that the cons outweigh the pros, which means the change should not be implemented.
