Fixing Order of OrderedDict in Pandas


This week I kept working on the issue from last week's blog post, this time producing a solution that takes the moderators' requests into account. I have opened a pull request and think I'm on the right track, but I'm waiting for the moderators' feedback before going any further.

Overview of the Issue




The creator of the issue reported a bug that appears when the concat function is combined with an OrderedDict. An OrderedDict is an ordered dictionary, which means it remembers the order in which items were added. However, when such a dictionary is passed to concat, that order is lost. Here is the observed behavior:

In [2]: from collections import OrderedDict

In [3]: pd.concat(OrderedDict([('First', pd.Series(range(3))),
   ...:                        ('Another', pd.Series(range(4)))]))
   ...:                        
Out[3]: 
Another  0    0
         1    1
         2    2
         3    3
First    0    0
         1    1
         2    2
dtype: int64

However, this should be the output:

First    0    0
         1    1
         2    2
Another  0    0
         1    1
         2    2
         3    3
dtype: int64

Summary of the moderators' requests:
  • In Python 3.7, regular dicts are also ordered and should be tested.
  • However, the implementation should distinguish between 3.6 and 3.7. There are PY36 and PY37 constants that can be used for this check.
  • pandas.core.common._dict_keys_to_ordered_list should be refactored to help with this check.

Progress on the Issue

I created a Pull Request that attempts to fix the issue, incorporating the requests of the moderators. I'm currently waiting for further feedback to ensure that I'm going in the right direction.

I created this test in order to test the order of OrderedDicts:
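
A minimal sketch of what such a test could look like, not the exact code from the pull request. The test name is made up for illustration, and the expected result is assumed to be the same series concatenated with explicit keys in insertion order:

```python
from collections import OrderedDict

import pandas as pd


def test_concat_preserves_ordereddict_order():
    # Hypothetical test name; the series mirror the example from the issue.
    objs = OrderedDict([
        ("First", pd.Series(range(3))),
        ("Another", pd.Series(range(4))),
    ])
    result = pd.concat(objs)

    # Concatenating the same series as a list with explicit keys
    # gives the order we expect to see.
    expected = pd.concat(list(objs.values()), keys=["First", "Another"])
    pd.testing.assert_series_equal(result, expected)
```

Because an OrderedDict is ordered on every Python version, a test like this should behave the same on 3.6 and 3.7.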



I changed the following function to use PY37 to sort the keys in dictionaries:


Based on the moderators' feedback, the ordering of regular dictionaries in Python 3.6 is an implementation detail of CPython, and users should not rely on it. In Python 3.7, however, the ordering of regular dictionaries became a language feature. For this reason, I changed the function to check for Python 3.7 instead of 3.6.
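
The version check described above can be sketched roughly as follows. This is an illustration rather than the function from pandas or from the pull request: the name ordered_keys and the version_info parameter are hypothetical, and plain sorting is assumed as the fallback on older interpreters.

```python
import sys
from collections import OrderedDict


def ordered_keys(mapping, version_info=sys.version_info):
    """Return the keys of ``mapping`` in the order concat should use.

    Hypothetical helper for illustration, not the pandas implementation.
    """
    # An OrderedDict always remembers insertion order; a plain dict only
    # guarantees it as a language feature from Python 3.7 onward.
    if isinstance(mapping, OrderedDict) or version_info >= (3, 7):
        return list(mapping.keys())
    # On older interpreters, fall back to a deterministic sorted order
    # (assumes the keys can be compared with each other).
    return sorted(mapping.keys())
```

For example, ordered_keys({"First": 1, "Another": 2}, version_info=(3, 6)) returns the keys sorted alphabetically, while the same call with version_info=(3, 7) keeps insertion order. An OrderedDict keeps its order in both cases.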

I then use this function to ensure that keys are sorted properly when using the concat function:


In addition, a couple of tests that previously used dict_keys_to_ordered_list needed to be altered to use Python 3.7 in order for the tests to pass.

I'm currently waiting for feedback from the moderators to find out whether I'm on the right path or need to reconsider the way I fixed the issue. Right now my Pull Request fixes the issue and incorporates the moderators' earlier requests; however, a couple of tests fail on Azure Pipelines when using Python 3.7. Everything works as expected on Python 3.6, which is the version I was using to test the issue.

To fix the tests that fail on Python 3.7, I attempted to set up a virtual environment that would allow me to switch between Python 3.6.6 and Python 3.7.2. However, I ran into some issues, which I describe in the next section. Next week I will work on a different solution to these issues and coordinate with the moderators to ensure that my Pull Request gets merged into Pandas.


Problems with Additional Python Versions

I found a really great tool called Pyenv that makes it easy to switch between Python versions. On macOS it works as follows:

First, install Homebrew by running the following command:
  • /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Second, install Pyenv:

  • brew install pyenv

To install an additional Python version:

  • pyenv install 3.7.2

      Note: On macOS Mojave (10.14) this will fail and you will need to run an extra installation.
      sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /

To check all installed versions:

  • pyenv versions

To switch between versions globally:

  • pyenv global 3.7.2

      Note: If you run into any problems with this, you'll need an additional command.
      eval "$(pyenv init -)"

After playing around for a while and finally setting everything up, it seemed to be working perfectly: I could switch between the original Python version on my computer (3.6.6) and the version provided by Pyenv (3.7.2). However, another problem appeared: I could not run Pytest on the version installed by Pyenv.

Despite researching the problem and trying solutions suggested by other people, I could not figure out how to fix this. I eventually realized that the cause was Anaconda, an open-source distribution of Python/R that I use for machine learning in my research projects. Pyenv looks for Pytest in the original directory where it should exist and doesn't find it, because Pytest lives in the Anaconda directories when you use Anaconda as your distribution of Python.
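
One way to see which interpreter is active and whether it can find Pytest is a small diagnostic like the one below. It is only a sketch using the standard library, and the function name describe_interpreter is made up for this example.

```python
import importlib.util
import sys


def describe_interpreter(module_name="pytest"):
    """Report which interpreter is running and whether it can import a module."""
    # find_spec returns None when a top-level module is not importable
    # from the interpreter that is currently running.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }
```

Running it under each Python version selected with Pyenv shows whether Pytest is importable from that interpreter and, if so, where it is installed.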

In order to fix this issue, I tried installing an Anaconda version of Python through Pyenv, but the latest one they have is 3.7.0, which fails to build the tests in Pandas.

In the end, I could not figure out how to make my setup work with Pyenv, so next week I will need to adopt a different strategy: creating virtual environments through Anaconda.



