Unit Tests in Pandas

Contributing Unit Tests to Pandas


The issue I decided to work on this week was one of the easier issues discussed in my last blog post, but it gave me valuable experience with unit testing in Python, and in Pandas specifically. I decided to start my 2019 open-source contributions with unit testing because I hadn't had much experience with it in Python before.

Overview of the Issue

As I mentioned in my last blog post, this issue exposed a problem with NaT when converting a Series to a Period using the dt.to_period('D') function. NaT represents a missing value for datetime objects. Other datetime64[ns] objects with values had no problem converting into a Period, but when a NaT was passed in, the result stayed as datetime64[ns]. Since the issue was posted, it has been resolved by later updates to the master branch. However, one of the moderators asked for unit tests that would ensure that dt.to_period('D') keeps working as expected.

import pandas as pd

print(pd.to_datetime(['2001']).to_series().dt.to_period('D').dtype)
print(pd.to_datetime(['NaT']).to_series().dt.to_period('D').dtype)

Previous Result
object
datetime64[ns]

Current Result (correct)
period[D]
period[D]
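
To make the role of NaT a little more concrete, here is a minimal sketch, assuming a recent pandas release. The example Series is made up for illustration and is not taken from the issue. It shows that NaT is reported as missing both before and after the conversion, while the converted Series has a period dtype:

```python
import pandas as pd

# Hypothetical example data: one real timestamp and one missing value.
ser = pd.Series([pd.Timestamp("2001-01-01"), pd.NaT])

# NaT is the datetime counterpart of NaN, so isna() flags it as missing.
print(ser.isna().tolist())

# Convert the whole Series to daily periods.
periods = ser.dt.to_period("D")

# The dtype is a period dtype, and the missing entry is still missing.
print(periods.dtype)
print(periods.isna().tolist())
```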

Setting Up Test Environment in Pandas

Once you have the most up-to-date version of the Pandas repository, you need to build the C extensions before you can run the tests. To build them, run the following command from the root Pandas folder:

python setup.py build_ext --inplace -j 4

After the C extensions are built, you can use pytest to run the unit tests in Pandas. With the pytest command, you can run all the tests at once or run a single test file on its own. The full test suite takes a long time to run, so I would suggest focusing on the specific file that you are modifying.

To run all tests from the Pandas root directory:

pytest pandas

To run a specific test file from the Pandas root directory:

pytest pandas/tests/api/test_api.py

If you add a -v argument, additional details will be displayed about each test that runs:

pytest pandas/tests/api/test_api.py -v

All unit tests in Pandas are organized under the pandas/tests directory. The project does a great job of separating the tests for different functionality into different directories and files. Whenever you need to write a test, figure out which functions you are testing and follow the structure in pandas/tests to find the corresponding test file.

Writing the Unit Test




In my case, I needed to test the dt.to_period('D') functionality that converts a Series to a Period. First I went into the pandas/tests directory and looked for either a series or a period folder. I found the series folder and, inside it, the test_period.py file, which was a good spot for the test and easy to find. To make sure that the tests in that file were running properly, I ran the following command:

pytest pandas/tests/series/test_period.py -v

After making sure that my environment was working, I started playing around, trying to test the to_period functionality. Since I had never written tests for Pandas, I was unfamiliar with the way the tests are written, so I first explored other tests inside the file to get an idea of how to begin writing mine. Once I understood how the other tests worked, I wrote a simple test that checked the exact functionality specified in the issue. I knew that it was not the optimal test, but I also knew that the moderators in Pandas are amazing with feedback and will always point you in the right direction. So, I created a pull request with what I had done so far and asked the moderator for feedback.

Initial Attempt:
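
A minimal sketch of a dtype-only test in this spirit could look like the following. It is not the code from the pull request: the function name is hypothetical, and it simply reuses the two expressions from the issue and checks the resulting dtype of each, assuming a pandas version in which the issue is fixed.

```python
import pandas as pd


def test_to_period_dtype_sketch():
    # Hypothetical test name; mirrors the two cases from the issue.
    with_value = pd.to_datetime(["2001"]).to_series().dt.to_period("D")
    with_nat = pd.to_datetime(["NaT"]).to_series().dt.to_period("D")

    # Only the dtype is checked here, not the values themselves.
    assert str(with_value.dtype) == "period[D]"
    assert str(with_nat.dtype) == "period[D]"
```

A test like this only looks at the dtype and handles each input in its own pair of lines, so it grows quickly as more inputs are added.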



As expected, the moderator pointed out some improvements that I could make to my test. He pointed out that the test could be parameterized instead of testing 2001 and NaT separately. This way, if any more values need to be tested, they can just be added to the parameter list at the top of the function. In addition, he suggested constructing the expected result and comparing it against the produced result. This way the whole object is compared and not just its type.

Improvement:


I integrated the moderator's suggestions into my test. Now the input values are passed into the function through @pytest.mark.parametrize. Inside the function, the input value is used to exercise the dt.to_period('D') function. Finally, the result is compared to the expected result.
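
As a rough sketch of that structure, and not the exact code that was merged, a parametrized test might look like this. The function and parameter names are hypothetical, and the sketch assumes pytest is installed and uses the public pd.testing.assert_series_equal helper to compare whole Series objects:

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("input_val", ["2001", "NaT"])
def test_to_period_sketch(input_val):
    # Hypothetical names; each parametrized value runs as its own test case.
    # Build the expected Series directly with a daily period dtype.
    expected = pd.Series([input_val], dtype="Period[D]")

    # Build a datetime Series from the same input and convert it.
    result = pd.Series([input_val], dtype="datetime64[ns]").dt.to_period("D")

    # Compare values, dtype, and index, not just the dtype.
    pd.testing.assert_series_equal(result, expected)
```

Adding another case then only requires appending a value to the list passed to @pytest.mark.parametrize.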



