Open Source Contribution - Release 0.2.4

Contributing to Pandas



Pandas is a popular open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is often used in the machine learning field. The Pandas project has over 1,300 contributors and 18,000 commits, which makes it one of the biggest open-source projects related to machine learning.

Hacktoberfest 2018

This year I am participating in Hacktoberfest. Throughout the month of October, I need to make five pull requests with contributions to one or more projects on GitHub. This is my fourth contribution to Hacktoberfest, and for this one I came across the Pandas project on GitHub.

Throughout this Hacktoberfest I'm trying to find different ways to contribute to the open-source community. I've already explored non-code contributions and how they can be useful. I also explored contributing to a start-up project. Now, to gain experience working on a large open-source project, I chose Pandas. It has over 2,600 open issues, so I thought I could surely find some to work on. In addition, compared to other projects, Pandas is very thorough with its issues and pull requests and labels everything, which makes it easy to find issues to contribute to. In my experience, many other large projects lack labels and guidance for new developers, making it hard to find what to work on.


My Contribution

Pandas recently started integrating the isort library into the project. Essentially, isort sorts the import statements inside each file and groups them into sections. This is beneficial in a large project, where files use a lot of imports that are very hard to read through when unsorted.
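To make the idea concrete, here is a minimal sketch of the kind of sorting and grouping that isort performs. It is not isort's implementation and does not use its API: the `STDLIB` and `FIRST_PARTY` sets and the `classify` and `sort_imports` helpers are hypothetical names made up for this illustration, and the ordering rules are simplified assumptions.

```python
# Hypothetical, deliberately tiny lists; a real tool knows the full standard
# library and reads the first-party package names from its configuration.
STDLIB = {"os", "sys", "collections", "datetime"}
FIRST_PARTY = {"pandas"}


def classify(line):
    """Return a section index: 0 = standard library, 1 = third party, 2 = first party."""
    # In both "import a.b as c" and "from a.b import c" the module is the second token.
    top_level = line.split()[1].split(".")[0]
    if top_level in FIRST_PARTY:
        return 2
    if top_level in STDLIB:
        return 0
    return 1


def sort_imports(source):
    """Group import lines into sections and sort the lines inside each section."""
    sections = {0: [], 1: [], 2: []}
    for line in source.splitlines():
        line = line.strip()
        if line:
            sections[classify(line)].append(line)

    blocks = []
    for index in sorted(sections):
        if sections[index]:
            # Assumption for this sketch: plain "import x" lines come before
            # "from x import y" lines, then lines are ordered by module name.
            ordered = sorted(
                sections[index],
                key=lambda l: (l.startswith("from "), l.split()[1].lower()),
            )
            blocks.append("\n".join(ordered))
    # Sections are separated by a single blank line.
    return "\n\n".join(blocks) + "\n"


unsorted = """\
import pandas.core.common as com
import numpy as np
import os
from collections import OrderedDict
from pandas.core.dtypes.common import is_scalar
"""

print(sort_imports(unsorted))
```

Running this prints three sections separated by blank lines: the standard library imports (`os`, `collections`), then the third-party import (`numpy`), then the two `pandas` imports.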

After some testing and integration of the isort library, one of the Pandas contributors created an issue asking for the sorting to be performed on certain directories of the project. I chose to work on two directories inside pandas/core: groupby and dtypes. I created a separate pull request for each directory to keep my contribution organized.


The hardest part of this contribution wasn't actually sorting the imports within the directories, but working in the environment of a large project. Large projects often use CI (Continuous Integration) services like Travis CI to ensure that contributions pass all tests before being integrated, so that the whole project doesn't break because of a single commit. Through this contribution I learned how to identify the errors that cause CI checks to fail and how to work with the community to solve them. A few times I ran into a failure that had nothing to do with my own changes: it was related to a recently merged pull request, and my changes somehow triggered it. So I had to work with the contributors of those recent pull requests to try to solve these issues. Although it might seem annoying to deal with things you didn't cause, I realized that it helps the project when these issues surface early, because they shed light on problems that might come up later on.
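As a rough illustration of how such a check gates a merge, here is a minimal sketch of a script that reports files with unsorted imports and returns a non-zero exit code, which is how a CI service typically decides that a build has failed. This is not Pandas' actual CI configuration; `imports_are_sorted`, `run_check`, and the file names are hypothetical, and the sorting rule is simplified to plain alphabetical order.

```python
def imports_are_sorted(import_lines):
    """Return True when the imported module names are in alphabetical order."""
    modules = [line.split()[1].lower() for line in import_lines]
    return modules == sorted(modules)


def run_check(files):
    """Mimic a CI gate: return exit code 1 if any file fails the check, else 0.

    `files` maps a (hypothetical) file name to its list of import lines.
    """
    failures = [name for name, lines in files.items() if not imports_are_sorted(lines)]
    for name in failures:
        print(f"ERROR: imports are incorrectly sorted in {name}")
    return 1 if failures else 0


# Hypothetical example input: the second file has its imports out of order.
exit_code = run_check({
    "sorted_example.py": ["import os", "import sys"],
    "unsorted_example.py": ["import sys", "import os"],
})
```

A single failing file is enough to make the whole check fail, which is why one unrelated problem can block an otherwise correct pull request.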


When I sorted the imports in the pandas/core/groupby directory, I had absolutely no issues and it got merged quickly, which gave me a great feeling of accomplishment. However, sorting the pandas/core/dtypes directory gave me a lot of trouble: recent changes to the files in that directory caused conflicts between my branch and the master branch, and caused errors in the CI checks. Dealing with these conflicts and errors was a valuable experience that taught me about one of the hardest challenges of working on a big project.





Link to the pull request:


List of things that I contributed to Pandas:
  • Sorted the imports and separated them into sections using the isort library in the following directories:
    • pandas/core/groupby
    • pandas/core/dtypes
  • Ensured that the isort library worked as expected.
  • Dealt with errors that occurred during CI checks and ensured that my changes were up to project standards so that they could be merged into the project.

Conclusion

One of the hardest challenges of working on a big project is ensuring that your changes pass all tests and CI checks. It takes a lot of effort and communication with other contributors to ensure that your changes are up to the standards of the project. However, all this effort pays off in the end! I got an enormous feeling of accomplishment and pride when my pull requests passed all CI checks and got merged into the project. This was my first attempt at contributing to such a large-scale project, and I can't wait to contribute more!

