Open Source Contribution - Release 0.4 Part 3

Linting Errors in Pandas


Building on my previous post, where I explored style and linting errors in the pySearch project and the tools used to find them, last week I dealt with similar issues in Pandas. On top of Flake8 and PEP8 Speaks, the Pandas project also uses its own script that checks for custom errors. Since I already explored Flake8 and PEP8 Speaks in depth, I decided to focus on the custom errors within Pandas and on ways to fix errors whose origin is difficult to trace.

GL07 Error

The error that I focused on fixing last week was: GL07 - Sections are in the wrong order. As I stated before, Pandas is a massive project. To help developers remember the purpose and use of each function, and to help new contributors understand what the functions do, Pandas does a great job of documenting its code through docstrings within each function. These docstrings often include large sections of documentation under different headings. Some examples of these headings are: Parameters, Returns, Raises, See Also, Notes, and Examples. More headings are used throughout the project, but the ones I listed are the most common.

The Parameters section lists and explains all the parameters that the function takes.
The Returns section lists and explains what the function returns.
The Raises section describes the different exceptions that the function can raise.
The See Also section lists similar functions or ones that are used within the current function.
The Notes section includes extra notes that a developer may want to know about the function.
The Examples section provides thorough examples of how the function can be used.
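To make these sections concrete, here is a small made-up function with a docstring that uses all six headings in the order listed above. The function clip_value and its parameters are my own illustration, not code from Pandas.

```python
def clip_value(value, lower, upper):
    """
    Limit a number to the closed interval [lower, upper].

    Parameters
    ----------
    value : float
        Number to limit.
    lower : float
        Smallest allowed result.
    upper : float
        Largest allowed result.

    Returns
    -------
    float
        ``value`` if it lies inside the interval, otherwise the nearer bound.

    Raises
    ------
    ValueError
        If ``lower`` is greater than ``upper``.

    See Also
    --------
    min : Return the smallest of its arguments.
    max : Return the largest of its arguments.

    Notes
    -----
    Both bounds are inclusive.

    Examples
    --------
    >>> clip_value(12, 0, 10)
    10
    """
    if lower > upper:
        raise ValueError("lower must not be greater than upper")
    # Clamp from above first, then from below.
    return max(lower, min(value, upper))
```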

As you can see, the order in which I listed the sections makes a lot of sense and feels like a natural way to arrange them. The developers of Pandas also thought that this would be the best way to order these headings. To keep all docstrings within the project consistent, they created the GL07 error, which is reported for any docstring whose sections do not follow this order.
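The idea behind such a check can be sketched in a few lines. This is only my own simplified illustration, assuming headings are a known title followed by a line of dashes; it is not the actual implementation used by the Pandas script, and the names EXPECTED_ORDER, find_sections and sections_in_order are hypothetical.

```python
# Only the six headings discussed in this post; the real list is longer.
EXPECTED_ORDER = ["Parameters", "Returns", "Raises", "See Also", "Notes", "Examples"]


def find_sections(docstring):
    """Return the recognised section headings in the order they appear."""
    lines = [line.strip() for line in docstring.splitlines()]
    found = []
    for title, underline in zip(lines, lines[1:]):
        # A heading is a known title followed by a line made only of dashes.
        if title in EXPECTED_ORDER and underline and set(underline) == {"-"}:
            found.append(title)
    return found


def sections_in_order(docstring):
    """Return True when the headings follow EXPECTED_ORDER."""
    positions = [EXPECTED_ORDER.index(title) for title in find_sections(docstring)]
    return positions == sorted(positions)


# A docstring with Examples placed before Returns, which should be rejected.
BAD_DOCSTRING = """
Add one to a number.

Examples
--------
>>> 1 + 1
2

Returns
-------
int
"""

print(find_sections(BAD_DOCSTRING))
print(sections_in_order(BAD_DOCSTRING))
```

Fixing a docstring like this one simply means moving the Returns block above the Examples block, which is what most of my changes looked like.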

However, before this error was introduced, many docstrings had been written with their sections out of order. I reordered the sections in these docstrings, fixing almost all GL07 errors within Pandas across two pull requests. There were a total of 286 errors, and by the end of the second pull request only 14 were left (I will explain why I couldn't fix the remaining errors later in this post). Fixing so many errors was a large task and took a lot of time.

Fixing the GL07 Error

Running "./scripts/validate_docstrings.py --errors=GL07" lists all the GL07 errors in Pandas. Adding a format parameter to this command ("--format=azure") also shows the files and line numbers where each error occurs. With both parameters combined, some errors were easy to find because the script told me exactly where they occurred. However, most of the time the task wasn't so easy. Since Pandas is a massive project, many functions are similar or repeated throughout the codebase. To save space and work, the Pandas developers created templates and structures for docstrings, which are then reused in different files. To fix an issue at its origin, I had to trace where these docstrings come from, which was sometimes very difficult. In addition, there were cases where the file and line numbers were not displayed at all, and I had to search the whole project for the function in question and find where it originates.
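To show why a templated docstring is hard to trace, here is a minimal sketch of the pattern. The template text, the fill_doc decorator and the two functions are all hypothetical and written for this post; Pandas has its own helpers for this, which I am not reproducing here.

```python
# Hypothetical shared template: one piece of text reused by several functions.
# Its sections are deliberately out of order (Returns before Parameters).
SHARED_TEMPLATE = """
Compute the {name} of a list of numbers.

Returns
-------
float

Parameters
----------
values : list of float
"""


def fill_doc(**fields):
    """Hypothetical decorator that builds a docstring from SHARED_TEMPLATE."""
    def decorator(func):
        func.__doc__ = SHARED_TEMPLATE.format(**fields)
        return func
    return decorator


@fill_doc(name="sum")
def total(values):
    return float(sum(values))


@fill_doc(name="mean")
def average(values):
    return sum(values) / len(values)


# Both functions carry the misordered docstring, yet neither function body
# contains the text, so looking at the reported function does not help.
for func in (total, average):
    doc = func.__doc__
    print(func.__name__, doc.index("Returns") < doc.index("Parameters"))
```

In a setup like this, a single wrong template produces one error per function that uses it, and the fix belongs in the template rather than in any of the reported functions.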

While fixing 272 GL07 errors, I became very good at navigating the Pandas codebase and learned a lot about how the project is structured. This will be extremely beneficial for my future contributions to Pandas because I am now very familiar with the project.



Non-Resolved GL07 Errors

There were 14 errors left that I could not resolve for now. I explained the reasons in detail to the maintainers of Pandas, and we agreed to find a way to resolve them in the near future. Some of these errors could not be fixed because the final docstring is built by appending two different docstrings that come from different files. While I could put the sections of each docstring in order independently, once they were appended together the overall order was still wrong.
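The problem with appended docstrings can be shown with two made-up fragments. The fragments FIRST and SECOND and the helper position are my own illustration and do not come from Pandas.

```python
def position(text, title):
    """Index of a heading written as a title plus a dashed underline."""
    return text.index(f"{title}\n{'-' * len(title)}")


# Two hypothetical fragments that live in different files.
FIRST = """Summary line.

Parameters
----------
axis : int

Examples
--------
>>> 1 + 1
2
"""

SECOND = """
Returns
-------
int

Notes
-----
Extra detail.
"""

# Each fragment is correctly ordered on its own.
print(position(FIRST, "Parameters") < position(FIRST, "Examples"))
print(position(SECOND, "Returns") < position(SECOND, "Notes"))

# After appending, Examples ends up before Returns and Notes.
combined = FIRST + SECOND
print(position(combined, "Examples") < position(combined, "Returns"))
```

All three comparisons hold, and the last one is exactly the situation that GL07 reports: reordering inside either fragment cannot move Returns ahead of Examples in the combined text.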





