Contributing

This document describes how to contribute to the ProteomicsML resource by adding new tutorials or datasets, or by updating existing ones.

Before you begin

At ProteomicsML, we pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. By interacting with or contributing to ProteomicsML at https://github.com/ProteomicsML or at https://proteomicsml.org, you agree to our Code of Conduct. Violation of our Code of Conduct may ultimately lead to a permanent ban from any sort of public interaction within the community.
🤝 Read the Code of Conduct

If you have an idea for a new tutorial or dataset, or if you have found a mistake, you are welcome to share it with the community by opening a discussion thread in GitHub Discussions or by creating an issue.
💬 Start a discussion thread
💡 Open an issue

The ProteomicsML infrastructure

ProteomicsML uses the Quarto system to publish a static website from markdown and Jupyter (IPython) notebook files. All source files are maintained at ProteomicsML/ProteomicsML. Upon each commit to the main branch (after merging a pull request), the website is rebuilt by GitHub Actions and pushed to the ProteomicsML/proteomicsml.github.io repository, where it is hosted with GitHub Pages as the ProteomicsML.org website. See Website deployment for the full deployment workflow.

How to contribute

Development setup

  1. Fork ProteomicsML/ProteomicsML on GitHub to make your changes.
  2. Clone your fork of the repository to your local machine.
  3. Install Quarto to build the website on your machine.
  4. To preview the website while editing, run: quarto preview . --render html

Maintainers with write access to the repository can skip the first two steps and make a new local branch instead. Direct commits to the main branch are not allowed.
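
Putting these steps together, a typical first-time setup could look as follows (the GitHub username and branch name below are placeholders):

  # Clone your fork of the repository and enter it
  git clone https://github.com/<your-username>/ProteomicsML.git
  cd ProteomicsML

  # Create a new branch for your changes
  git checkout -b add-my-tutorial

  # Preview the website locally while editing
  quarto preview . --render html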

Adding/updating a tutorial

ProteomicsML tutorials are educational Jupyter notebooks that combine fully functional code cells and descriptive text cells. The end result should be a notebook that is easy to comprehend for anyone with a basic understanding of proteomics, programming, and machine learning. When adding or updating a tutorial, please follow these rules and conventions:

  1. Title, filename, metadata, and subheadings

    1. Tutorials are grouped by data type: Detectability, Fragmentation, Ion mobility, and Retention time. Place your tutorial notebook in the appropriate directory in the repository. E.g., tutorials/fragmentation. If your tutorial is part of a new data type group, please open a new discussion thread first.

    2. The filename should be an abbreviated version of the tutorial title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-tutorial.ipynb.

    3. The following front matter metadata items are required (see the Quarto documentation for more info; a short example is sketched after the note below):

      • title: A descriptive sentence-like title
      • authors: All authors who significantly contributed to the tutorial
      • date: Use last-modified to automatically render the correct date
      Note

      Unfortunately, YAML front matter is not rendered by Google Colab. Instead, it is interpreted as plain markdown, and the first cell of the notebook might look out of place when loaded into Google Colab. Nevertheless, the front matter results in a clean header on ProteomicsML.org, the primary platform for viewing tutorials.
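
      The required metadata items could, for example, be combined into front matter along these lines (the title and author names are placeholders):

        ---
        title: A descriptive sentence-like title for the tutorial
        authors:
          - Jane Doe
          - John Smith
        date: last-modified
        ---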

    4. Quarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).

    5. Add an Open with Colab badge directly after the front matter metadata. The badge should be hyperlinked to open the notebook in Colab directly from GitHub. This can be achieved by replacing https://github.com/ with https://colab.research.google.com/github/ in the full URL to the file on GitHub. Additionally, in this URL the filename should be prefixed with an underscore (_); see point 2 in Website deployment for more info on notebook copies for Colab.

      For example:

      [![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProteomicsML/ProteomicsML/blob/main/tutorials/fragmentation/_nist-1-parsing-spectral-library.ipynb)

      renders as the Colab badge image.

      Note

      The URL will not work (or be updated) until the pull request adding or updating the notebook is merged into the main branch.

  2. Subject and contents

    1. Each tutorial should clearly show and describe one or more steps in a certain machine learning workflow for proteomics.

    2. Sufficiently describe each code cell and each step in the workflow.

    3. Tutorials should ideally be linked to a single ProteomicsML dataset from the same group.

    4. While multiple tutorials can be added for a single data type, make sure that each tutorial is sufficiently different from the others in terms of methodology and/or datasets used.

    5. All original publications that describe the methodologies, datasets, or tools used in the tutorial should be properly cited following scientific authoring conventions. To add a citation, add a bibtex entry to references.bib and use the Quarto citation tag. For example, [@ProteomicsML2022] renders to: (Rehfeldt et al. 2022). More info can be found in the Quarto documentation, and an example entry is sketched after the tip below.

      Tip

      Use doi2bib.org to easily get bibtex entries for any given publication.
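
      As an illustration, the entry behind the [@ProteomicsML2022] tag could look roughly like this in references.bib (the exact fields in the actual file may differ):

        @article{ProteomicsML2022,
          title  = {ProteomicsML: An Online Platform for Community-Curated Datasets and Tutorials for Machine Learning in Proteomics},
          author = {Rehfeldt, Tobias Greisager and Gabriels, Ralf and Bouwmeester, Robbin and Gessulat, Siegfried and Neely, Benjamin and Palmblad, Magnus and Perez-Riverol, Yasset and Schmidt, Tobias and Vizca{\'i}no, Juan Antonio and Deutsch, Eric W.},
          year   = {2022},
          month  = {10},
          doi    = {10.26434/chemrxiv-2022-2s6kx}
        }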

  3. Code cells and programming language

    1. Tutorials should work on all major platforms (Linux, Windows, macOS). An exception to this rule can be made if one or more tools central to the tutorial are not cross-platform.

    2. Per ProteomicsML convention, tutorials should use the Python programming language. Exceptions may be allowed if the other language is essential to the tutorial or methodology.

    3. ProteomicsML recommends Google Colab to interactively use tutorial notebooks. Therefore, all code should be backwards compatible with the Python version used by Google Colab. At the time of writing, this is Python 3.7.

    4. Dependencies should ideally be installable with pip. A first code cell can be used to install all requirements using the Jupyter shell prefix !. For instance: ! pip install pandas.

    5. Code should be easy to read. For Python, follow the PEP8 style guide where possible.

    6. Upon pull request (PR) creation, all expected output cells should be present. When rendering the ProteomicsML website, notebooks are not rerun. Therefore, as a final step before submitting your PR, restart the kernel, run all cells from start to finish, and save the notebook. See point 2 in Website deployment for more info on notebook copies for Colab.
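
      If you prefer to do this from the command line, one way (assuming nbconvert is available, as it typically is alongside Jupyter) is:

        # Re-run all cells from a clean state and save the notebook, outputs included
        jupyter nbconvert --to notebook --execute --inplace title-of-tutorial.ipynb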

Adding/updating a dataset

ProteomicsML datasets are community-curated proteomics datasets fit for machine learning. Ideally, each dataset is accompanied by a tutorial. When adding or updating a dataset, please follow these rules and conventions:

  1. Dataset description and data files:

    1. Each dataset is represented as a single markdown file describing the dataset.

    2. The data itself can be added in one of three ways:

      1. If the dataset itself consists of one or more files, each smaller than 50 MB, they can be added in a subfolder with the same name as the markdown file. These files should be individually gzipped to save space and to prevent line-by-line tracking by Git.

        Note

        Gzipped CSV files can very easily be read by Pandas into a DataFrame. Simply use the filename with the .gz suffix in the pandas.read_csv() function and Pandas will automatically unzip the file while reading.
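
        For example, assuming a hypothetical file named title-of-dataset/peptide-data.csv.gz:

          import pandas as pd

          # Pandas infers gzip compression from the .gz suffix and decompresses on the fly
          df = pd.read_csv("title-of-dataset/peptide-data.csv.gz")
          print(df.head())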

      2. Larger files can be added to the ProteomicsML FTP file server by the project maintainers. Please request this in your pull request.

      3. Files that are already publicly and persistently stored elsewhere can be represented by the markdown file alone. In this case, all tutorials using this dataset should start from the file(s) as-is and include any required preprocessing steps. [TODO: List supported platforms]

  2. Title, filename, and metadata:

    1. Datasets are grouped by data type: Fragmentation, Ion mobility, Detectability, or Retention time. Place your dataset and markdown description in the appropriate directory in the repository. E.g., datasets/fragmentation. If your dataset is part of a new data type group, please open a new discussion thread first.

    2. The filename / directory name should be an abbreviated version of the dataset title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-dataset.md / title-of-dataset/.

    3. The following front matter metadata items are required (see the Quarto Documentation for more info):

      • title: A descriptive sentence-like title
      • date: Use last-modified to automatically render the correct date
    4. Quarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).

  3. Dataset description

    Download the readme template, fill out the details, and add download links.

Opening a pull request to add your contributions

  • Commit and push your changes to your fork.
  • Open a pull request with these changes. Choose the pull request template that fits your changes best.
  • The pull request should pass all the continuous integration tests which are automatically run by GitHub Actions.
  • All pull requests should be approved by at least two maintainers before they can be merged.

Becoming a maintainer

If you would like to become a maintainer and review pull requests by others, please start a discussion thread to let us know!

Website deployment

When a pull request has been opened, the following GitHub Action is triggered:
Test website rendering: The full website is rendered to check that no errors occur.

When a pull request has been marked “ready for review”, the following GitHub Action is triggered:
Update notebook copies: A script is run to make copies of all tutorial notebooks with all output removed. The filenames of these copies are prepended with an underscore and should be used to open the notebooks interactively, e.g., in Google Colab.
This script adds a commit to the pull request branch; the pull request can only be merged once this action has run successfully.
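
Conceptually, these copies are similar to what the following nbconvert invocation would produce (this is only an illustration, not the actual script run by the GitHub Action):

  # Write a copy of the notebook with all outputs stripped, prefixed with an underscore
  jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True \
    title-of-tutorial.ipynb --output _title-of-tutorial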

When a pull request is merged with the main branch, the following GitHub Actions are triggered:
Test website rendering: The full website is rendered once more to check that no errors occur. This action should already have run successfully for the pull request that implemented the changes; nevertheless, merging could also introduce new issues.
Publish website: Quarto is used to render the static website, which is then force-pushed to the ProteomicsML/proteomicsml.github.io repository. This repository is served on proteomicsml.org through GitHub Pages.

References

Rehfeldt, Tobias Greisager, Ralf Gabriels, Robbin Bouwmeester, Siegfried Gessulat, Benjamin Neely, Magnus Palmblad, Yasset Perez-Riverol, Tobias Schmidt, Juan Antonio Vizcaíno, and Eric W. Deutsch. 2022. “ProteomicsML: An Online Platform for Community-Curated Datasets and Tutorials for Machine Learning in Proteomics,” October. https://doi.org/10.26434/chemrxiv-2022-2s6kx.