Open source software priorities at Probabl - Chapter 2

Written by Guillaume Lemaitre | Wednesday, November 27 2024

At Probabl, together with the wider community, we continue our dedicated efforts to support and enhance scikit-learn and its ecosystem. In this post, we provide a retrospective on the work accomplished on scikit-learn and other supported open-source packages over the past 6 months. Additionally, we outline our updated priorities and focus areas for the next 6 months.

As we reflect on the progress and look ahead, it is important to reiterate the key considerations that guide our decision-making process:

- Identifying big picture goals and themes that improve user experience and provide tools for the entire machine learning pipeline, from training and exploration to production.
- Insights from community surveys conducted in 2023 to understand core contributors' concerns and priorities. Additionally, we will soon incorporate feedback from the ongoing scikit-learn user survey to further align our efforts with user needs and expectations.
- Commitment to revisiting our list of priorities approximately twice a year.

We would like to emphasize that none of the work detailed below would have been possible without the collaborative efforts of the wider scikit-learn community.

Let's review the achievements and set the stage for our upcoming focus areas.

Retrospective on the past 6 months

This section summarizes the progress of Probabl's work on the different supported open-source projects.

`scikit-learn`

In a previous blog post, we outlined Probabl's priorities for scikit-learn. In this section, we report on the progress of these priorities.

Short-term priorities

1. Website Migration: We successfully migrated the scikit-learn website to the new PyData theme-based website. This modernization effort improves the user experience and aligns our documentation with the broader PyData ecosystem.
2. SLEP6 - Metadata Routing: We completed adapting all estimators (apart from AdaBoost) to handle metadata routing as specified in SLEP6. This new API consistently handles metadata (e.g., sample_weight, groups, etc.) for almost all estimators. It allows for fewer bugs and enables new use cases that were not possible before.

Both of these projects represent major milestones for scikit-learn. In our current round of priorities, we outline follow-up work for both the website and metadata routing to further improve and expand on these changes.

Mid-term priorities

1. Implementation of a new scikit-learn meta-estimator, called TunedThresholdClassifierCV, which tunes the decision threshold used for the operational decision. In addition, the FixedThresholdClassifier estimator has been added; it applies a pre-specified threshold when making the operational decision.
2. Progress on the Array-API adoption by making it possible to train and test a pipeline containing a PCA and Ridge estimator trained on torch tensors located on GPU.
3. Some work to improve the developer experience by improving the developer tools available in scikit-learn and making them available for third-party projects. For instance, we made public some internal APIs that are required to create scikit-learn estimators (e.g., data validation, estimator tags). Also, we started to improve the tests checking the API conformance of third-party libraries with scikit-learn. More work is needed in this area to make it easier to create estimators and check their compatibility with scikit-learn.

Long-term priorities

1. General maintenance of the project: We reviewed a substantial number of pull requests and issues opened by external contributors. In practice, a member of the scikit-learn maintainers team makes sure to triage newly opened issues and provide a first quick response to newly opened pull requests.
2. Python 3.13 support: We worked on supporting Python 3.13 free-threaded mode and set up regular testing to ensure compatibility until it becomes officially available. We also closely collaborate with the NumPy, SciPy, and Quansight teams to have a shared effort to ensure compatibility with upstream projects.
3. Releases: Since February 2024, we released scikit-learn 1.5.0, 1.5.1, and 1.5.2.
4. Continuous Integration and Continuous Deployment: We kept our continuous integration and continuous deployment infrastructure up to date.
5. Community outreach: Since February 2024, we attended multiple Python conferences to promote the latest work on scikit-learn: PyCon Lithuania, PyCon Italia, CZI Open Science, EuroSciPy, PyData Amsterdam, and PyData Paris.

`Computing orchestration`

We dedicated resources to maintaining the projects related to computing orchestration (joblib, loky, cloudpickle, and threadpoolctl) and worked on supporting Python free-threaded mode.

A new version of threadpoolctl was released in the past few months and includes better support for BLAS libraries (FlexiBLAS, OpenBLAS, Netlib, Accelerate, etc.).
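
For readers unfamiliar with threadpoolctl, here is a short sketch of its two main entry points: threadpool_info to list the native thread pools loaded in the process, and threadpool_limits to cap them temporarily. The matrix size and the limit of one thread are arbitrary, and which libraries show up depends on how NumPy was built on your machine.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native libraries (BLAS, OpenMP, ...) detected in this process.
for library in threadpool_info():
    print(library["user_api"], library["internal_api"], library["num_threads"])

a = np.random.default_rng(0).standard_normal((500, 500))

# Temporarily restrict BLAS to a single thread for this block only.
with threadpool_limits(limits=1, user_api="blas"):
    limited = threadpool_info()
    product = a @ a

print([lib["num_threads"] for lib in limited if lib["user_api"] == "blas"])
```

The previous limits are restored when the with block exits, which makes this convenient to combine with process- or thread-level parallelism without oversubscribing the CPU.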

We dedicated substantial work to ensuring scikit-learn's compatibility with Python 3.13's free-threaded mode, including extensive testing, necessary adaptations, and reporting any issues to upstream projects.

`fairlearn`

In the past few months, the main focus of the project has been to ensure that the estimators developed in fairlearn are compatible with scikit-learn. To achieve this, we used the testing framework provided by scikit-learn to test the fairlearn estimators.

We also helped with the release of fairlearn and made sure that it is compatible with the different upstream dependencies (e.g. numpy, scikit-learn).

`skops`

The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.

`skrub`

In the past few months, we focused on delivering new features such as:

- Support for polars dataframes in skrub estimators.
- Adding interactive reports of dataframes through a TableReport that provides information to carry out Exploratory Data Analysis (EDA).
- Easing the creation of predictive models by removing boilerplate code using a factory function called tabular_learner.
- Improving the landing page of the skrub website to make it more user-friendly and informative.

`Python in the browser (WebAssembly)`

We made sure that scikit-learn is compatible with the WebAssembly stack: pyodide, jupyterlite. We reported potential issues upstream and helped run the SciPy test suite for the pyodide project.

`hazardous`

The current focus for this project is to define the scope of the library such that it does not overlap with existing tools. While the code is developed in parallel with a research project, we are working on improving tests and documentation to make the library more robust and ready for a first release.

Focus areas for the next 6 months

As we look ahead to the next six months, we have identified several key areas where we will concentrate our efforts to further enhance scikit-learn and its ecosystem. These focus areas align with our general objectives and aim to address current challenges and opportunities in the machine learning landscape.

`scikit-learn`

We start by focusing on the future work for scikit-learn.

Statistical, algorithmic, and numerical correctness: ogrisel, snath-xoc, antoinebaker, jeremiedbb
- We will continue our maintenance work to ensure that sample weights are correctly handled in estimators.
- We will improve the documentation to better explain which cross-validation strategies should be used in line with the end-user's goal.
- We will work on improving the solvers of linear models.
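
One property that "correct handling of sample weights" usually refers to is that fitting with integer weights should match fitting on a dataset where each sample is repeated that many times. The sketch below checks this for LinearRegression on synthetic data; it is an illustration of the idea, not the check used in the scikit-learn test suite.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data with integer weights in {1, 2, 3}.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
weights = rng.randint(1, 4, size=50)

# Fit once with weights, once on data where each row is repeated `weight` times.
weighted = LinearRegression().fit(X, y, sample_weight=weights)
repeated = LinearRegression().fit(
    np.repeat(X, weights, axis=0), np.repeat(y, weights)
)

print(np.allclose(weighted.coef_, repeated.coef_))
print(np.allclose(weighted.intercept_, repeated.intercept_))
```
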
New visualizations: lucyleeow, glemaitre, jeremiedbb
- We will develop some additional displays to help with model inspection.
- We will work on adapting the existing displays to visualize the outputs of cross-validation results.
- We will search for ways to improve the HTML representation of estimators.
SLEP023 Callbacks API: jeremiedbb, adrinjalali, glemaitre, ogrisel
- We will work on merging the infrastructure for callbacks together with a callback to provide progress bars.
- Subsequently, we will work on additional callbacks.
- Some of the callbacks will require discussion with maintainers of other projects. We will make sure to coordinate with them to provide the best experience.
Improve histogram gradient-boosting: GaelVaroquaux, glemaitre, lesteve, ogrisel
- We will benchmark the statistical performance of the HistGradientBoosting estimators to understand what improvements are needed to match the performance of LightGBM and XGBoost.
- We will work on improving the computational performance following the work of lorentzenchr.
- We will assess the potential issues, pros, and cons of merging the GradientBoosting and HistGradientBoosting estimators into a single class. This assessment will feed into the discussion on the best way to move forward.
Array API: betatim, ogrisel, lesteve, StefanieSenger, OmarManzoor, EmilyXinyi
- We will continue our work on the Array API adoption. Notably, we will work on adopting the API for Nystroem, Ridge and at least one solver of LogisticRegression.
Python 3.13 support: lesteve, ogrisel
- We will continue our work on supporting Python 3.13 free-threaded mode and release wheels and binaries for the new Python version.
Metadata routing: adrinjalali, StefanieSenger, glemaitre, OmarManzoor, adam2392
- We will continue our work on metadata routing and SLEP6.
- We will define, whenever possible, a default behavior for sample_weight.
- We will provide a visualization tool to understand how metadata are routed between estimators.
- We will write or modify examples to illustrate how to use metadata routing.
- We will make sure to discuss with projects that would benefit from this feature to ensure that the feature is useful for the community.
Model inspection: lucyleeow, glemaitre, TamaraAtanasoska
- We will check on adapting the current mean decrease in impurity (MDI) to work on test sets.
- We will work on SLEP11 to provide a unified interface to inspect the "feature importances" of estimators.
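
To show why evaluating importances on a test set matters, the following sketch puts the current MDI (available as feature_importances_ and computed from the training data) next to permutation importance computed on held-out data. The dataset and forest settings are arbitrary; permutation importance is used here only as an existing test-set-based point of comparison, not as the planned MDI adaptation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 8 (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI: derived from impurity decreases observed on the training set.
mdi = forest.feature_importances_

# Permutation importance: drop in score on the held-out test set.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

for i in range(X.shape[1]):
    print(f"feature {i}: MDI={mdi[i]:.3f}  permutation={perm.importances_mean[i]:.3f}")
```
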
Developer API: adrinjalali, glemaitre, adam2392, Charlie-XIAO, OmarManzoor
- We will continue our work on improving the test suite by migrating some common tests to the estimator checks, finalizing common check categories, and improving the associated documentation.
Documentation improvements:
- We plan to further improve the gallery of examples, the contributor guide, and the developer guide. To carry on this work, we will organize two dedicated sprints focusing especially on the documentation.
Project maintenance:
- We will continue to regularly triage issues and review community pull requests.
- We will make sure to keep our continuous integration and continuous deployment infrastructure up to date.
- We will make sure to reinstate our ASV benchmark, allowing us to detect potential performance regressions.
Releases:
- We will continue to release on a regular basis.

`Computing orchestration`

We have to adapt the whole stack (joblib, loky, cloudpickle, threadpoolctl) to the CPython free-threaded mode. This will involve some stress testing to ensure that the parallelism works as expected.
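
As a rough idea of what such a stress test can look like, here is a generic, standard-library-only sketch: it runs the same deterministic function sequentially and across many threads, then checks that both give identical results. The helpers gil_enabled, work, and stress are hypothetical names written for this example and are not taken from the joblib, loky, cloudpickle, or threadpoolctl test suites.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def gil_enabled():
    # sys._is_gil_enabled() exists on Python 3.13+; assume the GIL is on otherwise.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def work(seed):
    # Deterministic pure-Python computation, so results can be compared exactly.
    total = 0
    for i in range(20_000):
        total = (total * 31 + seed + i) % 1_000_003
    return total


def stress(n_tasks=64, n_threads=8):
    # Sequential reference results versus results computed concurrently.
    expected = [work(seed) for seed in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        observed = list(pool.map(work, range(n_tasks)))
    return expected == observed


print("GIL enabled:", gil_enabled())
print("threaded results match sequential results:", stress())
```

On a free-threaded build the threads in this sketch can run truly in parallel, which is where data races in shared state tend to surface; a real stress test would target the libraries' own shared structures rather than a pure function.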

`fairlearn`

TamaraAtanasoska, adrinjalali

We contribute to the recently updated community roadmap, including involvement in contribution sprints, conference talks and community engagement.

Maintaining scikit-learn compatibility remains one of the priorities, along with improving the library's core parts by updating and refactoring the existing codebase. We also support the community in adding new methods and extending the toolkit and the learning resources.

`skops`

TamaraAtanasoska, adrinjalali

The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.

`skrub`

jeromedockes, Vincent-Maladiere, glemaitre, GaelVaroquaux

We will help tackle items from the skrub community roadmap.

`hazardous`

Vincent-Maladiere, ogrisel, glemaitre, GaelVaroquaux

Implement metrics for competing risks
- Concordance Index
- Refine metrics API
Implement early stopping based on validation score
Improve the documentation and examples
Start releasing the package
Generalize MultiIncidenceGradientBoosting as a meta-estimator to wrap any classifier that supports sample_weight

Our contributors

The following open source engineers from Probabl are contributing to the above priorities for the different projects:

Again, we want to acknowledge that all this work would not have been possible without the incredible support of the scikit-learn community. The continuous engagement, feedback, and contributions from community members, whether through code, documentation, bug reports, or discussions, have been instrumental in shaping and advancing these projects.
