Open source software priorities at Probabl - Chapter 2
At Probabl, together with the wider community, we continue our dedicated efforts to support and enhance scikit-learn
and its ecosystem. In this post, we provide a retrospective on the work accomplished on scikit-learn
and other supported open-source packages over the past 6 months. Additionally, we outline our updated priorities and focus areas for the next 6 months.
As we reflect on the progress and look ahead, it is important to reiterate the key considerations that guide our decision-making process:
- Identifying big picture goals and themes that improve user experience and provide tools for the entire machine learning pipeline, from training and exploration to production.
- Insights from community surveys conducted in 2023 to understand core contributors' concerns and priorities. Additionally, we will soon incorporate feedback from the ongoing scikit-learn user survey to further align our efforts with user needs and expectations.
- Commitment to revisiting our list of priorities approximately twice a year.
We would like to emphasize that none of the work detailed below would have been possible without the collaborative efforts of the wider scikit-learn community.
Let's review the achievements and set the stage for our upcoming focus areas.
Retrospective on the past 6 months
This section summarizes the progress of Probabl's work on the different supported open-source projects.
In a previous blog post, we outlined Probabl's priorities for scikit-learn. In this section, we report on the progress of these priorities.
Short-term priorities
- Website Migration: We successfully migrated the scikit-learn website to the new PyData theme-based website. This modernization effort improves the user experience and aligns our documentation with the broader PyData ecosystem.
- SLEP6 - Metadata Routing: We completed adapting all estimators (apart from AdaBoost) to handle metadata routing as specified in SLEP6. This new API consistently handles metadata (e.g., sample_weight, groups, etc.) for almost all estimators. It allows for fewer bugs and enables new use cases that were not possible before.

Both of these projects represent major milestones for scikit-learn. In our current round of priorities, we outline follow-up work for both the website and metadata routing to further improve and expand on these changes.
Mid-term priorities
- Implementation of a new scikit-learn meta-estimator, called TunedThresholdClassifierCV, allowing optimization of the operational decision. In addition, the FixedThresholdClassifier estimator has been added and allows using a pre-specified threshold for operational decision-making.
- Progress on the Array API adoption by making it possible to train and test a pipeline containing a PCA and a Ridge estimator on torch tensors located on GPU.
- Some work to improve the developer experience by improving the developer tools available in scikit-learn and making them available for third-party projects. For instance, we made public some internal APIs that are required to create scikit-learn estimators (e.g., data validation, estimator tags). Also, we started to improve the tests checking the API conformance of third-party libraries with scikit-learn. More work is needed in this area to make it easier to create estimators and check their compatibility with scikit-learn.
Long-term priorities
- General maintenance of the project: We reviewed a substantial number of pull requests and issues opened by external contributors. A member of the scikit-learn maintainers team makes sure to triage newly opened issues and provide a quick first response to newly opened pull requests.
- Python 3.13 support: We worked on supporting Python 3.13 free-threaded mode and set up regular testing to ensure compatibility until it becomes officially available. We also closely collaborate with the NumPy, SciPy, and Quansight teams in a shared effort to ensure compatibility with upstream projects.
- Releases: Since February 2024, we released scikit-learn 1.5.0, 1.5.1, and 1.5.2.
- Continuous Integration and Continuous Deployment: We kept our continuous integration and continuous deployment infrastructure up to date.
- Community outreach: Since February 2024, we attended multiple Python conferences to promote the latest work on scikit-learn: PyCon Lithuania, PyCon Italia, CZI Open Science, EuroSciPy, PyData Amsterdam, and PyData Paris.
Computing orchestration
We dedicated resources to maintain the following projects that are related to computing orchestration: joblib, loky, cloudpickle, and threadpoolctl, and worked on supporting Python free-threaded mode.
A new version of threadpoolctl was released in the past few months and includes better support for BLAS libraries (FlexiBLAS, OpenBLAS, Netlib, Accelerate, etc.).
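For readers unfamiliar with threadpoolctl, here is a short sketch of its two main entry points. The matrix size is arbitrary, and which BLAS implementation gets reported depends on how NumPy was built on the machine running it.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools (BLAS, OpenMP) loaded in the current process.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])

a = np.random.default_rng(0).normal(size=(500, 500))

# Temporarily restrict BLAS calls to a single thread inside this block.
with threadpool_limits(limits=1, user_api="blas"):
    product = a @ a
```

Limiting native threads this way is typically useful when an outer layer (for instance joblib workers) already parallelizes the work, to avoid oversubscribing the CPU.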
We dedicated substantial work to ensuring scikit-learn's compatibility with Python 3.13's free-threaded mode, including extensive testing, necessary adaptations, and reporting any issues to upstream projects.
fairlearn
In the past few months, the main focus of the project has been to ensure that the estimators developed in fairlearn are compatible with scikit-learn. To achieve this, we used the testing framework provided by scikit-learn to test the fairlearn estimators.
We also helped with the release of fairlearn and made sure that it is compatible with the different upstream dependencies (e.g. numpy, scikit-learn).
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
In the past few months, we focused on delivering new features such as:
- Support for polars dataframes in skrub estimators.
- Adding interactive reports of dataframes through a TableReport that provides information to carry out Exploratory Data Analysis (EDA).
- Easing the creation of predictive models by removing boilerplate code using a factory function called tabular_learner.
- Improving the landing page of the skrub website to make it more user-friendly and informative.
Python in the browser (WebAssembly)
We made sure that scikit-learn is compatible with the WebAssembly stack: pyodide, jupyterlite. We reported potential issues upstream and helped run the SciPy test suite for the pyodide project.
hazardous
The current focus for this project is to define the scope of the library such that it does not overlap with existing tools. While the code is developed in parallel with a research project, we are working on improving tests and documentation to make the library more robust and ready for a first release.
Focus Areas for the Next 6 Months
As we look ahead to the next six months, we have identified several key areas where we will concentrate our efforts to further enhance scikit-learn and its ecosystem. These focus areas align with our general objectives and aim to address current challenges and opportunities in the machine learning landscape.
scikit-learn
We start by focusing on the future work for scikit-learn.
- Statistical, algorithmic, and numerical correctness: ogrisel, snath-xoc, antoinebaker, jeremiedbb
  - We will continue our maintenance work to ensure that sample weights are correctly handled in estimators.
  - We will improve the documentation to better explain which cross-validation strategies should be used in line with the end-user's goal.
  - We will work on improving the solvers of linear models.
- New visualizations: lucyleeow, glemaitre, jeremiedbb
  - We will develop some additional displays to help with model inspection.
  - We will work on adapting the existing displays to visualize the outputs of cross-validation results.
  - We will search for ways to improve the HTML representation of estimators.
- SLEP023 Callbacks API: jeremiedbb, adrinjalali, glemaitre, ogrisel
  - We will work on merging the infrastructure for callbacks together with a callback to provide progress bars.
  - Subsequently, we will work on additional callbacks.
  - Some of the callbacks will require discussions with maintainers of other projects, so that we provide the best experience.
- Improve histogram gradient-boosting: GaelVaroquaux, glemaitre, lesteve, ogrisel
  - We will benchmark the statistical performance of the HistGradientBoosting estimators to understand what room for improvement remains to match the performance of LightGBM and XGBoost.
  - We will work on improving the computational performance following the work of lorentzenchr.
  - We will assess the potential issues, pros, and cons of merging the GradientBoosting and HistGradientBoosting estimators into a single class. This will feed into the discussion on the best way to move forward.
- Array API: betatim, ogrisel, lesteve, StefanieSenger, OmarManzoor, EmilyXinyi
  - We will continue our work on the Array API adoption. Notably, we will work on adopting the API for Nystroem, Ridge and at least one solver of LogisticRegression.
- Python 3.13 support: lesteve, ogrisel
  - We will continue our work on supporting Python 3.13 free-threaded mode and releasing wheels and binaries for the new Python version.
- Metadata routing: adrinjalali, StefanieSenger, glemaitre, OmarManzoor, adam2392
  - We will continue some further work related to metadata routing and SLEP6.
  - We will define, whenever possible, a default behavior for sample_weight.
  - We will provide a visualization tool to understand how metadata are routed between estimators.
  - We will write or modify examples to illustrate how to use metadata routing.
  - We will make sure to discuss with projects that would benefit from this feature to ensure that the feature is useful for the community.
- Model inspection: lucyleeow, glemaitre, TamaraAtanasoska
  - We will look into adapting the current mean decrease in impurity (MDI) to work on test sets.
  - We will work on SLEP11 to provide a unified interface to inspect the "feature importances" of estimators.
- Developer API: adrinjalali, glemaitre, adam2392, Charlie-XIAO, OmarManzoor
  - We will continue our work on improving the test suite by migrating some common tests to the estimator checks, finalizing common check categories, and improving the associated documentation.
- Documentation improvements:
  - We plan to further improve the gallery of examples, the contributor guide, and the developer guide. To carry on this work, we will organize two dedicated sprints focusing especially on the documentation.
- Project maintenance:
  - We will continue our regular triage of issues and reviewing community pull requests.
  - We will make sure to keep our continuous integration and continuous deployment infrastructure up to date.
  - We will make sure to reinstate our ASV benchmark, allowing us to detect potential performance regressions.
- Releases:
  - We will continue to release on a regular basis.
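To make the sample-weight item at the top of this list more concrete: one property commonly expected from an estimator is that fitting with an integer weight k on a sample is equivalent to repeating that sample k times. The sketch below checks this for Ridge on made-up data. It only illustrates the property and is not the actual test used in the scikit-learn test suite.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic regression problem, invented for the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Integer weights: sample i should count weights[i] times.
weights = rng.integers(1, 4, size=50)

# Fit once with weights, once on the explicitly repeated dataset.
weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
repeated = Ridge(alpha=1.0).fit(
    np.repeat(X, weights, axis=0), np.repeat(y, weights)
)

coef_match = np.allclose(weighted.coef_, repeated.coef_)
intercept_match = np.allclose(weighted.intercept_, repeated.intercept_)
print(coef_match, intercept_match)
```

Estimators with internal randomness or resampling make this kind of check harder to state, since the two fits can only be compared in distribution.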
Computing orchestration
We have to adapt the whole stack (joblib, loky, cloudpickle, threadpoolctl) to the CPython free-threaded mode. This will involve some stress testing to ensure that the parallelism works as expected.
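As a point of reference, thread-based parallelism with joblib looks like the following sketch. The square function and the number of workers are placeholders chosen for illustration. On a regular CPython build, such pure-Python work is serialized by the global interpreter lock, which is the limitation the free-threaded mode aims to lift.

```python
from joblib import Parallel, delayed


def square(x):
    # Placeholder for CPU-bound Python code.
    return x * x


# Thread-based backend: workers share the memory of a single interpreter.
results = Parallel(n_jobs=4, backend="threading")(
    delayed(square)(i) for i in range(8)
)
print(results)
```

Results are returned in the order of the submitted tasks, regardless of which thread finishes first.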
Fairlearn
We contribute to the recently updated community roadmap, including involvement in contribution sprints, conference talks and community engagement.
Maintaining scikit-learn compatibility remains one of the priorities, as does improving the library's core parts by updating and refactoring the existing codebase. We also support the community in adding new methods and extending the toolkit and the learning resources.
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
jeromedockes, Vincent-Maladiere, glemaitre, GaelVaroquaux
We will help tackle items from the skrub community roadmap.
hazardous
Vincent-Maladiere, ogrisel, glemaitre, GaelVaroquaux
- Implement metrics for competing risks
- Concordance Index
- Refine metrics API
- Implement early stopping based on validation score
- Improve the documentation and examples
- Start releasing the package
- Generalize MultiIncidenceGradientBoosting as a meta-estimator to wrap any classifier that supports sample_weight
Our contributors
The following open source engineers from Probabl are contributing to the above priorities for the different projects:
- Adrin Jalali
- Antoine Baker
- Guillaume Lemaitre
- Jérémie du Boisberranger
- Loïc Esteve
- Olivier Grisel
- Shruti Nath
- Stefanie Senger
- Tamara Atanasoska
- Vincent Maladière
Again, we want to acknowledge that all this work would not have been possible without the incredible support of the scikit-learn community. The continuous engagement, feedback, and contributions from community members, whether through code, documentation, bug reports, or discussions, have been instrumental in shaping and advancing these projects.