At Probabl, together with the wider community, we continue our dedicated efforts to support and enhance scikit-learn and its ecosystem. In this post, we provide a retrospective on the work accomplished on scikit-learn and other supported open-source packages over the past 6 months. Additionally, we outline our updated priorities and focus areas for the next 6 months.
As we reflect on the progress and look ahead, it is important to reiterate the key considerations that guide our decision-making process:
scikit-learn survey to further align our efforts with user needs and expectations.
We would like to emphasize that none of the work detailed below would have been possible without the collaborative efforts of the wider scikit-learn community.
Let's review the achievements and set the stage for our upcoming focus areas.
This section summarizes the progress of Probabl's work on the different supported open-source projects.
In a previous blog post, we outlined Probabl's priorities for scikit-learn. In this section, we report on the progress of these priorities.
Short-term priorities
scikit-learn website to the new PyData theme-based website. This modernization effort improves the user experience and aligns our documentation with the broader PyData ecosystem.
AdaBoost) to handle metadata routing as specified in SLEP6. This new API consistently handles metadata (e.g., sample_weight, groups, etc.) for almost all estimators. It allows for fewer bugs and enables new use cases that were not possible before.
Both of these projects represent major milestones for scikit-learn. In our current round of priorities, we outline follow-up work for both the website and metadata routing to further improve and expand on these changes.
Mid-term priorities
scikit-learn meta-estimator, called TunedThresholdClassifierCV, allowing optimization of the operational decision. In addition, the FixedThresholdClassifier estimator has been added and allows using a pre-specified threshold for operational decision-making.
PCA and Ridge estimators trained on torch tensors located on GPU.
scikit-learn and making them available for third-party projects. For instance, we made public some internal APIs that are required to create scikit-learn estimators (e.g., data validation, estimator tags). Also, we started to improve the tests checking the API conformance of third-party libraries with scikit-learn. More work is needed in this area to make it easier to create estimators and check their compatibility with scikit-learn.
Long-term priorities
scikit-learn 1.5.0, 1.5.1, and 1.5.2.
scikit-learn; PyCon Lithuania, PyCon Italia, CZI Open Science, EuroSciPy, PyData Amsterdam, and PyData Paris.
Computing orchestration
We dedicated resources to maintaining the following projects related to computing orchestration: joblib, loky, cloudpickle, and threadpoolctl. We also worked on supporting Python's free-threaded mode.
A new version of threadpoolctl was released in the past few months and includes better support for BLAS libraries (FlexiBLAS, OpenBLAS, Netlib, Accelerate, etc.).
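For readers unfamiliar with threadpoolctl, the following is a minimal sketch of its two main entry points, threadpool_info and threadpool_limits. It assumes NumPy is installed and linked against a BLAS library that threadpoolctl detects. The matrix size and the limit of one thread are arbitrary choices for the illustration.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools (BLAS, OpenMP) loaded in this process.
# The list may be empty if no supported library is detected.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])

a = np.random.default_rng(0).normal(size=(200, 200))

# Temporarily restrict BLAS calls to a single thread, for instance to
# avoid oversubscription when an outer layer already runs in parallel.
with threadpool_limits(limits=1, user_api="blas"):
    product = a @ a

print(product.shape)
```

The previous limits are restored when the with block exits.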
We dedicated substantial work to ensuring scikit-learn's compatibility with Python 3.13's free-threaded mode, including extensive testing, necessary adaptations, and reporting any issues to upstream projects.
fairlearn
In the past few months, the main focus of the project has been to ensure that the estimators developed in fairlearn are compatible with scikit-learn. To achieve this, we used the testing framework provided by scikit-learn to test the fairlearn estimators.
We also helped with the release of fairlearn and made sure that it is compatible with the different upstream dependencies (e.g. numpy, scikit-learn).
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
In the past few months, we focused on delivering new features such as:
polars dataframes in skrub estimators.
TableReport that provides information to carry out Exploratory Data Analysis (EDA).
tabular_learner.
skrub website to make it more user-friendly and informative.
Python in the browser (WebAssembly)
We made sure that scikit-learn is compatible with the WebAssembly stack: pyodide and jupyterlite. We reported potential issues upstream and helped run the SciPy test suite for the pyodide project.
hazardous
The current focus for this project is to define the scope of the library such that it does not overlap with existing tools. While the code is developed in parallel with a research project, we are working on improving tests and documentation to make the library more robust and ready for a first release.
As we look ahead to the next six months, we have identified several key areas where we will concentrate our efforts to further enhance scikit-learn and its ecosystem. These focus areas align with our general objectives and aim to address current challenges and opportunities in the machine learning landscape.
scikit-learn
We start by focusing on the future work for scikit-learn.
HistGradientBoosting estimators to understand the room for improvement of these estimators to match the performance of LightGBM and XGBoost.
lorentzenchr.
GradientBoosting and HistGradientBoosting estimators into a single class. This will feed into the discussion to make a decision on the best way to move forward.
Nystroem, Ridge and at least one solver of LogisticRegression.
sample_weight.
Computing orchestration
We have to adapt the whole stack (joblib, loky, cloudpickle, threadpoolctl) to the CPython free-threaded mode. This will involve some stress testing to ensure that the parallelism works as expected.
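To give an idea of what such checks can look like, here is a small standard-library sketch. It is not the test suite used for joblib or loky. It reports whether the interpreter is a free-threaded build and whether the GIL is currently enabled, then runs a stand-in workload in several threads and compares the results with a sequential run. The function names free_threading_status, sum_of_squares and stress are hypothetical, chosen for this example.

```python
import sys
import sysconfig
from concurrent.futures import ThreadPoolExecutor


def free_threading_status():
    """Report build-time and run-time free-threading information."""
    # Py_GIL_DISABLED is set to 1 on free-threaded CPython builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled exists on Python 3.13+; older versions always
    # run with the GIL, hence the fallback returning True.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": is_gil_enabled(),
    }


def sum_of_squares(n):
    # Stand-in CPU-bound workload (an assumption made for this sketch).
    return sum(i * i for i in range(n))


def stress(n_tasks=64, n_threads=8):
    """Return True if threaded results match the sequential ones."""
    inputs = [1_000 + i for i in range(n_tasks)]
    expected = [sum_of_squares(n) for n in inputs]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        observed = list(pool.map(sum_of_squares, inputs))
    return observed == expected


status = free_threading_status()
threads_agree = stress()
print(status, threads_agree)
```

A real stress test would replace the stand-in workload with calls into the library under test and repeat the run many times, since race conditions tend to appear only intermittently.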
Fairlearn
We contribute to the recently updated community roadmap, including involvement in contribution sprints, conference talks and community engagement.
Maintaining scikit-learn compatibility remains one of the priorities, as does improving the library's core parts by updating and refactoring the existing codebase. We also support the community in adding new methods and extending the toolkit and the learning resources.
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
jeromedockes, Vincent-Maladiere, glemaitre, GaelVaroquaux
We will help tackle items from the community roadmap available at.
hazardous
Vincent-Maladiere, ogrisel, glemaitre, GaelVaroquaux
MultiIncidenceGradientBoosting as a meta-estimator to wrap any classifier that supports sample_weight
The following open source engineers from Probabl are contributing to the above priorities for the different projects:
Again, we want to acknowledge that all this work would not have been possible without the incredible support of the scikit-learn community. The continuous engagement, feedback, and contributions from community members, whether through code, documentation, bug reports, or discussions, have been instrumental in shaping and advancing these projects.