At Probabl, together with the wider community, we continue our dedicated efforts to support and enhance scikit-learn and its ecosystem. In this post, we provide a retrospective on the work accomplished on scikit-learn and other supported open-source packages over the past 6 months. Additionally, we outline our updated priorities and focus areas for the next 6 months.
As we reflect on the progress and look ahead, it is important to reiterate the key considerations that guide our decision-making process:
scikit-learn survey to further align our efforts with user needs and expectations.
We would like to emphasize that none of the work detailed below would have been possible without the collaborative efforts of the wider scikit-learn community.
Let's review the achievements and set the stage for our upcoming focus areas.
This section summarizes the progress of Probabl's work on the different supported open-source projects.
In a previous blog post, we outlined Probabl's priorities for scikit-learn. In this section, we report on the progress of these priorities.
Short-term priorities
scikit-learn website to the new PyData theme-based website. This modernization effort improves the user experience and aligns our documentation with the broader PyData ecosystem.
AdaBoost) to handle metadata routing as specified in SLEP6. This new API consistently handles metadata (e.g., sample_weight, groups) for almost all estimators. It reduces the risk of bugs and enables new use cases that were not possible before.
Both of these projects represent major milestones for scikit-learn. In our current round of priorities, we outline follow-up work for both the website and metadata routing to further improve and expand on these changes.
Mid-term priorities
scikit-learn meta-estimator, called TunedThresholdClassifierCV, allowing optimization of the operational decision. In addition, the FixedThresholdClassifier estimator has been added and makes it possible to apply a pre-specified threshold for operational decision-making.
PCA and Ridge estimators trained on torch tensors located on GPU.
scikit-learn and making them available for third-party projects. For instance, we made public some internal APIs that are required to create scikit-learn estimators (e.g., data validation, estimator tags). We also started to improve the tests that check the API conformance of third-party libraries with scikit-learn. More work is needed in this area to make it easier to create estimators and check their compatibility with scikit-learn.
Long-term priorities
scikit-learn 1.5.0, 1.5.1, and 1.5.2.
scikit-learn; PyCon Lithuania, PyCon Italia, CZI Open Science, EuroSciPy, PyData Amsterdam, and PyData Paris.
Computing orchestration
We dedicated resources to maintaining the projects related to computing orchestration (joblib, loky, cloudpickle, and threadpoolctl) and worked on supporting the Python free-threaded mode.
A new version of threadpoolctl was released in the past few months and includes better support for BLAS libraries (FlexiBLAS, OpenBLAS, Netlib, Accelerate, etc.).
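As an illustration of what threadpoolctl is used for, the sketch below lists the native thread pools detected in the current process and temporarily limits BLAS to a single thread. The matrix size is arbitrary, and which libraries are reported depends on how NumPy was built on the machine running the code.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Inspect the native libraries (BLAS, OpenMP, ...) loaded in this process.
libraries = threadpool_info()
for lib in libraries:
    print(lib["user_api"], lib["internal_api"], lib["num_threads"])

rng = np.random.default_rng(0)
a = rng.random((200, 200))

# Restrict BLAS to one thread for the duration of the block only.
with threadpool_limits(limits=1, user_api="blas"):
    product = a @ a
```

Limiting the inner thread pools this way is a common remedy for oversubscription when the outer code already runs in parallel.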
We dedicated substantial work to ensuring scikit-learn's compatibility with Python 3.13's free-threaded mode, including extensive testing, necessary adaptations, and reporting any issues to upstream projects.
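When testing against the free-threaded mode, a first step is to check which kind of interpreter is actually running. The helper below is a small sketch (the function name describe_gil is ours, not part of any library); it assumes that sys._is_gil_enabled is only available on Python 3.13 and later and falls back to "GIL enabled" otherwise.

```python
import sys
import sysconfig


def describe_gil():
    """Report whether this is a free-threaded build and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on Python 3.13+.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if gil_check is None else bool(gil_check())
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


status = describe_gil()
print(status)
```

Note that a free-threaded build can still run with the GIL re-enabled, for instance after importing an extension module that does not declare free-threading support, which is why both values are reported.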
fairlearn
In the past few months, the main focus of the project has been to ensure that the estimators developed in fairlearn are compatible with scikit-learn. To achieve this, we used the testing framework provided by scikit-learn to test the fairlearn estimators.
We also helped with the release of fairlearn and made sure that it is compatible with the different upstream dependencies (e.g. numpy, scikit-learn).
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
In the past few months, we focused on delivering new features such as:
polars dataframes in skrub estimators.
TableReport that provides information to carry out Exploratory Data Analysis (EDA).
tabular_learner.
skrub website to make it more user-friendly and informative.
Python in the browser (WebAssembly)
We made sure that scikit-learn is compatible with the WebAssembly stack: pyodide, jupyterlite. We reported potential issues upstream and helped run the SciPy test suite for the pyodide project.
hazardous
The current focus for this project is to define the scope of the library such that it does not overlap with existing tools. While the code is developed in parallel with a research project, we are working on improving tests and documentation to make the library more robust and ready for a first release.
As we look ahead to the next six months, we have identified several key areas where we will concentrate our efforts to further enhance scikit-learn and its ecosystem. These focus areas align with our general objectives and aim to address current challenges and opportunities in the machine learning landscape.
scikit-learn
We start by focusing on the future work for scikit-learn.
HistGradientBoosting estimators to understand the room for improvement of these estimators to match the performance of LightGBM and XGBoost.
lorentzenchr.
GradientBoosting and HistGradientBoosting estimators into a single class. This will feed into the discussion to make a decision on the best way to move forward.
Nystroem, Ridge and at least one solver of LogisticRegression.
sample_weight.
Computing orchestration
We have to adapt the whole stack (joblib, loky, cloudpickle, threadpoolctl) to the CPython free-threaded mode. This will involve some stress testing to ensure that the parallelism works as expected.
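A stress test of the kind mentioned above can be as simple as running the same thread-based workload many times and comparing each result with a sequential reference. The sketch below uses joblib's threading backend; the workload (partial_sum), the chunk sizes, and the number of repetitions are arbitrary choices for illustration and do not reflect the actual test suites of these projects.

```python
from joblib import Parallel, delayed


def partial_sum(start, stop):
    """CPU-bound toy task: sum of squares over a half-open range."""
    return sum(i * i for i in range(start, stop))


# 32 independent chunks and their sequential reference results.
chunks = [(i * 1_000, (i + 1) * 1_000) for i in range(32)]
expected = [partial_sum(start, stop) for start, stop in chunks]

# Repeat the threaded run several times to surface intermittent failures.
for repetition in range(20):
    results = Parallel(n_jobs=4, backend="threading")(
        delayed(partial_sum)(start, stop) for start, stop in chunks
    )
    assert results == expected, f"mismatch at repetition {repetition}"

print("all repetitions matched the sequential reference")
```

On a free-threaded interpreter the threads can run truly in parallel, so repeated runs like this are more likely to expose races that the GIL would otherwise hide.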
Fairlearn
We contribute to the recently updated community roadmap, including involvement in contribution sprints, conference talks and community engagement.
Maintaining scikit-learn compatibility remains one of the priorities, along with improving the library's core parts by updating and refactoring the existing codebase. We also support the community in adding new methods and extending the toolkit and the learning resources.
skops
The main activity in this project relates to non-trivial maintenance tasks to ensure that the project is in a healthy state and compatible with the latest versions of scikit-learn and NumPy.
skrub
jeromedockes, Vincent-Maladiere, glemaitre, GaelVaroquaux
We will help tackle items from the community roadmap.
hazardous
Vincent-Maladiere, ogrisel, glemaitre, GaelVaroquaux
MultiIncidenceGradientBoosting as a meta-estimator to wrap any classifier that supports sample_weight
The following open source engineers from Probabl are contributing to the above priorities for the different projects:
Again, we want to acknowledge that all this work would not have been possible without the incredible support of the scikit-learn community. The continuous engagement, feedback, and contributions from community members, whether through code, documentation, bug reports, or discussions, have been instrumental in shaping and advancing these projects.