Tuesday, February 13 2024

Anything Classification with scikit-learn

When you think of "tools for machine learning", it is likely that you'll think of scikit-learn. It's a very popular library, with over a million daily downloads, that has been around for over a decade. However, when you think of "tools for image classification", you may instead think of tools like Keras or PyTorch. Scikit-learn has historically been best known for tabular use-cases, not image classification. But at the same time, there's nothing stopping scikit-learn from doing image classification as well.

Natively, scikit-learn comes with a way to construct a machine learning pipeline, as well as some components that click together nicely. You can select columns from a table, preprocess them and then pass them on to a machine learning algorithm. Image classification needs different components to deal with the images, but you can still re-use the idea behind the pipeline.

Pipeline Drawing
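To make the tabular case concrete, here is a minimal sketch of such a pipeline. The table, its column names and the choice of preprocessing steps are made up for illustration; only the scikit-learn components themselves are real.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up table, purely for illustration
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "city": ["ams", "nyc", "ams", "nyc", "ams", "nyc"],
    "bought": [0, 0, 1, 1, 0, 1],
})

# Select columns and preprocess each kind in its own way
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model click together into a single estimator
tabular_pipeline = make_pipeline(preprocess, LogisticRegression())
tabular_pipeline.fit(df[["age", "city"]], df["bought"])
predictions = tabular_pipeline.predict(df[["age", "city"]])
```

The whole pipeline behaves like one model: `fit` and `predict` run the preprocessing and the classifier in order.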

This is where plugins can make a big difference. For image classification in particular you might use embetter, because it contains components that turn images into numeric features that a scikit-learn model can learn from. You can still re-use the pipeline and the machine learning model; you just swap out the parts that deal with the preprocessing.

Pipeline Drawing
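The idea of swapping out the preprocessing can be sketched with a toy component. The `MeanColorFeaturizer` class below is hypothetical and is not how embetter works internally; it only shows that anything with `fit` and `transform` methods that returns numeric features can sit in front of a scikit-learn model. The "images" are synthetic arrays, not real pictures.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class MeanColorFeaturizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an image encoder: averages each colour channel."""

    def fit(self, X, y=None):
        # Nothing to learn; the method exists to satisfy the scikit-learn API
        return self

    def transform(self, X):
        # Each image is an array of shape (height, width, 3);
        # the output has one row per image and one column per channel
        return np.array([np.asarray(img).mean(axis=(0, 1)) for img in X])


# Synthetic "images": dark ones labelled 0, bright ones labelled 1
rng = np.random.default_rng(0)
dark = [rng.uniform(0.0, 0.3, size=(8, 8, 3)) for _ in range(10)]
bright = [rng.uniform(0.7, 1.0, size=(8, 8, 3)) for _ in range(10)]
images = dark + bright
labels = [0] * 10 + [1] * 10

# Same pipeline idea as before, only the preprocessing step differs
toy_pipeline = make_pipeline(MeanColorFeaturizer(), LogisticRegression())
toy_pipeline.fit(images, labels)
features = MeanColorFeaturizer().transform(images)
```

A real encoder produces far richer features than three channel averages, but the contract with the rest of the pipeline is the same.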

What's particularly nice about the embetter approach is that, under the hood, it re-uses pre-trained neural networks to handle the heavy lifting of preprocessing the images. By following that up with a simple classifier, you'll be able to get reasonable performance with only a fraction of the effort of building a model from scratch. Here's what the code might look like to build such a pipeline.

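# Load scikit-learn components
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Load embetter components
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder

# Build a pipeline as you would normally
image_emb_pipeline = make_pipeline(
    ImageLoader(convert="RGB"),  # load each image file from its path
    ClipEncoder(),               # embed the image with a pre-trained CLIP model
    LogisticRegression()         # simple classifier on top of the embeddings
)

# Fit the whole model as you would normally
# (assumes image_paths is a list of image file paths and
# image_labels holds the matching labels)
image_emb_pipeline.fit(image_paths, image_labels)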

It deserves to be highlighted that there are still trade-offs here. If you want to use the latest techniques, or if you want to fully customise your model, then you may want to use Keras or PyTorch instead. But being able to re-use scikit-learn for this use-case has some serious benefits. Not only can you keep using a familiar framework, but this approach also works very well for rapid prototyping.

The great thing about having an ecosystem of plugins is also that people can contribute. Right now embetter has tools for embedding text and images as numeric features, but if somebody wants to contribute techniques for audio, it will unlock even more applications for very rapid prototyping.

Scikit-learn is incredibly flexible.

I'm using images as an example here, but they illustrate a larger point. Even in the age of LLMs and large deep learning models, scikit-learn is still an incredibly flexible and pragmatic tool to have around. It's easy to re-use ideas from other fields and get them working in the scikit-learn ecosystem and _that_ is incredibly powerful.

To make sure lessons like this aren't forgotten, we've decided to invest in a YouTube series where we can showcase some ideas from the ecosystem worth sharing. It gives us the opportunity to highlight great tools from the ecosystem as well as techniques that deserve to be in the spotlight. The first few videos are online right now, including this first one about image classification in scikit-learn.