Transfer Learning in [Database] Using SuperDuperDB

· 4 min read
Lalith Sagar Devagudi

In this blog post, we demonstrate how to use transfer learning directly in your database with SuperDuperDB, so you can improve your AI models efficiently and streamline your development process.


Transfer learning has become a cornerstone of modern AI development. By utilizing pre-trained models and fine-tuning them for specific tasks, developers can achieve high performance with less data and computation. However, integrating transfer learning with your data stored in MongoDB presents a unique challenge.

For this demo, we use MongoDB as an example. SuperDuperDB also supports other databases, including SQL databases with vector support, such as Postgres, and databases without native vector support, such as MySQL and DuckDB. See the documentation for more information about its functionality and the range of data integrations it supports.

A common question arises:

How can I implement transfer learning directly within my MongoDB environment?

The Challenge

Transfer learning involves adapting a pre-trained model to a new, related task. This requires not only storing the data but also having access to computational resources and models. Traditional approaches often involve significant setup and integration work, leading to complexities and inefficiencies.
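To make the idea concrete, here is a minimal, library-free sketch of transfer learning: a frozen "pre-trained" feature extractor with a small trainable classifier on top. The function names, the toy word lists, and the data are all hypothetical and chosen only for illustration; a real setup would use an actual pre-trained model, as in the steps later in this post.

```python
import math


def frozen_features(text):
    """Stand-in for a pre-trained encoder: fixed, never updated during training."""
    words = text.lower().split()
    positive = {"good", "great", "love"}
    negative = {"bad", "poor", "hate"}
    return [
        sum(w in positive for w in words),
        sum(w in negative for w in words),
    ]


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on the frozen features with plain SGD."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, else class 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


# Toy labelled data: only the head is trained, the feature extractor stays fixed.
texts = ["good great product", "bad poor service", "love it", "hate it"]
labels = [1, 0, 1, 0]
features = [frozen_features(t) for t in texts]
weights, bias = train_head(features, labels)

print(predict(weights, bias, frozen_features("great value")))
```

Only the two weights and the bias are learned here, which is why this style of training needs far less data and computation than training the feature extractor itself.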

The Solution: SuperDuperDB

SuperDuperDB simplifies the integration of transfer learning with MongoDB, providing a seamless and scalable solution.

SuperDuperDB bridges the gap between AI models and databases, offering a flexible and powerful framework to perform transfer learning directly within your MongoDB environment.

  • Flexibility: Integrate any model from the Python ecosystem, including torch, sklearn, transformers, and OpenAI's API.
  • Scalability: Use scalable compute resources co-located with your database, with support for both vertical and horizontal scaling.
  • Ease of Use: Minimize boilerplate code and avoid complex architectures. SuperDuperDB handles vector computations, storage, and retrieval natively.

Minimal Boilerplate to Connect to MongoDB

Connecting SuperDuperDB to MongoDB is straightforward. For this demo, we will be using a local MongoDB instance. However, this can be switched to MongoDB with authentication or even a MongoDB Atlas URI.

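from superduperdb import superduper
from superduperdb.backends.mongodb import Collection  # import path may differ between SuperDuperDB versions

# Connect to a local MongoDB instance and wrap it as a SuperDuperDB datalayer
db = superduper('mongodb://localhost:27017/test')

# Reference the collection used for the queries in this post
my_collection = Collection("documents")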

The MongoDB URI can point to either a locally hosted MongoDB instance or a MongoDB Atlas cluster. See the documentation for more information.
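The three kinds of URI mentioned above (local, authenticated, and Atlas) differ only in scheme, credentials, and host. The sketch below assembles each form using the standard library; `build_mongo_uri` is a hypothetical helper, not part of SuperDuperDB, and the host name and credentials are placeholders.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, database, username=None, password=None, port=27017, srv=False):
    """Hypothetical helper: assemble a MongoDB connection string."""
    scheme = "mongodb+srv" if srv else "mongodb"
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay valid.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    # SRV URIs resolve their hosts through DNS and do not include a port.
    netloc = host if srv else f"{host}:{port}"
    return f"{scheme}://{credentials}{netloc}/{database}"


# Local instance without authentication, as used in this demo.
local_uri = build_mongo_uri("localhost", "test")

# Local instance with authentication (placeholder credentials).
auth_uri = build_mongo_uri("localhost", "test", username="admin", password="p@ss:word")

# Atlas-style SRV URI (placeholder cluster host).
atlas_uri = build_mongo_uri(
    "cluster0.example.mongodb.net", "test", username="admin", password="secret", srv=True
)

print(local_uri)
print(auth_uri)
print(atlas_uri)
```

Any of these strings can be passed to `superduper(...)` in place of the local URI shown earlier.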

Implement Transfer Learning with SuperDuperDB

Setting up transfer learning involves fine-tuning a pre-trained model on your specific dataset. SuperDuperDB makes this process easy with minimal setup.

Creating Custom Datatypes in SuperDuperDB

SuperDuperDB lets you define custom datatypes for data that your database backend does not natively support, so you can insert any type of data into your database.

For example, you can create custom datatypes for vectors, tensors, arrays, PDFs, images, audio files, videos, and more. See the "Create datatype" page in the documentation for more examples and details on how to construct custom datatypes.
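Conceptually, a custom datatype pairs an encoder, which turns an object into something the database can store, with a decoder that reverses the step. The sketch below illustrates that idea with the standard library only. `SimpleDataType` and its fields are hypothetical and are not SuperDuperDB's actual datatype API.

```python
import struct
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleDataType:
    """Hypothetical stand-in for a custom datatype: a named encoder/decoder pair."""
    identifier: str
    encoder: Callable[[object], bytes]
    decoder: Callable[[bytes], object]


def encode_vector(values):
    # Pack a list of floats as little-endian 32-bit floats.
    return struct.pack(f"<{len(values)}f", *values)


def decode_vector(blob):
    # Each float32 occupies 4 bytes.
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


vector_type = SimpleDataType("float32-vector", encode_vector, decode_vector)

# Round trip: the bytes are what a backend without vector support would store.
stored = vector_type.encoder([0.5, -1.25, 2.0])
restored = vector_type.decoder(stored)
print(len(stored), restored)
```

The same pattern extends to images, audio, or PDFs: only the encoder and decoder functions change.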

Step 1: Load a Pre-trained Model

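First, we load a pre-trained model. In this example, we use a model from the sentence-transformers library and apply a Listener so that its embeddings are computed for the documents in the collection.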

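import sentence_transformers
from superduperdb import vector, Listener
from superduperdb.ext.sentence_transformers import SentenceTransformer

# Wrap the pre-trained sentence-transformers model as a SuperDuperDB model
superdupermodel = SentenceTransformer(
    identifier="embedding",
    object=sentence_transformers.SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2"),
    datatype=vector(shape=(384,)),  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    postprocess=lambda x: x.tolist(),
)

# Query selecting the documents to embed, and the field to read from each one
select = my_collection.find()
key = "x"  # assumption: replace with the name of your input field

# The listener computes embeddings for the selected documents and stores them in the database
jobs, listener = db.apply(
    Listener(
        model=superdupermodel,
        select=select,
        key=key,
        identifier="features",
    )
)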
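As a rough mental model of what the listener does, the sketch below computes an output for every document that lacks one and stores it next to the source data. It runs on an in-memory list rather than a database. `toy_embed` and `run_listener` are hypothetical names, the input field `"x"` is an assumption, and the `_outputs` / `features::0` layout simply mirrors the keys read in the prediction snippet later in this post; it is a simplification, not SuperDuperDB's implementation.

```python
def toy_embed(text):
    """Stand-in for the sentence-transformer: maps text to a small fixed-size vector."""
    return [float(len(text)), float(text.count(" ") + 1)]


def run_listener(documents, key, identifier, model):
    """Compute model outputs for documents that lack them and store them in place."""
    output_key = f"{identifier}::0"
    processed = 0
    for doc in documents:
        outputs = doc.setdefault("_outputs", {})
        if output_key in outputs:
            continue  # already computed for this document; skip it
        outputs[output_key] = model(doc[key])
        processed += 1
    return processed


# "x" is an assumed input field; "y" is the label field used later in this post.
documents = [{"x": "hello world", "y": 1}, {"x": "bye", "y": 0}]
first_pass = run_listener(documents, key="x", identifier="features", model=toy_embed)

# A newly inserted document is picked up on the next pass; existing ones are skipped.
documents.append({"x": "a new document", "y": 1})
second_pass = run_listener(documents, key="x", identifier="features", model=toy_embed)

print(first_pass, second_pass)
print(documents[0]["_outputs"]["features::0"])
```

Because the outputs live alongside the documents, a downstream model can train on them with an ordinary query.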

Step 2: Train the Model

Next, we train a model on a specific task, using data from MongoDB together with the features (embeddings) computed above by the SentenceTransformer model.

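from sklearn.linear_model import LogisticRegression
from superduperdb.ext.sklearn.model import SklearnTrainer, Estimator

# Train on the embeddings produced by the listener, with the label field as the target
input_key = listener.outputs
target_key = "y"  # label field, as read in the prediction example below

# Create a logistic regression model and wrap it with a trainer
model = Estimator(
    object=LogisticRegression(),
    identifier='my-model',
    trainer=SklearnTrainer(
        identifier='my-trainer',
        key=(input_key, target_key),
        select=select,  # the same query passed to the Listener above
    ),
)

# Applying the model to the datalayer trains it in a single step
db.apply(model)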

Deploy and Use the Trained Model

Once the model is trained, you can deploy it and use it for predictions on new data directly within MongoDB.

import numpy as np

# Fetch one document along with the embedding stored for it by the listener
for doc in db.execute(my_collection.find().limit(1)):
    test_sample_x = doc["_outputs"]['features::0']
    test_sample_y = doc['y']

print("actual: ", test_sample_y)
print("predicted: ", db.load('model', 'my-model').predict(np.array([test_sample_x])))

Conclusion

SuperDuperDB provides a robust way to integrate transfer learning with your database (MongoDB in this example), reducing complexity and improving scalability. By using SuperDuperDB, developers can focus on building and fine-tuning models without worrying about the underlying infrastructure and data management.

To explore more, check out our other use cases in the documentation.

Contributors are welcome!

SuperDuperDB is open source and permissively licensed under the Apache 2.0 license. We encourage developers interested in open-source development to contribute through our discussion forums, issue boards, and pull requests. We'll see you on GitHub!