Create Embeddings
Convert text to embeddings
In this section, we'll take a simple text document and convert it to embeddings.
Many models can convert text to embeddings, ranging from open source to closed source and from cheap to expensive. The objective of this guide is to learn about RAG and build a simple RAG application, so the choice of model is not critical here.
We're going to use OpenAI's text-embedding-3-small model, as most of us are already familiar with OpenAI.
Below is the method for converting text to an embedding (an array of numbers). Each model has its own way of converting text to embeddings, and each produces arrays of its own length.
Typescript Code:
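import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default
const client = new OpenAI();

export async function createEmbeddingOpenAI(text: string): Promise<number[]> {
  const embeddingModel = "text-embedding-3-small";
  const embedding = await client.embeddings.create({
    input: text,
    model: embeddingModel,
    encoding_format: "float",
  });
  // A single input string yields a single item in `data`
  return embedding.data[0].embedding;
}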
Python Code:
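import os
import openai

# Read the API key from the environment rather than hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")

def create_embedding_openai(text: str) -> list[float]:
    # Request the embedding as a list of floats for a single piece of text
    response = openai.embeddings.create(input=text, model="text-embedding-3-small", encoding_format="float")
    return response.data[0].embedding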
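When there are several texts to embed, one request per text can be slow. The sketch below assumes that the embeddings endpoint also accepts a list of strings as input and that each returned item carries an index field pointing back at its input position. The function name create_embeddings_batch and the FakeEmbeddings stand-in are hypothetical and exist only so the example runs offline; they are not part of the guide's code.

```python
from types import SimpleNamespace


def create_embeddings_batch(client, texts: list[str],
                            model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed several texts in one request and return the vectors in input order."""
    if not texts:
        return []
    response = client.embeddings.create(input=texts, model=model, encoding_format="float")
    # Assumption: each item has an `index` matching its position in `texts`
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class FakeEmbeddings:
    """Offline stand-in for client.embeddings (illustration only, not a real model)."""

    def create(self, input, model, encoding_format):
        data = [SimpleNamespace(index=i, embedding=[float(len(t)), 0.0])
                for i, t in enumerate(input)]
        # Return the items out of order to show that sorting by index restores it
        return SimpleNamespace(data=list(reversed(data)))


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = create_embeddings_batch(fake_client, ["hi", "hello"])
```

With the real service, the same function would be called with an OpenAI client object in place of fake_client.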
Now we have the ability to convert text to embeddings. Next, we need to know how to store them.
In the last section, we mentioned that we're going to use the pgvector extension for Postgres to store them.
We're going to use Supabase for the Postgres database, as they provide a free database.
Once you've enabled the extension, you can store the data in a vector column.
Below is a simple Postgres table with 2 columns: doc_name and embedding.
The embedding column is responsible for storing vector information. The length of the vector depends on the embedding model that you use.
As we're using the text-embedding-3-small model, the vector length is 1536.
CREATE TABLE docs (
  doc_name TEXT NOT NULL,
  embedding vector(1536),
  PRIMARY KEY (doc_name)
);
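Because the column is declared as vector(1536), a vector of any other length cannot be stored in it. A minimal sketch of a check that could run before inserting is shown here; the names EXPECTED_DIM and validate_embedding are hypothetical and not part of the guide's code.

```python
import math

EXPECTED_DIM = 1536  # matches vector(1536) in the docs table


def validate_embedding(embedding: list[float], expected_dim: int = EXPECTED_DIM) -> list[float]:
    """Return the embedding unchanged, or raise ValueError if it does not fit the column."""
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    # NaN or infinite coordinates are not meaningful vector values
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinite values")
    return embedding
```

If the embedding model is ever swapped for one with a different output length, both the column definition and the expected dimension here would need to change together.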
We know how to convert text to an embedding, and we know where to store it. Let's write some utility functions to store the embedding in the Postgres database.
Typescript Code:
import { createClient } from "@supabase/supabase-js";

// Connection settings come from the environment
const supabaseUrl = process.env.SUPABASE_URL || "";
const supabaseKey = process.env.SUPABASE_KEY || "";
export const supabase = createClient(supabaseUrl, supabaseKey);

export async function insertEmbedding(docName: string, embedding: number[]) {
  const { error } = await supabase.from("docs").insert([
    {
      doc_name: docName,
      embedding: embedding,
    },
  ]);
  // Surface database errors instead of silently ignoring them
  if (error) {
    throw new Error(`Failed to insert ${docName}: ${error.message}`);
  }
}
Python Code:
import os
from supabase import create_client, Client
from dotenv import load_dotenv

load_dotenv()

# Initialize Supabase client
url: str = os.environ.get("SUPABASE_URL")
key: str = os.environ.get("SUPABASE_KEY")
supabase: Client = create_client(url, key)

def insert_embedding(doc_name: str, embedding: list[float]):
    # upsert overwrites the existing row when doc_name is already present
    supabase.table("docs").upsert({"doc_name": doc_name, "embedding": embedding}).execute()

def get_embedding(doc_name: str) -> list[float]:
    # Read from the same "docs" table that insert_embedding writes to
    response = supabase.table("docs").select("embedding").eq("doc_name", doc_name).execute()
    return response.data[0]["embedding"]
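One thing to watch when reading a vector back: depending on the client, the column may arrive as pgvector's text form (for example "[0.1,0.2]") rather than as a list of numbers. That behaviour is an assumption here, not something the guide states, and the helper name parse_vector is hypothetical. A small sketch that accepts either shape:

```python
import json


def parse_vector(value) -> list[float]:
    """Accept a list of numbers or pgvector's text form and return a list of floats."""
    if isinstance(value, str):
        # The text form "[0.1,0.2]" is also valid JSON
        value = json.loads(value)
    return [float(x) for x in value]
```

Wrapping the value returned by get_embedding in parse_vector would give callers a list of floats in both cases.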
Now, let's write code to read file contents, create embeddings, and store them in the database.
Typescript Code:
import { promises as fs } from "fs";

async function createEmbeddingFromFiles(files: string[]) {
  for (const file of files) {
    // Read the file, embed its full contents, and store it under the file path
    const text = await fs.readFile(file, "utf8");
    const embedding = await createEmbeddingOpenAI(text);
    await insertEmbedding(file, embedding);
  }
}
Python Code:
def create_embedding_from_files(files: list[str]):
    for file in files:
        # Read the file, embed its full contents, and store it under the file path
        with open(file, "r", encoding="utf-8") as f:
            text = f.read()
        embedding = create_embedding_openai(text)
        insert_embedding(file, embedding)
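The functions above expect a list of file paths. One way to build that list, sketched here under the assumption that the documents are .txt files kept under a single folder, is to walk the folder with pathlib. The helper name list_text_files is hypothetical.

```python
from pathlib import Path


def list_text_files(root: str, suffix: str = ".txt") -> list[str]:
    """Return the sorted paths of all files under root whose names end with suffix."""
    return sorted(str(p) for p in Path(root).rglob(f"*{suffix}") if p.is_file())
```

Since the file path is used as doc_name, which is the primary key, running the pipeline from the same working directory each time keeps the keys stable across runs.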
We're now able to convert our documents to embeddings and store them in the database. Next, let's discuss how we can query the documents.