Create Embeddings

Convert text to embeddings

In this section, we'll take a simple text document and convert it to embeddings.

Many models can convert text to embeddings, ranging from open source to closed source and from cheap to expensive. Since the objective of this guide is to learn about RAG and build a simple RAG application, we won't compare them here.

We'll use OpenAI's text-embedding-3-small model, since most readers are already familiar with OpenAI.

Below is a function that converts text to an embedding (an array of numbers). Each model converts text to embeddings in its own way, and each produces arrays of a model-specific length.

TypeScript Code:

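import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
const client = new OpenAI();

export async function createEmbeddingOpenAI(text: string): Promise<number[]> {
  const embeddingModel = "text-embedding-3-small";
  const response = await client.embeddings.create({
    input: text,
    model: embeddingModel,
    encoding_format: "float",
  });
  // One input string yields exactly one embedding.
  return response.data[0].embedding;
}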

Python Code:

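import os

from openai import OpenAI

# The client needs an API key; here it is read from the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def create_embedding_openai(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text, model="text-embedding-3-small", encoding_format="float"
    )
    # One input string yields exactly one embedding.
    return response.data[0].embedding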
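
The embeddings endpoint also accepts a list of strings as input, so several texts can be embedded in one request. Below is a minimal sketch of that idea. The function name create_embeddings_batch is hypothetical, and the client is passed in as a parameter (assumed to be an OpenAI client object) so the helper does not depend on how the client was created.

```python
from typing import Any


def create_embeddings_batch(
    client: Any, texts: list[str], model: str = "text-embedding-3-small"
) -> list[list[float]]:
    """Embed several texts with a single request (hypothetical helper)."""
    if not texts:
        # Nothing to embed, so skip the API call entirely.
        return []
    response = client.embeddings.create(
        input=texts, model=model, encoding_format="float"
    )
    # Each result carries the index of its input; sort so output order matches input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Each returned list corresponds to the text at the same position in the input list.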

Now that we can convert text to embeddings, we need somewhere to store them. In the last section, we mentioned that we'll use the pgvector extension for Postgres. We'll use Supabase to host the Postgres database because it offers a free database. Once you've enabled the extension, you can store embeddings in a vector column.

Below is a simple Postgres table with two columns: doc_name and embedding.

The embedding column stores the vector. The length of the vector depends on the embedding model you use. Because we're using text-embedding-3-small, which returns 1536-dimensional vectors by default, the vector length is 1536.

CREATE TABLE docs (
    doc_name TEXT NOT NULL,
    embedding vector(1536),
    PRIMARY KEY (doc_name)
);
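
Because the column is declared as vector(1536), an embedding of any other length will not fit it. The sketch below checks the length before storing, so a model mismatch shows up as a clear error on the application side. The names EXPECTED_DIMENSIONS and check_dimensions are hypothetical and not part of any library.

```python
EXPECTED_DIMENSIONS = 1536  # must match vector(1536) in the docs table


def check_dimensions(
    embedding: list[float], expected: int = EXPECTED_DIMENSIONS
) -> list[float]:
    """Return the embedding unchanged, or raise if its length is unexpected."""
    if len(embedding) != expected:
        raise ValueError(
            f"expected {expected} dimensions, got {len(embedding)}"
        )
    return embedding
```

If you switch to a different embedding model, both the constant and the column definition would need to change together.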

We now know how to convert text to an embedding and where to store it. Let's write some utility functions to store the embedding in the Postgres database.

TypeScript Code:

import { createClient } from "@supabase/supabase-js";

const supabaseUrl = process.env.SUPABASE_URL || "";
const supabaseKey = process.env.SUPABASE_KEY || "";

export const supabase = createClient(supabaseUrl, supabaseKey);

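export async function insertEmbedding(docName: string, embedding: number[]) {
  const { error } = await supabase.from("docs").insert([
    {
      doc_name: docName,
      embedding: embedding,
    },
  ]);
  // Surface failures (for example, a duplicate doc_name) instead of ignoring them.
  if (error) {
    throw new Error(`Failed to insert embedding for ${docName}: ${error.message}`);
  }
}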

Python Code:

import os

from dotenv import load_dotenv
from supabase import create_client, Client

# Load SUPABASE_URL and SUPABASE_KEY from a local .env file, if present.
load_dotenv()

# Initialize Supabase client
url: str = os.environ.get("SUPABASE_URL", "")
key: str = os.environ.get("SUPABASE_KEY", "")
supabase: Client = create_client(url, key)


def insert_embedding(doc_name: str, embedding: list[float]):
    supabase.table("docs").upsert({"doc_name": doc_name, "embedding": embedding}).execute()

def get_embedding(doc_name: str) -> list[float]:
    # Query the same "docs" table that insert_embedding writes to.
    response = supabase.table("docs").select("embedding").eq("doc_name", doc_name).execute()
    return response.data[0]["embedding"]
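
When reading an embedding back, it is worth checking the type of the value you receive. Depending on the client version, a pgvector column may arrive as a string such as "[0.1,0.2]" rather than a list of numbers; treat that as an assumption to verify against your own responses. The hypothetical helper parse_vector below is a minimal sketch that accepts either form.

```python
import json


def parse_vector(value) -> list[float]:
    """Normalize a stored vector to a list of floats (hypothetical helper)."""
    if isinstance(value, str):
        # pgvector's text form, e.g. "[0.1,0.2]", is also valid JSON.
        value = json.loads(value)
    return [float(x) for x in value]
```

It could be applied to the value returned by the database before using the embedding in any numeric computation.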

Now, let's write code that reads file contents, creates embeddings, and stores them in the database.

TypeScript Code:

import { promises as fs } from "fs";

async function createEmbeddingFromFiles(files: string[]) {
  for (const file of files) {
    // Each file is embedded whole and stored with its path as doc_name.
    const text = await fs.readFile(file, "utf8");
    const embedding = await createEmbeddingOpenAI(text);
    await insertEmbedding(file, embedding);
  }
}

Python Code:

def create_embedding_from_files(files: list[str]):
    for file in files:
        with open(file, "r", encoding="utf-8") as f:
            text = f.read()
        # The file is closed before the network calls; its path is used as doc_name.
        embedding = create_embedding_openai(text)
        insert_embedding(file, embedding)
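
The function above expects a list of file paths. One way to build that list is to collect every text file in a folder, as in the sketch below. The helper name list_text_files and the "*.txt" pattern are assumptions made for illustration.

```python
from pathlib import Path


def list_text_files(directory: str, pattern: str = "*.txt") -> list[str]:
    """Return the matching files in a directory, sorted for a stable order."""
    return sorted(
        str(path) for path in Path(directory).glob(pattern) if path.is_file()
    )
```

With a hypothetical folder named documents, the two pieces could be combined as create_embedding_from_files(list_text_files("documents")).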

We can now convert our documents to embeddings and store them in the database. Next, let's discuss how to query the documents.