Create Embeddings

Convert text to embeddings

In this section, we'll take a simple text document and convert it to embeddings.

There are many models available for converting text to embeddings, ranging from open source to closed source and from cheap to expensive. The objective of this guide is to learn about RAG and build a simple RAG application.

We're going to use OpenAI's text-embedding-3-small model, since most of us are already familiar with OpenAI.

Below is the method for converting text to an embedding (an array of numbers). Each model has its own way of converting text to embeddings, and each produces an array of its own length.

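import OpenAI from "openai";

// Assumes the official openai package; skip these two lines if `client` is
// already defined elsewhere. By default it reads the OPENAI_API_KEY variable.
const client = new OpenAI();

// Converts a piece of text to an embedding (an array of numbers).
export async function createEmbeddingOpenAI(text: string): Promise<number[]> {
  const embeddingModel = "text-embedding-3-small";
  const embedding = await client.embeddings.create({
    input: text,
    model: embeddingModel,
    encoding_format: "float",
  });
  // A single input string yields a single entry in `data`.
  return embedding.data[0].embedding;
}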

Now that we can convert text to embeddings, we need somewhere to store them. In the last section, we mentioned that we're going to use the pgvector extension for Postgres. We're going to use Supabase for the Postgres database, as they provide a free database. Once you've enabled the extension, you can store the data in a vector column.

Below is a simple Postgres table with two columns: doc_name and embedding.

The embedding column stores the vector. The length of the vector depends on the embedding model you use. Since we're using the text-embedding-3-small model, the vector length is 1536.

CREATE TABLE docs (
    doc_name TEXT NOT NULL,
    embedding vector(1536),
    PRIMARY KEY (doc_name)
);
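Because the column is declared as vector(1536), an embedding of any other length will not fit it. The following is a minimal sketch of a guard that checks the length before you try to store anything. The checkEmbeddingLength helper and the EMBEDDING_DIMENSIONS table are hypothetical names introduced here for illustration; the sizes listed are the default output lengths of those OpenAI models.

```typescript
// Default output lengths for a few OpenAI embedding models.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
  "text-embedding-ada-002": 1536,
};

// Hypothetical guard, not part of any library: returns the embedding unchanged
// if its length matches the model's expected size, and throws otherwise.
function checkEmbeddingLength(model: string, embedding: number[]): number[] {
  const expected = EMBEDDING_DIMENSIONS[model];
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  if (embedding.length !== expected) {
    throw new Error(
      `${model} should produce ${expected} numbers, got ${embedding.length}`,
    );
  }
  return embedding;
}
```

If you switch to a model with a different length, the table definition and this check would both need to change together.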

We know how to convert text to an embedding and where to store it. Let's write a utility function to store the embedding in the Postgres database.

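import { createClient } from "@supabase/supabase-js";

// Both variables must be set; an empty URL or key will not give a working client.
const supabaseUrl = process.env.SUPABASE_URL || "";
const supabaseKey = process.env.SUPABASE_KEY || "";

export const supabase = createClient(supabaseUrl, supabaseKey);

// Inserts one row into the docs table and throws if Supabase reports an error.
export async function insertEmbedding(docName: string, embedding: number[]) {
  const { error } = await supabase.from("docs").insert([
    {
      doc_name: docName,
      embedding: embedding,
    },
  ]);
  if (error) {
    throw new Error(`Failed to insert ${docName}: ${error.message}`);
  }
}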
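The code above falls back to an empty string when SUPABASE_URL or SUPABASE_KEY is not set, which can hide a configuration mistake until the first request. As a rough sketch, you could fail early with a clear message instead. The requireEnv helper below is a hypothetical name, not part of supabase-js.

```typescript
// Hypothetical helper: reads a required variable from an env-like object and
// throws a descriptive error if it is missing or blank.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a Node.js project you would call it as requireEnv(process.env, "SUPABASE_URL") and pass the result to createClient.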

Now let's write code to read each file's contents, create an embedding, and store it in the database.

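import { promises as fs } from "node:fs";

// Reads each file, converts its text to an embedding, and stores it
// under the file's path as doc_name. Files are processed one at a time.
async function createEmbeddingFromFiles(files: string[]) {
  for (const file of files) {
    const text = await fs.readFile(file, "utf8");
    const embedding = await createEmbeddingOpenAI(text);
    await insertEmbedding(file, embedding);
  }
}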
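The loop above stops at the first file that fails to read, embed, or insert. If you would rather process every file and review the failures afterwards, one possible variation is sketched below. The ingestFiles function, its types, and the idea of passing the read, embed, and store steps in as arguments are assumptions made for this example so that it runs without a network or database.

```typescript
type ReadFn = (file: string) => Promise<string>;
type EmbedFn = (text: string) => Promise<number[]>;
type StoreFn = (docName: string, embedding: number[]) => Promise<void>;

interface IngestResult {
  stored: string[];
  failed: { file: string; reason: string }[];
}

// Hypothetical variation: processes files one by one and records failures
// instead of stopping at the first error.
async function ingestFiles(
  files: string[],
  read: ReadFn,
  embed: EmbedFn,
  store: StoreFn,
): Promise<IngestResult> {
  const result: IngestResult = { stored: [], failed: [] };
  for (const file of files) {
    try {
      const text = await read(file);
      const embedding = await embed(text);
      await store(file, embedding);
      result.stored.push(file);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      result.failed.push({ file, reason });
    }
  }
  return result;
}
```

In this guide's setup you would pass a function that reads the file as UTF-8 text, createEmbeddingOpenAI, and insertEmbedding as the three steps.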

We can now convert our documents to embeddings and store them in the database. Next, let's discuss how to query the documents.