---
nav_order: 0035
parent: Decision Records
---

# Generate Embeddings Online

## Context and Problem Statement

To run a question-and-answer (Q&A) session over research papers
with a large language model (LLM), we need to process each file: the file is
converted to a string, the string is split into chunks, and an embedding vector
is generated for each chunk.
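
The processing steps above can be sketched roughly as follows. This is a minimal illustration, not JabRef's actual implementation: the class `ChunkingSketch`, the `Embedder` interface, the character-based splitting strategy, and the chunk size and overlap values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: all names and sizes are hypothetical, not JabRef code.
class ChunkingSketch {

    /** Hypothetical abstraction over anything that turns text into a vector. */
    interface Embedder {
        float[] embed(String text);
    }

    /**
     * Splits text into chunks of at most chunkSize characters.
     * Neighbouring chunks share `overlap` characters so that a sentence
     * cut at a boundary still appears intact in one of them.
     */
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }

    /** Generates one embedding vector per chunk. */
    static List<float[]> embedAll(List<String> chunks, Embedder embedder) {
        List<float[]> vectors = new ArrayList<>();
        for (String chunk : chunks) {
            vectors.add(embedder.embed(chunk));
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Step 1 (file -> string) is assumed to have happened already.
        String paperText = "Large language models answer questions over research papers.";
        // Step 2: string -> chunks.
        List<String> chunks = split(paperText, 20, 5);
        // Step 3: chunk -> vector. The toy embedder only records the chunk length;
        // a real one would call a local model or an online API.
        Embedder toy = text -> new float[] {text.length()};
        List<float[]> vectors = embedAll(chunks, toy);
        System.out.println(chunks.size() + " chunks, " + vectors.size() + " vectors");
    }
}
```

The `Embedder` interface is where the two options considered in this record would differ: one implementation would run a model locally, the other would call an online service.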

Where should these embeddings be generated?

## Considered Options

* Local embedding model with `langchain4j`
* OpenAI embedding API

## Decision Drivers

* Embedding generation should be fast
* Embeddings should have good performance (that is, they should capture the semantics of the text well; see also [MTEB](https://huggingface.co/blog/mteb))
* Generating embeddings should be cheap
* Embeddings should be small in size
* Embedding models and the library used to run them should not add much to the size of the distribution binary

## Decision Outcome

Chosen option: "OpenAI embedding API", because
the distribution size of JabRef will be nearly unaffected. It is also fast
and performs better than `all-MiniLM-L6-v2`, the model available in `langchain4j`.
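
As a rough illustration of what the chosen option involves, the sketch below builds (but does not send) an HTTP request to OpenAI's embeddings endpoint using only the JDK's `java.net.http` package. It is not JabRef's implementation: the class `EmbeddingRequestSketch`, its helper methods, the hand-written JSON escaping, and the choice of the `text-embedding-3-small` model are assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch only: class and method names are hypothetical, not JabRef code.
class EmbeddingRequestSketch {

    static final String ENDPOINT = "https://api.openai.com/v1/embeddings";
    // Assumption: this record does not say which OpenAI model is used.
    static final String MODEL = "text-embedding-3-small";

    /** Minimal JSON string escaping; a real client would use a JSON library. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            switch (ch) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (ch < 0x20) {
                        sb.append(String.format("\\u%04x", (int) ch));
                    } else {
                        sb.append(ch);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body for embedding a single chunk of text. */
    static String buildBody(String chunk) {
        return "{\"model\":\"" + MODEL + "\",\"input\":\"" + jsonEscape(chunk) + "\"}";
    }

    /** Builds the POST request; sending it requires an HttpClient and a valid API key. */
    static HttpRequest buildRequest(String apiKey, String chunk) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(chunk)))
                .build();
    }

    public static void main(String[] args) {
        String chunk = "A chunk of \"paper\" text";
        HttpRequest request = buildRequest("sk-placeholder", chunk);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(buildBody(chunk));
    }
}
```

The request carries the chunk text and the user's API key to a third-party service, which is the source of the privacy and cost trade-offs listed for this option. Nothing model-related needs to ship in the distribution binary.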

## Pros and Cons of the Options

### Local embedding model with `langchain4j`

* Good, because it works locally: privacy is preserved and no Internet connection is required
* Good, because the user doesn't pay for anything
* Neutral, because the speed of embedding generation depends on the chosen model. It may be small and fast, or big and time-consuming
* Neutral, because local embedding models may perform worse than OpenAI's, for example. (In practice, most embedding models suitable for use in JabRef are only about 50% performant.)
* Bad, because embedding generation consumes the user's computer resources
* Bad, because the only framework to run embedding models in Java is ONNX, and it adds a lot to the size of the distribution binary

### OpenAI embedding API

* Good, because we delegate the task of generating embeddings to an online service, so the user's computer is free for other work
* Good, because OpenAI models typically have better performance
* Good, because JabRef's distribution size will be practically unaffected
* Bad, because the user must agree to send data to a third-party service, and an Internet connection is required
* Bad, because the user pays for embedding generation (see also [OpenAI embedding models pricing](https://platform.openai.com/docs/guides/embeddings/embedding-models))