---
nav_order: 0035
parent: Decision Records
---
# Generate Embeddings Online
## Context and Problem Statement
In order to perform question and answering (Q&A) sessions over research papers
with a large language model (LLM), we need to process each file: the file is
converted to a string, the string is split into chunks, and for each chunk an
embedding vector is generated (a minimal sketch of this pipeline is shown below).
Where should these embeddings be generated?
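As an illustration of the pipeline described above, here is a minimal sketch using `langchain4j` building blocks. The chunk sizes are arbitrary, and exact class and package names depend on the library version in use.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;

import java.util.List;

class EmbeddingPipelineSketch {
    void index(String fileText, EmbeddingModel embeddingModel) {
        // 1. The file has already been converted to a string (fileText).
        Document document = Document.from(fileText);

        // 2. Split the string into chunks (here: at most 500 characters, 50 characters overlap).
        List<TextSegment> chunks = DocumentSplitters.recursive(500, 50).split(document);

        // 3. Generate one embedding vector per chunk; where this call runs is the
        //    question this decision record answers.
        for (TextSegment chunk : chunks) {
            Embedding embedding = embeddingModel.embed(chunk).content();
            // ... store the embedding together with its chunk for later retrieval
        }
    }
}
```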
## Considered Options
* Local embedding model with `langchain4j`
* OpenAI embedding API
## Decision Drivers
* Embedding generation should be fast
* Embeddings should have good performance (performance here means they "capture the semantics" well, see also [MTEB](https://huggingface.co/blog/mteb))
* Generating embeddings should be cheap
* Embeddings should not be large
* The embedding model and the library used to generate embeddings should not add much size to the distribution binary
## Decision Outcome
Chosen option: "OpenAI embedding API", because
the distribution size of JabRef will be nearly unaffected. Also, it's fast
and has a better performance, in comparison to available in `langchain4j`'s model `all-MiniLM-L6-v2`.
## Pros and Cons of the Options
### Local embedding model with `langchain4j`
* Good, because it works locally: privacy is preserved and no Internet connection is required
* Good, because the user doesn't pay for anything
* Neutral, because the speed of embedding generation depends on the chosen model: it may be small and fast, or big and time-consuming
* Neutral, because local embedding models may perform worse than, for example, OpenAI's (in fact, most embedding models small enough to be suitable for JabRef reach only about 50% of the performance of the best available models)
* Bad, because embedding generation consumes the user's computer resources
* Bad, because the only framework to run embedding models in Java is ONNX, which adds considerable size to the distribution binary
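A rough sketch of how this option would look: the `AllMiniLmL6V2EmbeddingModel` class comes from a separate `langchain4j` embeddings artifact, and its package name has changed between library versions, so treat this as illustrative only.

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;

class LocalEmbeddingSketch {
    public static void main(String[] args) {
        // Runs all-MiniLM-L6-v2 in-process via ONNX Runtime: no network access and no cost,
        // but the model weights and the ONNX runtime have to ship with the application.
        EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();
        Embedding embedding = model.embed(TextSegment.from("Some chunk of a research paper")).content();
        System.out.println("Dimensions: " + embedding.dimension());
    }
}
```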
### OpenAI embedding API
* Good, because we delegate embedding generation to an online service, so the user's computer is free for other work
* Good, because OpenAI models typically have better performance
* Good, because JabRef's distribution size is practically unaffected
* Bad, because the user has to agree to send data to a third-party service, and an Internet connection is required
* Bad, because the user pays for embedding generation (see also [OpenAI embedding models pricing](https://platform.openai.com/docs/guides/embeddings/embedding-models))
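For comparison, a minimal sketch of the chosen option using `langchain4j`'s OpenAI integration. The model name `text-embedding-3-small` and the environment variable are illustrative; the user supplies their own API key.

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

class OpenAiEmbeddingSketch {
    public static void main(String[] args) {
        // Embeddings are generated remotely: each request needs an Internet connection
        // and is billed to the user's OpenAI account.
        EmbeddingModel model = OpenAiEmbeddingModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("text-embedding-3-small")
                .build();
        Embedding embedding = model.embed(TextSegment.from("Some chunk of a research paper")).content();
        System.out.println("Dimensions: " + embedding.dimension());
    }
}
```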