---
nav_order: 0035
parent: Decision Records
---
# Generate Embeddings Online
## Context and Problem Statement
To run a question-and-answering (Q&A) session over research papers with a
large language model (LLM), we need to process each file: convert it to a
string, split the string into chunks, and generate an embedding vector for
each chunk.

Where should these embeddings be generated?
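The pipeline above (file → string → chunks → one embedding per chunk) can be sketched in plain Java. The fixed-size splitter below, including the chunk size and overlap, is an illustrative assumption; a real implementation would use a token- and structure-aware splitter such as the ones shipped with `langchain4j`.

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {

    // Splits a text into fixed-size chunks with a small overlap, so that a
    // sentence cut at a chunk boundary still appears whole in the next chunk.
    static List<String> split(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stands in for the text extracted from a paper.
        String paper = "abcdefghijklmnopqrstuvwxy";
        List<String> chunks = split(paper, 10, 2);
        // Each chunk would then be sent to an embedding model.
        System.out.println(chunks); // [abcdefghij, ijklmnopqr, qrstuvwxy]
    }
}
```

Note how each chunk repeats the last two characters of the previous one; this overlap is what preserves context across boundaries.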
## Considered Options
* Local embedding model with `langchain4j`
* OpenAI embedding API
## Decision Drivers
* Embedding generation should be fast
* Embeddings should have good performance, meaning they "catch the semantics" well (see also [MTEB](https://huggingface.co/blog/mteb))
* Generating embeddings should be cheap
* Embeddings should not be large
* The embedding models and the library used to generate them should not add much to the distribution binary
## Decision Outcome
Chosen option: "OpenAI embedding API", because the distribution size of JabRef
is nearly unaffected. It is also fast and performs better than
`all-MiniLM-L6-v2`, the model available in `langchain4j`.
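With this option, generating an embedding is one HTTPS call per chunk. A minimal sketch with the JDK's `HttpClient` types follows; the endpoint, model name, and request shape come from OpenAI's embeddings API, while the API key handling is an assumption, and a real implementation would use a JSON library (or the `langchain4j` OpenAI client) instead of hand-building the body.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class EmbeddingRequest {

    // Builds the POST request for OpenAI's embeddings endpoint.
    // "text-embedding-3-small" is one of OpenAI's embedding models; the
    // caller is responsible for JSON-escaping the chunk text (a real
    // implementation would serialize with a JSON library).
    static HttpRequest build(String chunk, String apiKey) {
        String body = "{\"model\": \"text-embedding-3-small\", \"input\": \"" + chunk + "\"}";
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/embeddings"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("chunk text", System.getenv("OPENAI_API_KEY"));
        // Sending this with HttpClient would return a JSON response
        // containing one float vector per input chunk.
        System.out.println(request.uri()); // https://api.openai.com/v1/embeddings
    }
}
```

This is also where the trade-offs below become visible: the request leaves the user's machine, requires a network connection, and is billed per token.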
## Pros and Cons of the Options
### Local embedding model with `langchain4j`
* Good, because it works locally: privacy is preserved and no Internet connection is required
* Good, because the user does not pay for anything
* Neutral, because the speed of embedding generation depends on the chosen model: it may be small and fast, or big and slow
* Neutral, because local embedding models may perform worse than OpenAI's (most embedding models suitable for use in JabRef reach only about 50% of that performance)
* Bad, because embedding generation consumes the user's computing resources
* Bad, because the only framework to run embedding models in Java is ONNX, which adds considerable weight to the distribution binary
### OpenAI embedding API
* Good, because the task of generating embeddings is delegated to an online service, leaving the user's computer free for other work
* Good, because OpenAI models typically perform better
* Good, because the JabRef distribution size is practically unaffected
* Bad, because the user must agree to send data to a third-party service, and an Internet connection is required
* Bad, because the user pays for embedding generation (see also [OpenAI embedding models pricing](https://platform.openai.com/docs/guides/embeddings/embedding-models))