Omit the previous command ("binary" in the spec) as the vLLM images use the
entrypoint to run vLLM.
Fix the --served-model-name option.
It referred to a non-existent model.model_name ctx variable; use model.alias instead.
Use precisely the same model name from the chat client, since vLLM will exit when it
does not match the served-model-name.
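For illustration only, a minimal sketch of the name-matching requirement (the model
name, port, and paths below are assumptions, not values from this change):

    # Server side (illustrative): vLLM started with an explicit served model name.
    #   vllm serve /path/to/model --served-model-name granite-alias
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    resp = client.chat.completions.create(
        model="granite-alias",  # must match --served-model-name exactly
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)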
Signed-off-by: Oliver Walsh <owalsh@redhat.com>
rag_framework is now a proxy that enriches requests to the LLM with RAG context. Run it in a separate
container and send requests from the chat interface to the RAG proxy.
Generate the rag_framework command using CommandFactory.
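As a rough illustration of the proxy pattern (endpoint paths, service names, and the
retrieval helper are assumptions, not rag_framework's actual code):

    # Hypothetical sketch: enrich the chat request with retrieved context,
    # then forward it to the backing LLM.
    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()
    LLM_URL = "http://llm:8080/v1/chat/completions"  # assumed backend address

    def retrieve_context(question: str) -> str:
        # Placeholder for the vector-store lookup performed by the RAG proxy.
        return "relevant documentation snippets for: " + question

    @app.post("/v1/chat/completions")
    async def chat(request: Request):
        body = await request.json()
        question = body["messages"][-1]["content"]
        body["messages"].insert(0, {
            "role": "system",
            "content": "Use this context when answering:\n" + retrieve_context(question),
        })
        async with httpx.AsyncClient() as client:
            resp = await client.post(LLM_URL, json=body)
        return resp.json()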
Signed-off-by: Mike Bonnet <mikeb@redhat.com>
Update doc2rag and rag_framework to load models from the local filesystem only, avoiding
unnecessary round-trips to external repos.
Convert rag_framework to use async vector db clients.
Pass the --debug option through to the scripts.
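A hedged sketch of the two ideas, assuming a Qdrant store and a sentence-transformers
embedder (paths, collection name, and backend choices are assumptions, not the code
changed here):

    # Load the embedding model strictly from the local filesystem and query the
    # vector database with an async client.
    import os
    from qdrant_client import AsyncQdrantClient
    from sentence_transformers import SentenceTransformer

    os.environ["HF_HUB_OFFLINE"] = "1"                   # no round-trips to external repos
    embedder = SentenceTransformer("/models/embedder")   # local path only

    async def top_hits(question: str, k: int = 5):
        client = AsyncQdrantClient(url="http://localhost:6333")
        vector = embedder.encode(question).tolist()
        return await client.search(collection_name="docs", query_vector=vector, limit=k)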
Signed-off-by: Mike Bonnet <mikeb@redhat.com>
Move model conversion and quantization from the build process into
separate runtime operations, and build the results into a container,
simplifying the process.
Use Engine and BuildEngine to handle the container manager operations
and reduce direct command execution.
Use the inference spec to define the "convert" and "quantize" interfaces.
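Purely illustrative sketch of the runtime flow; the image names, paths, and the
quantization tool invocation are assumptions and not the code added here:

    # Run the conversion/quantization step in a container at runtime, then build
    # the result into its own image.
    import subprocess

    def quantize_and_package(model_dir: str, tag: str) -> None:
        subprocess.run(
            ["podman", "run", "--rm", "-v", f"{model_dir}:/models",
             "quay.io/example/converter:latest",
             "llama-quantize", "/models/model.gguf", "/models/model-q4.gguf", "Q4_K_M"],
            check=True,
        )
        subprocess.run(["podman", "build", "-t", tag, model_dir], check=True)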
Signed-off-by: Mike Bonnet <mikeb@redhat.com>
Use the CommandFactory to generate the doc2rag command-line, and create a subclass of Engine
to handle executing it in a container.
Add a dedicated mapping of env vars to rag images in the config, and use that to select the
rag image in the cli, making image selection more consistent.
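Hypothetical illustration of the selection logic (the image names and the exact set of
environment variables are assumptions): a config-level mapping from accelerator env
vars to RAG images, consulted by the CLI when picking the image to run.

    import os

    RAG_IMAGES = {
        "CUDA_VISIBLE_DEVICES": "quay.io/example/ramalama-rag-cuda",
        "HIP_VISIBLE_DEVICES": "quay.io/example/ramalama-rag-rocm",
    }
    DEFAULT_RAG_IMAGE = "quay.io/example/ramalama-rag"

    def select_rag_image() -> str:
        for env_var, image in RAG_IMAGES.items():
            if env_var in os.environ:
                return image
        return DEFAULT_RAG_IMAGE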
Signed-off-by: Mike Bonnet <mikeb@redhat.com>
Relates to: https://github.com/containers/ramalama/pull/1982
Previously, the --max-tokens param was handled by the daemon's internal command
factory. With the introduction of the spec, that command factory has been replaced
and the --max-tokens option is now added to the llama.cpp spec.
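A hedged sketch of the idea, not the spec's actual structure (the helper name is an
assumption; --n-predict is llama.cpp's server flag for limiting generated tokens):

    def llama_cpp_args(max_tokens: int | None) -> list[str]:
        args = ["llama-server", "--host", "0.0.0.0"]
        if max_tokens:
            args += ["--n-predict", str(max_tokens)]  # llama.cpp equivalent of --max-tokens
        return args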
Signed-off-by: Michael Engel <mengel@redhat.com>