# Custom LLM Inference Runtime
The platform's LLM inference service currently supports three built-in runtimes: vLLM, SGLang, and Image Generation. To support a wider range of scenarios, the platform also lets you define custom runtimes: you can create new runtime types to fit your business needs and configure their startup scripts, runtime parameters, and other related settings, enabling more flexible inference service deployments.

You can follow the steps below to configure and use Hydra's custom runtime feature.
## Add a Custom Runtime to the Global Service Cluster
1. Go to the global service cluster, click **Configuration & Secrets -> Configurations**, and search for `hydra-runtime-template`. Its `config.yaml` file is used to define runtime templates.

2. Click **Edit YAML** and modify the `config.yaml` file to add a custom runtime.

The structure of `config.yaml` is as follows:

```yaml
templates:
  - runtime: string            # Runtime type in English (e.g., vllm), required
    runtimeZH: string          # Runtime type in Chinese
    podTemplate:               # Pod template definition
      initContainers: []       # Init containers, same as k8s resource definitions
      podSecurityContext: {}   # Pod security context, same as k8s resource definitions
      volumes: []              # Volume definitions, same as k8s resource definitions
      containerTemplate:       # Container template
        commandTemplate:       # Startup command template, array, using Go template syntax
          - ""
        argsTemplate:          # Startup args template, array, using Go template syntax
          - ""
        volumeMounts: []       # Volume mounts, same as k8s resource definitions
        ports: []              # Ports, same as k8s resource definitions
        securityContext: {}    # Container security context, same as k8s resource definitions
```

**Example: add a `vllm-cpu` runtime template**
```yaml
templates:
  - runtime: vllm-cpu
    runtimeZH: vLLM
    podTemplate:
      containerTemplate:
        commandTemplate:
          - "/bin/bash"
          - "-c"
        argsTemplate:
          - |-
            {{- if .IS_DISTRIBUTED -}}
            {{- if .IS_LEADER -}}
            ray start --head --port={{ .RAY_PORT }} && vllm serve {{ .MODEL_PATH }} --served-model-name {{ .MODEL_NAME }} --trust-remote-code --tensor-parallel-size={{ .TP_SIZE }} --pipeline-parallel-size={{ .PP_SIZE }}
            {{- else -}}
            ray start --block --address=$(LWS_LEADER_ADDRESS):{{ .RAY_PORT }}
            {{- end -}}
            {{- else -}}
            vllm serve {{ .MODEL_PATH }} --served-model-name {{ .MODEL_NAME }} --trust-remote-code
            {{- if gt .TP_SIZE 1 }} --tensor-parallel-size {{ .TP_SIZE }}
            {{- end -}}
            {{- end -}}
            {{- if .CUSTOM_ARGS -}}
            {{ range .CUSTOM_ARGS }} {{ . }}
            {{- end -}}
            {{- end -}}
```
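In the distributed branch, the leader pod starts a Ray head node and launches `vllm serve` with tensor and pipeline parallelism, while each worker pod joins the Ray cluster at `$(LWS_LEADER_ADDRESS)`, an environment variable injected by LeaderWorkerSet. The non-distributed branch runs `vllm serve` directly and only adds `--tensor-parallel-size` when `TP_SIZE` is greater than 1.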
**Available Variable Description**

Set the variables in the template according to your actual needs. Available variables include:

- `RUN_TIME`: Runtime type, such as `vllm`, `image-gen`, `sglang`, `mindie`, etc.
- `MODEL_NAME`: Model name, e.g., the value of vLLM's `--served-model-name` parameter
- `IS_DISTRIBUTED`: Whether distributed deployment is enabled
- `MODEL_PATH`: Model path, currently fixed as `/data/serving-model`
- `IS_LEADER`: Whether this is the leader node in a distributed setup
- `TP_SIZE`: Tensor parallel size
- `PP_SIZE`: Pipeline parallel size
- `MODEL_ID`: Model ID
- `CLUSTER`: Deployment cluster
- `NAMESPACE`: Deployment namespace
- `MODEL_HOST`: Model serving host address, currently fixed at `0.0.0.0`
- `MODEL_PORT`: Model serving port, currently fixed at `8000`
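To make the substitution concrete, take the non-distributed branch of the `vllm-cpu` template above. With hypothetical values `MODEL_NAME=qwen2-7b`, `TP_SIZE=1`, `IS_DISTRIBUTED=false`, and `CUSTOM_ARGS=["--max-model-len=4096"]`, the template renders to roughly:

```
vllm serve /data/serving-model --served-model-name qwen2-7b --trust-remote-code --max-model-len=4096
```

Here `MODEL_PATH` resolves to the fixed `/data/serving-model`, and the `--tensor-parallel-size` flag is omitted because `TP_SIZE` is not greater than 1.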
**CUSTOM_ARGS Parameter Description**

To reduce configuration complexity, parameters configured in the deployment template and in `hydra-agent` are passed into the template as a single `CUSTOM_ARGS` variable. The recommended approach is as follows (see the sketch after this list):

- Define only the startup command in `commandTemplate`, such as `"/bin/bash -c"`; it will be rendered into the container's `command`. Add `{{- if .CUSTOM_ARGS -}} {{ range .CUSTOM_ARGS }} {{ . }} {{- end -}} {{- end -}}` at the end of `argsTemplate` to ensure custom parameters are correctly rendered.
- If no template is defined, the parameters configured in `model_deployment` and `hydra-agent` are used as the container's `args`.
- If `commandTemplate` is not defined, note that `argsTemplate` should be an array of multiple entries rather than a single-line args entry, and each argument should use an equals sign to join its value (e.g., `--model={{ .MODEL_PATH }}`).
- The `/bin/bash -c` command form requires `args` to be a single-line string, so in that case `CUSTOM_ARGS` must be added to the template explicitly.
- If no command is specified and `args` is a multi-line array, `CUSTOM_ARGS` does not need to be added manually; it is automatically appended during final rendering to avoid argument parsing failures.
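As a side-by-side sketch of the two patterns (the `my-server` binary, its flags, and the runtime names below are hypothetical placeholders, not built-in runtimes):

```yaml
templates:
  # Pattern 1: explicit command with a single-line args string; CUSTOM_ARGS
  # must be appended inside the template itself.
  - runtime: my-runtime-a              # hypothetical runtime name
    podTemplate:
      containerTemplate:
        commandTemplate:
          - "/bin/bash"
          - "-c"
        argsTemplate:
          - |-
            my-server --model={{ .MODEL_PATH }} --host={{ .MODEL_HOST }} --port={{ .MODEL_PORT }}
            {{- if .CUSTOM_ARGS -}}
            {{ range .CUSTOM_ARGS }} {{ . }}
            {{- end -}}
            {{- end -}}

  # Pattern 2: no commandTemplate; args is a multi-entry array joined with
  # equals signs, and CUSTOM_ARGS is appended automatically at final rendering.
  - runtime: my-runtime-b              # hypothetical runtime name
    podTemplate:
      containerTemplate:
        argsTemplate:
          - "--model={{ .MODEL_PATH }}"
          - "--host={{ .MODEL_HOST }}"
          - "--port={{ .MODEL_PORT }}"
```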
## Add Image Information in the Work Cluster
1. Go to the work cluster, click **Configuration & Secrets -> Configurations**, search for `hydra-agent`, and add the custom runtime's image information and compute-capability matching rules in `configmap/deployment_templates`.

2. Click **Edit YAML** and modify the `configmap` file to add image information for the custom runtime.

**YAML Example**
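The exact fields in `deployment_templates` depend on your platform installation and are not reproduced here. Purely as an illustration of the shape such an entry might take, assuming hypothetical field names (`image`, `matchRules`, and `gpuProduct` are not the actual `hydra-agent` schema; copy the structure of an existing entry in your cluster instead):

```yaml
deployment_templates:
  - runtime: vllm-cpu                # must match the runtime added in hydra-runtime-template
    image: registry.example.com/inference/vllm-cpu:latest  # hypothetical image reference
    matchRules:                      # hypothetical compute-capability matching rule
      gpuProduct: ""                 # e.g., match CPU-only nodes
```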
## Start Using It
After completing the above steps, the operations administrator can select the custom runtime in the deployment configuration and set up model deployment information.