microagent model
Last updated: 2026-06-27
    microagent model pull <hf-ref> [--token <t>] [--state-dir <dir>]
        Download a GGUF model
    microagent model list [--state-dir <dir>]
        List stored models
    microagent model delete <ref> [--keep-files] [--state-dir <dir>]
        Remove a model and its blob
    microagent model prune [--delete-files] [--state-dir <dir>]
        Drop records for missing blobs
    microagent model serve <hf-ref> [--dedicated] [--runner <llamacpp|vllm|custom>] [--runner-gpu <off|on|auto>] [--runner-model <id>] [--runner-served-model <name>] [--runner-command <template>] [--runner-name <name>] [--runner-health-path <path>] [--runner-arg <arg>] [--runner-env KEY=VALUE] [--token <t>] [--state-dir <dir>]
        Serve a model on the host
    microagent model serve <hf-ref> [--dedicated] [--runner <llamacpp|vllm|custom>] [--runner-gpu <off|on|auto>] [--runner-model <id>] [--runner-served-model <name>] [--runner-command <template>] [--runner-name <name>] [--runner-health-path <path>] [--runner-arg <arg>] [--runner-env KEY=VALUE] [--token <t>] [--state-dir <dir>]
        Alias for model serve
    microagent model stop <hf-ref> [--state-dir <dir>]
        Stop a model's runners
    microagent model runners [--state-dir <dir>]
        List running model servers
    microagent model policy validate <policy.json>
        Validate a mediation policy file
    microagent model policy evaluate <policy.json> [options]
        Dry-run a policy file against structured request metadata

model manages a local content-addressed store of GGUF model files and the
host model server processes that serve them. Downloaded blobs are stored under
~/.microagent/models/ by default, indexed by the HuggingFace reference used
to pull them. All subcommands read and write this index; no remote state is
modified by the store commands. The server commands (serve, stop,
runners) manage long-running host model runner processes. The built-in
default runner is llama-server, but the runner command is configurable. Pair
a workspace with a served model using run --model for one-shots
or create --model for a persistent pairing that every
start re-establishes.
Examples
Download and manage stored models:

    # Download a public model
    microagent model pull TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

    # Download a gated model with an explicit token
    microagent model pull hf.co/meta-llama/Llama-2-7B/llama-2-7b.gguf --token hf_xxxxx

    # List stored models
    microagent model list

    # Remove a model and delete its blob
    microagent model delete TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

    # Remove records for missing blobs (safe; no files deleted)
    microagent model prune

Serve a model and manage the runners:

    # Start a shared pinned model server (auto-pulls if not stored)
    microagent model serve TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

    # Start a dedicated runner (exclusive to this caller)
    microagent model serve TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf --dedicated

    # Pass host runner arguments to the selected runner
    microagent model serve TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf \
      --runner-arg -ngl --runner-arg all

    # Use a custom OpenAI-compatible host runner command template
    microagent model serve TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf \
      --runner-command 'runner serve {model} --host {host} --port {port}' \
      --runner-name runner

    # List running model servers
    microagent model runners

    # Stop all runners for a model
    microagent model stop TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

model list prints one tab-separated row per recorded model (no header):

    hf.co/TheBloke/Llama-2-7B-GGUF@main/llama-2-7b.Q4_K_M.gguf 3825819648 sha256:abc...

With the global --json flag, records are returned under models:

    {
      "models": [
        {
          "model_ref": "hf.co/TheBloke/Llama-2-7B-GGUF@main/llama-2-7b.Q4_K_M.gguf",
          "resolved_ref": "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf",
          "digest": "sha256:abc...",
          "size_bytes": 3825819648,
          "output_path": "/home/user/.microagent/models/blobs/a1b2c3d4e5f6a7b8c9d0e1f2.gguf",
          "last_used_at": "2026-06-01T12:00:00Z"
        }
      ]
    }

Commands
| Command | Description |
|---|---|
| pull | Download a GGUF file from HuggingFace and record it in the local store |
| list | List locally stored model records (ref, size, digest) |
| delete | Remove a model record, and optionally delete its blob |
| prune | Remove records whose blob files are missing; with --delete-files, also delete all indexed blob files |
| serve | Start (or reuse) a pinned host model server for a model; auto-pulls if not yet stored |
| stop | Force-stop all model server processes for a model ref |
| runners | List currently running model server processes |
| policy validate | Validate a structured model mediation policy file |
| policy evaluate | Dry-run a policy file against structured request metadata |

serve supports three runner backends. llamacpp is the default and uses
llama-server with CPU execution unless GPU use is explicitly requested.
vllm starts vLLM’s OpenAI-compatible API server and requires
--runner-model <hf-model-id> plus MICROAGENT_VLLM_PYTHON when vLLM is not
available through python3. custom runs an operator-supplied
OpenAI-compatible command template. If the model is not yet in the local GGUF
store, serve pulls it automatically before starting the server. The runner is
started pinned, so it stays alive even when no workspace holds it.
microagent starts the default llama.cpp runner with --device none --gpu-layers 0. Pointing MICROAGENT_LLAMA_SERVER at a CUDA-enabled binary is
therefore not enough to opt into GPU use. Pass --runner-gpu on, --runner-gpu auto, or explicit llama.cpp GPU args such as --runner-arg -ngl --runner-arg all.
Use the named vLLM backend when the host should run vLLM rather than llama.cpp:

    MICROAGENT_VLLM_PYTHON=../vllm/.venv/bin/python \
      microagent model serve org/stub/stub.gguf \
      --runner vllm \
      --runner-model Qwen/Qwen2.5-0.5B-Instruct \
      --runner-served-model local-chat \
      --runner-arg --max-model-len --runner-arg 2048

The <hf-ref> argument remains the microagent model-store pairing ref. vLLM
loads the backend model named by --runner-model, and clients send the
--runner-served-model value when it is set.
Set MICROAGENT_MODEL_RUNNER_COMMAND or pass --runner-command to use a custom
OpenAI-compatible host runner. The command is parsed as argv fields, not shell
evaluated. It must include {model} and either {port} or {addr}; {host}
is also available. The environment variable accepts shell-like fields or a JSON
string array:

    MICROAGENT_MODEL_RUNNER_COMMAND='runner serve {model} --host {host} --port {port}' \
      MICROAGENT_MODEL_RUNNER_NAME=runner \
      microagent start research

    MICROAGENT_MODEL_RUNNER_COMMAND='["runner","serve","{model}","--listen","{addr}"]' \
      microagent model serve TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

Use MICROAGENT_MODEL_RUNNER_NAME and
MICROAGENT_MODEL_RUNNER_HEALTH_PATH to label a custom runner and change its
readiness probe path; the matching one-shot flags are --runner-name and
--runner-health-path.
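
The template rules above (argv fields, required placeholders, no shell evaluation) can be illustrated with a short Python sketch. This is not microagent's implementation: `parse_fields` and `expand_template` are hypothetical names, "shell-like fields" is assumed to mean POSIX-style word splitting, `{addr}` is assumed to expand to `host:port`, and the value substituted for `{model}` is a placeholder path.

```python
import json
import re
import shlex

# Placeholders named in the text above.
PLACEHOLDER = re.compile(r"\{(model|host|port|addr)\}")


def parse_fields(raw: str) -> list[str]:
    """Split a value into argv fields: JSON string array or shell-like words."""
    text = raw.strip()
    if text.startswith("["):
        fields = json.loads(text)
        if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
            raise ValueError("JSON form must be an array of strings")
        return fields
    # Assumption: shell-like splitting follows POSIX quoting rules.
    return shlex.split(text)


def expand_template(raw: str, model: str, host: str, port: int) -> list[str]:
    """Substitute placeholders field by field; nothing is handed to a shell."""
    fields = parse_fields(raw)
    present = {name for f in fields for name in PLACEHOLDER.findall(f)}
    if "model" not in present:
        raise ValueError("template must include {model}")
    if not present & {"port", "addr"}:
        raise ValueError("template must include {port} or {addr}")
    values = {
        "model": model,
        "host": host,
        "port": str(port),
        "addr": f"{host}:{port}",  # assumption: addr is host:port
    }
    # Single-pass substitution, so substituted values are never re-expanded.
    return [PLACEHOLDER.sub(lambda m: values[m.group(1)], f) for f in fields]


print(expand_template(
    "runner serve {model} --host {host} --port {port}",
    model="/models/blob.gguf", host="127.0.0.1", port=8080,
))
```

Because each field is substituted separately, a model path containing spaces stays a single argv entry, which is the practical benefit of not evaluating the command through a shell.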
Runner arguments are opaque host-runner configuration. Use repeatable
--runner-arg flags for a single model serve invocation, or set
MICROAGENT_MODEL_RUNNER_ARGS to apply defaults to any model runner that
microagent starts, including workspace run --model, create --model, and
later start re-pairing. The environment variable accepts shell-like fields
or a JSON string array:

    MICROAGENT_MODEL_RUNNER_ARGS='-ngl all' microagent model serve \
      TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf

    MICROAGENT_MODEL_RUNNER_ARGS='["-ngl","all"]' microagent start research

Use repeatable --runner-env KEY=VALUE flags for a single model serve
invocation, or MICROAGENT_MODEL_RUNNER_ENV for defaults. The environment
variable accepts shell-like KEY=VALUE fields, a JSON string array, or a JSON
object. Runner env keys, not values, are recorded in the runner registry.
Workspace --model-runner-env values apply only to the current run,
create, or start invocation and are not persisted in workspace manifests.
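
The three accepted shapes of MICROAGENT_MODEL_RUNNER_ENV can be sketched in Python as follows. The function names are hypothetical, the splitting rules are an assumption (POSIX-style words for the shell-like form), and `recorded_keys` only illustrates the "keys, not values" rule; it does not reflect the real registry format.

```python
import json
import shlex


def parse_runner_env(raw: str) -> dict[str, str]:
    """Parse shell-like KEY=VALUE fields, a JSON string array, or a JSON object."""
    text = raw.strip()
    if text.startswith("{"):
        obj = json.loads(text)
        if not all(isinstance(v, str) for v in obj.values()):
            raise ValueError("JSON object values must be strings")  # assumption
        return dict(obj)
    # JSON array of "KEY=VALUE" strings, otherwise shell-like fields.
    pairs = json.loads(text) if text.startswith("[") else shlex.split(text)
    env: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        env[key] = value
    return env


def recorded_keys(env: dict[str, str]) -> list[str]:
    """What a registry entry could keep: variable names only, never values."""
    return sorted(env)


env = parse_runner_env("CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4")
print(env)
print(recorded_keys(env))
```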
Workspaces hold runners. run --model holds one for the duration of the run.
A workspace created with create --model re-pairs on every start and holds
until halt, stop, kill, or delete releases it; a guest that exits on
its own keeps its hold until the next lifecycle verb. An unpinned runner stops
when its last holder releases; a pinned one (model serve) stays up.
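
The hold semantics can be modelled as a small reference count. The sketch below is an illustration only, assuming a hypothetical `Runner` class and string holder names; it is not how microagent stores runner state.

```python
from dataclasses import dataclass, field


@dataclass
class Runner:
    """Toy model of a host runner and the workspaces holding it."""
    model_ref: str
    pinned: bool = False          # True for runners started by `model serve`
    holders: set[str] = field(default_factory=set)
    running: bool = True

    def hold(self, holder: str) -> None:
        self.holders.add(holder)

    def release(self, holder: str) -> None:
        self.holders.discard(holder)
        # An unpinned runner stops with its last holder; a pinned one stays up.
        if not self.holders and not self.pinned:
            self.running = False


shared = Runner("org/repo/model.gguf")
shared.hold("workspace-a")
shared.hold("workspace-b")
shared.release("workspace-a")
print(shared.running)   # still held by workspace-b
shared.release("workspace-b")
print(shared.running)   # last holder released, runner stops
```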
Runner backend, GPU intent, command template, args, and model mediation config
from create --model are persisted so start and supervise replay the same
pairing. start accepts the same --model-runner* and --model-mediation*
flags as overrides for a single boot.
When an existing workspace attaches or releases a model runner, microagent
appends model_worker=attached and model_worker=released markers to the
workspace’s events history. These markers record the model
ref, holder, runner engine, process ID, and runner config digest for tracing.
Runner environment values are not recorded.
stop force-stops all model server processes for the given ref, whether or not
the runner is pinned, and removes their entries from the runner index.
Use it to reclaim a runner whose workspace exited without a lifecycle verb.
runners self-heals the registry: any listed process that is no longer alive
is silently removed before the list is printed.
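
One common way to implement this kind of self-healing on POSIX hosts is a signal-0 liveness probe. The sketch below assumes hypothetical registry entries shaped as dictionaries with a `pid` key; how microagent actually checks liveness is not stated here.

```python
import os
from typing import Callable


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def prune_dead(entries: list[dict], is_alive: Callable[[int], bool] = pid_alive) -> list[dict]:
    """Return only the entries whose process is still alive."""
    return [entry for entry in entries if is_alive(entry["pid"])]


registry = [{"model_ref": "org/repo/model.gguf", "pid": os.getpid()}]
print(prune_dead(registry))
```

The probe is injectable so the pruning logic can be exercised without real processes. Note that `os.kill(pid, 0)` has this meaning on POSIX systems only.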
Model mediation policy files
Set MICROAGENT_MODEL_MEDIATION=policy to require a policy source for the
host-worker mediator. The source can be either an external decision endpoint
with MICROAGENT_MODEL_POLICY_URL=http://127.0.0.1:9000/decision, or a local
structured policy file with MICROAGENT_MODEL_POLICY_FILE=/path/to/policy.json.
The two sources are mutually exclusive. File policies fail closed by default
and inspect only structured request metadata and aggregate body counts; prompt
text is not written into mediation audit logs.

    {
      "schema_version": "microagent.model_policy.v1",
      "default": "deny",
      "rules": [
        {
          "id": "small-chat",
          "effect": "allow",
          "match": {
            "methods": ["POST"],
            "paths": ["/v1/chat/completions"],
            "models": ["local-model"]
          },
          "limits": {
            "max_request_bytes": 32768,
            "max_text_bytes": 4096,
            "max_messages": 16,
            "max_tokens": 512,
            "stream": false,
            "allowed_tool_names": ["shell", "read_file"]
          }
        }
      ]
    }

The file policy match fields are workspace_ids, capabilities, worker_ids,
methods, paths, and models; empty match fields are wildcards. paths
matches the request path received by the mediator. For the default
MICROAGENT_MODEL_URL exposed inside a workspace, that path includes /v1.
Limit fields are max_request_bytes, max_text_bytes, max_messages,
max_tokens, stream, and allowed_tool_names. If an allow rule matches but
a limit fails, the request is denied rather than falling through to a later
rule.
File policy can mediate the request method, path, workspace/capability/worker identity, declared model, declared stream mode, declared token cap, declared tool/function names, request bytes, message count, and aggregate text bytes. It does not inspect prompt meaning, response content, semantic tool intent, quotas, trust scores, billing rules, or user/business authorization. Use the external policy URL path when those decisions need a policy service; microagent still owns the fail-closed host enforcement around that decision.
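
The matching rules described above (empty match fields as wildcards, a failed limit denying without fall-through, and a fail-closed default) can be sketched in Python. This is an illustrative model, not the mediator's code. The request keys (`method`, `text_bytes`, `tools`, and so on) are hypothetical names, and three behaviours are assumptions: rules are tried in file order, a missing numeric value counts as 0, and `"stream": false` forbids streaming requests.

```python
MATCH_FIELDS = {  # policy match field -> hypothetical request key
    "workspace_ids": "workspace_id", "capabilities": "capability",
    "worker_ids": "worker_id", "methods": "method",
    "paths": "path", "models": "model",
}
MAX_LIMITS = {  # policy limit -> hypothetical request key
    "max_request_bytes": "request_bytes", "max_text_bytes": "text_bytes",
    "max_messages": "messages", "max_tokens": "max_tokens",
}


def rule_matches(rule: dict, req: dict) -> bool:
    match = rule.get("match", {})
    for field, key in MATCH_FIELDS.items():
        allowed = match.get(field) or []   # empty or absent means wildcard
        if allowed and req.get(key) not in allowed:
            return False
    return True


def failed_limit(limits: dict, req: dict):
    """Return the name of the first violated limit, or None."""
    for limit, key in MAX_LIMITS.items():
        if limit in limits and req.get(key, 0) > limits[limit]:
            return limit
    if limits.get("stream") is False and req.get("stream"):
        return "stream"
    if "allowed_tool_names" in limits:
        if any(t not in limits["allowed_tool_names"] for t in req.get("tools", [])):
            return "allowed_tool_names"
    return None


def evaluate(policy: dict, req: dict) -> tuple[str, str]:
    for rule in policy.get("rules", []):
        if not rule_matches(rule, req):
            continue
        if rule.get("effect") != "allow":
            return "deny", f"rule {rule['id']}"
        limit = failed_limit(rule.get("limits", {}), req)
        if limit:  # a matching allow rule with a failed limit denies outright
            return "deny", f"rule {rule['id']}: {limit}"
        return "allow", f"rule {rule['id']}"
    return policy.get("default", "deny"), "default"  # fail closed


policy = {
    "schema_version": "microagent.model_policy.v1",
    "default": "deny",
    "rules": [{
        "id": "small-chat", "effect": "allow",
        "match": {"methods": ["POST"], "paths": ["/v1/chat/completions"],
                  "models": ["local-model"]},
        "limits": {"max_tokens": 512, "stream": False,
                   "allowed_tool_names": ["shell", "read_file"]},
    }],
}
request = {"method": "POST", "path": "/v1/chat/completions",
           "model": "local-model", "max_tokens": 128, "stream": False,
           "tools": ["shell"], "text_bytes": 512, "messages": 1}
print(evaluate(policy, request))
print(evaluate(policy, {**request, "max_tokens": 1024}))
```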
Validate a generated file before using it:

    microagent model policy validate ./model-policy.json
    microagent --json model policy validate ./model-policy.json

Dry-run a structured request without starting a model runner or VM:

    microagent model policy evaluate ./model-policy.json \
      --method POST \
      --path /v1/chat/completions \
      --model local-model \
      --max-tokens 128 \
      --stream false \
      --tool shell \
      --text-bytes 512 \
      --messages 1 \
      --expect allow

policy evaluate exits nonzero only when the policy file is invalid, the
sample metadata is invalid, or --expect does not match the evaluated
decision. A denied decision is otherwise a successful dry run and is printed as
deny with the policy reason. Use --json for automation.
HuggingFace ref forms
model pull accepts several ref forms for the <hf-ref> argument:

| Form | Example |
|---|---|
| hf.co/<org>/<repo>/<file>.gguf | hf.co/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf |
| huggingface.co/<org>/<repo>/<file>.gguf | huggingface.co/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf |
| Bare <org>/<repo>/<file>.gguf | TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf |
| Full resolve URL | https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf |

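
A minimal Python sketch of how these forms, together with the optional @<rev> pin covered in this section, could be normalized to one record. `HFRef` and `parse_hf_ref` are hypothetical names, and the parsing rules are inferred from the examples on this page (including the `hf.co/<org>/<repo>@<rev>/<file>` shape printed by model list); the real parser may accept or reject inputs differently.

```python
from dataclasses import dataclass

URL_PREFIX = "https://huggingface.co/"
HOST_PREFIXES = ("hf.co/", "huggingface.co/")


@dataclass(frozen=True)
class HFRef:
    org: str
    repo: str
    rev: str
    file: str

    @property
    def model_ref(self) -> str:
        return f"hf.co/{self.org}/{self.repo}@{self.rev}/{self.file}"

    @property
    def resolve_url(self) -> str:
        return f"{URL_PREFIX}{self.org}/{self.repo}/resolve/{self.rev}/{self.file}"


def parse_hf_ref(ref: str) -> HFRef:
    rest = ref.strip()
    is_url = rest.startswith(URL_PREFIX)
    if is_url:
        rest = rest[len(URL_PREFIX):]
    else:
        for host in HOST_PREFIXES:
            if rest.startswith(host):
                rest = rest[len(host):]
                break
    parts = rest.split("/")
    if is_url:
        # <org>/<repo>/resolve/<rev>/<file>
        if len(parts) < 5 or parts[2] != "resolve":
            raise ValueError(f"not a resolve URL: {ref!r}")
        org, repo, rev = parts[0], parts[1], parts[3]
        file = "/".join(parts[4:])
    else:
        # <org>/<repo>[@<rev>]/<file>
        if len(parts) < 3:
            raise ValueError(f"expected <org>/<repo>/<file>: {ref!r}")
        org = parts[0]
        repo, _, rev = parts[1].partition("@")
        rev = rev or "main"  # an omitted revision resolves to main
        file = "/".join(parts[2:])
    if not (org and repo and rev and file.endswith(".gguf")):
        raise ValueError(f"cannot parse ref: {ref!r}")
    return HFRef(org, repo, rev, file)


ref = parse_hf_ref("TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf")
print(ref.model_ref)
print(ref.resolve_url)
```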
An optional @<rev> suffix after the repository name pins a specific revision;
when omitted the ref resolves to main:

    TheBloke/Llama-2-7B-GGUF@abc123/llama-2-7b.Q4_K_M.gguf

Authentication
model pull authenticates to HuggingFace using a bearer token. The token is
resolved in this order:

1. --token <t> flag
2. HF_TOKEN environment variable
3. HUGGING_FACE_HUB_TOKEN environment variable

If none is set, the pull is attempted without authentication. Public models do not require a token.
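
The resolution order can be expressed as a short Python sketch. `resolve_token` and `auth_headers` are hypothetical helpers; treating an empty string as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_token(flag_token: Optional[str],
                  env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first available token: flag, HF_TOKEN, HUGGING_FACE_HUB_TOKEN."""
    candidates = (flag_token, env.get("HF_TOKEN"), env.get("HUGGING_FACE_HUB_TOKEN"))
    for candidate in candidates:
        if candidate:  # assumption: empty values count as unset
            return candidate
    return None


def auth_headers(token: Optional[str]) -> dict[str, str]:
    """Bearer header when a token exists; otherwise an unauthenticated request."""
    return {"Authorization": f"Bearer {token}"} if token else {}


token = resolve_token(None, {"HUGGING_FACE_HUB_TOKEN": "hf_xxxxx"})
print(auth_headers(token))
```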
Store location
Downloaded blobs are stored in a content-addressed layout under the state directory:

    ~/.microagent/models/
      index.json                # ref → digest mapping
      blobs/
        <24-hex-chars>.gguf     # raw GGUF file, named by first 24 hex chars of sha256(canonical-ref)

model delete removes the index record and, unless --keep-files is set, deletes
the corresponding blob. model prune removes index records whose blob files
are missing from disk. With --delete-files, it also deletes the blob file of
every remaining indexed record (i.e. all indexed blobs are deleted).
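
The blob naming rule from the layout above can be sketched in Python. `blob_path` is a hypothetical helper; the exact canonical ref string and its byte encoding are assumptions (the example uses the form that model list prints, encoded as UTF-8), so the resulting name may not match a real store.

```python
import hashlib
from pathlib import Path
from typing import Optional


def blob_path(canonical_ref: str, state_dir: Optional[Path] = None) -> Path:
    """Blob file named by the first 24 hex characters of sha256(canonical-ref)."""
    root = state_dir if state_dir is not None else Path.home() / ".microagent"
    # Assumption: the canonical ref is hashed as UTF-8 text.
    name = hashlib.sha256(canonical_ref.encode("utf-8")).hexdigest()[:24]
    return root / "models" / "blobs" / f"{name}.gguf"


print(blob_path("hf.co/TheBloke/Llama-2-7B-GGUF@main/llama-2-7b.Q4_K_M.gguf",
                state_dir=Path("/tmp/microagent-state")))
```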
Most subcommands take only --state-dir <dir> (state directory, default
~/.microagent/); the flags that change behavior are --token (pull/serve),
--keep-files (delete), --delete-files (prune), --dedicated (serve), and
the host runner flags for serve.
Pull flags
| Flag | Description |
|---|---|
| --token <t> | HuggingFace bearer token (falls back to HF_TOKEN, then HUGGING_FACE_HUB_TOKEN) |
| --state-dir <dir> | State directory (default ~/.microagent/) |

Serve flags
| Flag | Description |
|---|---|
| --dedicated | Start a dedicated runner instead of reusing a shared one |
| --runner <backend> | Runner backend: llamacpp (default), vllm, or custom |
| --runner-gpu <mode> | Runner GPU intent: off (llama.cpp default), on, or auto |
| --runner-model <id> | Backend model id for runners such as vLLM |
| --runner-served-model <name> | OpenAI-compatible served model name for runners such as vLLM |
| --runner-command <template> | Custom host model runner command template |
| --runner-name <name> | Name to record for a custom host model runner |
| --runner-health-path <path> | HTTP health path for a custom host model runner |
| --runner-arg <arg> | Extra host model runner argument. Repeat for multiple argv entries |
| --runner-env KEY=VALUE | Extra host model runner environment override. Repeat for multiple variables |
| --token <t> | HuggingFace bearer token used if the model must be auto-pulled |
| --state-dir <dir> | State directory (default ~/.microagent/) |

Delete flags
| Flag | Description |
|---|---|
| --keep-files | Remove the index record but keep the blob file on disk |
| --state-dir <dir> | State directory (default ~/.microagent/) |

Prune flags
| Flag | Description |
|---|---|
| --delete-files | Also delete the blob files of all indexed models (not just orphaned/missing ones) |
| --state-dir <dir> | State directory (default ~/.microagent/) |

See global flags for --json/--text/--output/--mode.
Exit status
model subcommands exit 0 on success; nonzero when a ref cannot be parsed, a
download or authentication fails, a record is not found, or serve cannot
start the selected host model runner. In AX mode a failure is written as a
structured error envelope.