Compare commits

..

2 Commits

Author SHA1 Message Date
Michal
a84214dad1 fix(cli): status probe accepts reasoning_content for thinking models
Some checks failed
CI/CD / typecheck (pull_request) Successful in 56s
CI/CD / lint (pull_request) Successful in 3m6s
CI/CD / test (pull_request) Successful in 1m9s
CI/CD / build (pull_request) Successful in 2m39s
CI/CD / smoke (pull_request) Failing after 3m58s
CI/CD / publish (pull_request) Has been skipped
Live deploy showed qwen3-thinking failing the probe with "empty
content": at max_tokens=8 the model spent its entire budget on the
reasoning trace and never emitted a final `content` block.

Fix:
- Bump max_tokens to 64. Still caps latency at ~1-2 sec on cheap
  models but gives reasoning models enough headroom.
- If `message.content` is empty but `reasoning_content` is non-empty,
  count it as alive and prefix the preview with "[thinking]" so the
  user knows the model didn't actually answer "hi" but is responsive.
- Replace the prompt with the terser "Reply with just: hi" — closer
  to what a thinking model can short-circuit on.

Tests: existing 25 pass; the failure-path test still asserts on the
"empty content" path because reasoning_content is empty there too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 12:09:42 +01:00
54e56f7b71 feat(cli): live "say hi" probe for server LLMs in mcpctl status (#61)
Some checks failed
CI/CD / lint (push) Successful in 57s
CI/CD / typecheck (push) Successful in 57s
CI/CD / test (push) Has been cancelled
CI/CD / smoke (push) Has been cancelled
CI/CD / build (push) Has been cancelled
CI/CD / publish (push) Has been cancelled
2026-04-27 11:02:26 +00:00


@@ -227,13 +227,21 @@ function defaultFetchServerLlms(mcpdUrl: string, token: string | null): Promise<
 /**
  * POST a tiny "say hi" prompt to /api/v1/llms/<name>/infer and decide if
  * the LLM actually serves inference. Returns ok=true when the response is
- * 200 with a non-empty assistant message; otherwise ok=false with an
- * error string suitable for one-line display.
+ * 200 with non-empty content OR reasoning_content (thinking models often
+ * spend their token budget on the reasoning trace and never emit a
+ * `content` block, but they're clearly alive if reasoning came back).
+ * Otherwise ok=false with an error string suitable for one-line display.
+ *
+ * `max_tokens: 64` gives reasoning models enough headroom to emit
+ * something visible while still capping latency at ~1-2 sec on cheap
+ * models. The exact wording — "Reply with just: hi" — is more terse and
+ * closer to what a thinking model can short-circuit on without burning
+ * its entire budget on reasoning.
  */
 const PROBE_TIMEOUT_MS = 15_000;
 const PROBE_BODY = JSON.stringify({
-  messages: [{ role: 'user', content: "Say exactly the word 'hi' and nothing else." }],
-  max_tokens: 8,
+  messages: [{ role: 'user', content: 'Reply with just: hi' }],
+  max_tokens: 64,
   temperature: 0,
 });
@@ -276,20 +284,31 @@ function defaultProbeServerLlm(mcpdUrl: string, name: string, token: string | nu
         return;
       }
       let content = '';
+      let reasoning = '';
       try {
         const parsed = JSON.parse(body) as {
-          choices?: Array<{ message?: { content?: string } }>;
+          choices?: Array<{ message?: { content?: string; reasoning_content?: string } }>;
         };
-        content = parsed.choices?.[0]?.message?.content?.trim() ?? '';
+        const msg = parsed.choices?.[0]?.message;
+        content = msg?.content?.trim() ?? '';
+        reasoning = msg?.reasoning_content?.trim() ?? '';
       } catch {
         resolve({ ok: false, ms, error: 'invalid response body' });
         return;
       }
-      if (content === '') {
-        resolve({ ok: false, ms, error: 'empty content' });
+      if (content !== '') {
+        resolve({ ok: true, ms, say: content.slice(0, 16) });
         return;
       }
-      resolve({ ok: true, ms, say: content.slice(0, 16) });
+      if (reasoning !== '') {
+        // Thinking model burned its budget on the reasoning trace
+        // before emitting `content`. The LLM is alive — flag it as
+        // ok and surface a short reasoning preview so the user can
+        // tell at a glance.
+        resolve({ ok: true, ms, say: `[thinking] ${reasoning.slice(0, 12)}` });
+        return;
+      }
+      resolve({ ok: false, ms, error: 'empty content' });
     });
   });
 } catch {