Task ID: 19
Title: Implement Local LLM Pre-filtering Proxy
Status: cancelled
Dependencies: None
Priority: high
Description: Build the local proxy component that intercepts Claude's MCP requests, uses local LLMs (Gemini CLI, Ollama, vLLM, or DeepSeek API) to interpret questions, fetch relevant data from mcpd, and filter/refine responses to minimize context window usage before returning to Claude.
Details:
Create src/local-proxy/src/ with the following architecture:
Core Components:
- MCP Protocol Handler (mcp-handler.ts):
- Implement MCP server interface using @modelcontextprotocol/sdk
- Register as the MCP endpoint Claude connects to
- Parse incoming tool calls and extract the semantic intent
- LLM Provider Abstraction (providers/):

interface LLMProvider {
  name: string;
  interpretQuery(query: string, context: McpToolCall): Promise<InterpretedQuery>;
  filterResponse(data: unknown, originalQuery: string, maxTokens: number): Promise<FilteredResponse>;
}

Implement providers:
- gemini-cli.ts: shell out to the gemini CLI binary
- ollama.ts: HTTP client to the local Ollama server (localhost:11434)
- vllm.ts: OpenAI-compatible API client
- deepseek.ts: DeepSeek API client
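The Ollama provider could take the following shape. This is a minimal sketch assuming Ollama's `/api/generate` endpoint on the default port, Node 18+ (global `fetch`), and an illustrative `llama3.2` model; the prompt wording mirrors the filter pseudo-code later in this task, and the helper names are not a fixed API.

```typescript
// providers/ollama.ts (sketch) -- one possible provider implementation.

interface FilteredResponse { data: unknown; filtered: boolean; originalSize?: number; }

// Pure prompt builder, kept separate so it can be unit-tested without HTTP.
function buildFilterPrompt(data: string, originalQuery: string): string {
  return [
    `Given this query: "${originalQuery}"`,
    "Extract ONLY the relevant information from this data.",
    "Return a JSON array of relevant items, max 10 items.",
    `Data: ${data}`,
  ].join("\n");
}

async function ollamaGenerate(prompt: string, model = "llama3.2", maxTokens = 2000): Promise<string> {
  // Ollama's generate endpoint; stream: false returns a single JSON object
  // whose `response` field holds the completion. num_predict caps output tokens.
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false, options: { num_predict: maxTokens } }),
  });
  if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
  const body = (await res.json()) as { response: string };
  return body.response;
}
```

The gemini-cli, vllm, and deepseek providers would differ only in transport (child process vs. OpenAI-compatible HTTP), which is why the prompt construction is factored out.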
- Query Interpreter (interpreter.ts):
- Takes Claude's raw MCP request (e.g., 'get_slack_messages')
- Uses local LLM to understand semantic intent: "Find messages related to security and linux servers from my team"
- Generates optimized query parameters for mcpd
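The interpreter step could be sketched as follows. The prompt text and the `InterpretedQuery` shape are assumptions for illustration; parsing is defensive because small local models often wrap JSON output in prose, and bad output falls back to passing the original call through unmodified.

```typescript
// interpreter.ts (sketch) -- turn a raw tool call into optimized mcpd parameters.

interface InterpretedQuery { tool: string; params: Record<string, unknown>; }

function buildInterpretPrompt(tool: string, args: Record<string, unknown>): string {
  return `Claude called tool "${tool}" with arguments ${JSON.stringify(args)}. ` +
    `Return JSON of the form {"tool": string, "params": object} with optimized query parameters.`;
}

function parseInterpretation(raw: string, fallbackTool: string): InterpretedQuery {
  // Extract the first {...} block in case the model added surrounding text.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return { tool: fallbackTool, params: {} };
  try {
    const parsed = JSON.parse(match[0]);
    return { tool: parsed.tool ?? fallbackTool, params: parsed.params ?? {} };
  } catch {
    // On unparseable output, fall back to the original call unmodified.
    return { tool: fallbackTool, params: {} };
  }
}
```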
- Response Filter (filter.ts):
- Receives raw data from mcpd (potentially thousands of Slack messages, large Terraform docs)
- Uses local LLM to extract ONLY relevant information matching original query
- Implements token counting to stay within configured limits
- Returns compressed, relevant subset of data
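One way to implement the token budget: a rough chars/4 estimate is usually close enough for gating (exact counts would need a real tokenizer); the 4-characters-per-token ratio is a common heuristic, not a spec, so treat these helpers as a sketch.

```typescript
// Token budgeting helpers (sketch) -- heuristic, not tokenizer-accurate.

function estimateTokens(text: string): number {
  // ~4 characters per token is a common rule of thumb for English text.
  return Math.ceil(text.length / 4);
}

function withinBudget(text: string, maxTokens: number): boolean {
  return estimateTokens(text) <= maxTokens;
}

function truncateToBudget(text: string, maxTokens: number): string {
  // Hard-truncate to the approximate character equivalent of the budget.
  return withinBudget(text, maxTokens) ? text : text.slice(0, maxTokens * 4);
}
```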
- mcpd Client (mcpd-client.ts):
- HTTP client to communicate with mcpd server
- Handles authentication (forwards Claude session token)
- Supports all MCP operations exposed by mcpd
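A minimal sketch of the mcpd client, assuming mcpd exposes an HTTP endpoint per tool and that Claude's session token is forwarded as a bearer token. The URL path and payload shape are assumptions; adapt them to mcpd's actual API.

```typescript
// mcpd-client.ts (sketch) -- thin HTTP client with auth forwarding.

function buildHeaders(sessionToken?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // Forward Claude's session token when present; omit the header otherwise.
  if (sessionToken) headers["Authorization"] = `Bearer ${sessionToken}`;
  return headers;
}

async function callMcpd(
  mcpdUrl: string,
  tool: string,
  params: Record<string, unknown>,
  sessionToken?: string
): Promise<unknown> {
  // Hypothetical per-tool endpoint; the real route depends on mcpd.
  const res = await fetch(`${mcpdUrl}/tools/${encodeURIComponent(tool)}`, {
    method: "POST",
    headers: buildHeaders(sessionToken),
    body: JSON.stringify(params),
  });
  if (!res.ok) throw new Error(`mcpd error: ${res.status}`);
  return res.json();
}
```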
Configuration:
interface ProxyConfig {
  mcpdUrl: string; // e.g., 'http://mcpd.local:3000'
  llmProvider: 'gemini-cli' | 'ollama' | 'vllm' | 'deepseek';
  llmConfig: {
    model?: string; // e.g., 'llama3.2', 'gemini-pro'
    endpoint?: string; // for vllm/deepseek
    maxTokensPerFilter: number; // target output size
  };
  filteringEnabled: boolean; // can be disabled for passthrough
}
Flow:
- Claude calls local-proxy MCP server
- Proxy interprets query semantics via local LLM
- Proxy calls mcpd with optimized query
- mcpd returns raw MCP data
- Proxy filters response via local LLM
- Claude receives minimal, relevant context
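The flow above can be sketched as a single handler with injected dependencies, so that each stage can be stubbed in tests and passthrough mode short-circuits cleanly. The names (`interpret`, `fetchFromMcpd`, `filter`) are illustrative, not a fixed API.

```typescript
// Orchestration sketch -- wires interpreter, mcpd client, and filter together.

interface ProxyDeps {
  interpret: (tool: string, args: Record<string, unknown>) => Promise<{ tool: string; params: Record<string, unknown> }>;
  fetchFromMcpd: (tool: string, params: Record<string, unknown>) => Promise<unknown>;
  filter: (raw: unknown, originalQuery: string) => Promise<unknown>;
  filteringEnabled: boolean;
}

async function handleToolCall(
  deps: ProxyDeps,
  tool: string,
  args: Record<string, unknown>
): Promise<unknown> {
  // In passthrough mode, skip interpretation and forward the call as-is.
  const query = deps.filteringEnabled ? await deps.interpret(tool, args) : { tool, params: args };
  const raw = await deps.fetchFromMcpd(query.tool, query.params);
  // Passthrough mode also skips the filtering stage entirely.
  return deps.filteringEnabled ? deps.filter(raw, JSON.stringify(args)) : raw;
}
```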
Pseudo-code for filter.ts:
async function filterResponse(
  rawData: unknown,
  originalQuery: string,
  provider: LLMProvider
): Promise<FilteredResponse> {
  const dataStr = JSON.stringify(rawData);
  // Small payloads pass through unfiltered; the LLM call would cost more than it saves
  if (dataStr.length < 4000) return { data: rawData, filtered: false };
  // Truncate oversized input before handing it to the local LLM; the provider
  // wraps it in an extraction prompt built from originalQuery (e.g. "Given this
  // query, return a JSON array of relevant items, max 10 items")
  const truncated = dataStr.slice(0, 50000);
  const filtered = await provider.filterResponse(truncated, originalQuery, 2000);
  return { data: filtered, filtered: true, originalSize: dataStr.length };
}
Test Strategy:
- Unit tests for each LLM provider with mocked HTTP/CLI responses
- Integration tests with an actual Ollama instance (docker-compose service)
- Test that query interpretation produces valid mcpd parameters
- Test that filtering reduces data size while preserving relevant content
- Load test with large payloads (10 MB JSON) to verify memory handling
- Test fallback behavior when the LLM provider is unavailable
- Test passthrough mode when filtering is disabled
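A mocked-provider test could look like this. It is framework-agnostic (swap the bare assertions for your test runner of choice), and the `mockProvider` and the filter under test are illustrative stand-ins, not the real module.

```typescript
// Unit-test sketch: verify passthrough for small payloads and that large
// payloads reach the (mocked) LLM provider exactly once.

interface FilteredResponse { data: unknown; filtered: boolean; originalSize?: number; }

const mockProvider = {
  name: "mock",
  calls: [] as string[],
  async filterResponse(_data: unknown, query: string, _maxTokens: number): Promise<unknown> {
    this.calls.push(query);       // record the call for assertions
    return ["relevant-item"];     // canned LLM output
  },
};

async function filterWithProvider(rawData: unknown, query: string): Promise<FilteredResponse> {
  const dataStr = JSON.stringify(rawData);
  if (dataStr.length < 4000) return { data: rawData, filtered: false }; // passthrough
  const data = await mockProvider.filterResponse(dataStr.slice(0, 50000), query, 2000);
  return { data, filtered: true, originalSize: dataStr.length };
}
```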