Agent OS MCP/RAG Server

Note

🤖 AI-Assisted Development Infrastructure

This is the infrastructure that powers AI-assisted development on the HoneyHive Python SDK. It also demonstrates dogfooding: HoneyHive’s own tracing is used to observe AI development workflows.

Overview

The Agent OS MCP/RAG server is a Model Context Protocol (MCP) server that provides AI coding assistants (like Cursor) with intelligent access to our development standards, workflows, and architectural patterns.

What Problem Does This Solve?

Traditional AI coding assistants face three major challenges:

Context Overload: Reading entire 50KB standards files when only 5KB is relevant to the task
Workflow Violations: Skipping critical phases (e.g., jumping to coding without planning)
No Observability: No way to trace which standards the AI actually uses or how it makes decisions

Our Solution:

90% Context Reduction: RAG engine with semantic search (50KB → 5KB)
Phase Gating: Workflow engine prevents AI from skipping steps
Full Observability: HoneyHive tracing on all AI development operations

What is Agent OS?

Agent OS is a spec-driven development methodology created by Brian Casel (Builder Methods). It provides a structured approach to AI-assisted software development through three layers of context stored as markdown files:

Layer 1: Standards (``~/.agent-os/standards/``): Your tech stack, code style, and best practices that apply across all projects.
Layer 2: Product (``.agent-os/product/``): Mission, roadmap, architecture decisions, and product-specific context.
Layer 3: Specs (``.agent-os/specs/YYYY-MM-DD-feature-name/``): Individual feature specifications with requirements, technical design, and task breakdowns.

Traditional Agent OS Approach:

AI coding assistants (like Cursor, Claude Code) directly read these markdown files using tools like codebase_search, read_file, and grep to understand your development standards and execute workflows like:

plan-product - Analyze product and create roadmap
create-spec - Generate feature specifications
execute-tasks - Implement features following specs

Learn More: https://buildermethods.com/agent-os

Our Evolution: From Builder Methods to MCP/RAG

Phase 1: Builder Methods Agent OS (Markdown Foundation)

We started with Agent OS as created by Brian Casel, implementing the traditional approach:

What We Adopted:

✅ Three-layer context architecture (Standards, Product, Specs)
✅ Markdown-based documentation system
✅ Spec-driven development methodology
✅ Command-based workflows (plan-product, create-spec, execute-tasks)

How It Worked:

AI coding assistants directly read markdown files:

User: "What are our git safety rules?"

AI: Uses codebase_search(".agent-os/standards/")
    Reads entire git-safety-rules.md (2,500 lines)
    Extracts relevant sections manually

This foundation was excellent, providing structure and consistency. However, as our codebase and standards grew, we discovered scaling challenges.

Phase 2: HoneyHive LLM Workflow Engineering

We extended Agent OS with our own LLM Workflow Engineering methodology (documented in .agent-os/standards/ai-assistant/LLM-WORKFLOW-ENGINEERING-METHODOLOGY.md):

Our Innovations:

🔧 Command Language Interface

Binding commands like 🛑 EXECUTE-NOW, 📊 QUANTIFY-RESULTS, 🎯 NEXT-MANDATORY that create non-negotiable obligations for AI execution.

🏗️ Three-Tier Architecture

Tier 1: Side-Loaded (≤100 lines): Automatic injection for systematic execution
Tier 2: Active Read (200-500 lines): On-demand comprehensive context
Tier 3: Output (Unlimited): Generated deliverables

🚨 11 Automated Pre-Commit Hooks

Quality gates enforcing: formatting, linting, tests, documentation compliance, no-mock policy, etc.

📋 Phase Gating with Evidence Requirements

Each workflow phase requires quantified evidence before progression (e.g., “test file created”, “coverage ≥90%”).

🎯 Quality Targets

100% test pass rate + 90%+ coverage + 10.0/10 Pylint + 0 MyPy errors (non-negotiable).

Example Workflow (V3 Test Generation):

# Phase 1: Analysis
🛑 EXECUTE-NOW: grep -n "^def\|^class" target_file.py
📊 COUNT-AND-DOCUMENT: Functions and classes with signatures
🎯 NEXT-MANDATORY: phases/2/dependency-analysis.md

# Evidence Required:
- Function count: <number>
- Class count: <number>
- Complexity assessment: <high/medium/low>

Results:

✅ 22% → 80%+ success rate (3.6x improvement)
✅ Systematic quality enforcement via automation
✅ Evidence-based validation preventing vague claims

But New Challenges Emerged:

❌ Context Waste: AI reads 50KB files when only 5KB needed for current task.
❌ No Programmatic Enforcement: Phase gating relies on AI compliance and can be skipped.
❌ Zero Observability: No way to trace which standards AI consulted or how decisions were made.
❌ Manual Discovery: AI must search for relevant standards each time.

Phase 3: MCP/RAG Innovation (This Implementation)

We evolved our LLM Workflow Engineering approach by building an MCP server with RAG, transforming standards access from file-based to API-based:

Builder Methods Foundation + Our Innovations + MCP/RAG = Complete Solution

✅ 90% Context Reduction via RAG

Semantic search returns only relevant chunks (5KB vs 50KB), preserving Builder Methods’ three-layer structure.

User: "What are our git safety rules?"

AI: Uses mcp_agent-os-rag_search_standards(
      query="git safety rules forbidden operations",
      n_results=5
    )

Returns: 3 relevant chunks (840 tokens) instead of entire file (12,000 tokens)

✅ Architectural Phase Gating

Workflow engine programmatically enforces our phase-gating methodology, making it impossible to skip steps.

# Cannot advance to Phase 2 without Phase 1 evidence
result = workflow_engine.complete_phase(
    session_id="abc-123",
    phase=1,
    evidence={
        "test_file_created": True,
        "framework_decision": "pytest"
    }
)

# Returns Phase 2 requirements ONLY if evidence validates

✅ Full Observability (Dogfooding HoneyHive)

Every RAG query and workflow operation is traced, demonstrating our own product in action.

✅ Intelligent Filtering

Search by phase number, tags, or semantic meaning from Builder Methods’ structured markdown.

✅ Hot Reload

File watcher automatically rebuilds index when standards change.

The Complete Evolution:

Aspect	Builder Methods Agent OS	LLM Workflow Engineering	MCP/RAG Server
Foundation	3-layer context (Standards/Product/Specs)	Command language + Phase gating	Programmatic API access
Standards Access	Direct file reading	Same (file-based)	Semantic search (90% reduction)
Workflow Enforcement	Manual AI compliance	Evidence-based validation	Architectural phase gating
Context Efficiency	Read entire files	Tier-based sizing	RAG chunk retrieval
Observability	None	Manual tracking	Full HoneyHive tracing
Quality Gates	None	11 pre-commit hooks	Same (inherited)
AI Interface	Tool calls (search, read)	Command language	MCP tools (5 tools)

Credit Where Due:

Builder Methods (Brian Casel): Three-layer architecture, spec-driven methodology, markdown standards
HoneyHive Engineering: LLM Workflow Engineering, command language, phase gating, quality automation
This Implementation: MCP/RAG server combining both approaches with programmatic enforcement and observability

Architecture

The MCP server consists of four core components:

RAG Engine (`rag_engine.py`)

Purpose: Semantic search over Agent OS standards with metadata filtering.

Technology:

LanceDB: Vector database (migrated from ChromaDB for better filtering)
sentence-transformers: Local embeddings (all-MiniLM-L6-v2 model)
Grep Fallback: When vector search is unavailable, the engine falls back to grep
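
The grep fallback can be pictured roughly as follows. This is a minimal sketch, not the code in rag_engine.py: the function names `grep_search` and `search_with_fallback`, the keyword-count scoring, and the `vector_search` callable are all assumptions made for illustration. Only the two retrieval method labels ("vector" and "grep_fallback") come from this page.

```python
from pathlib import Path
from typing import Any, Callable, List, Tuple


def grep_search(standards_dir: Path, query: str, n_results: int = 5) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for lines matching any query term."""
    terms = [term.lower() for term in query.split()]
    scored = []
    for md_file in sorted(standards_dir.rglob("*.md")):
        lines = md_file.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            lowered = line.lower()
            # Score a line by how many distinct query terms it contains.
            score = sum(term in lowered for term in terms)
            if score:
                scored.append((score, md_file.name, lineno, line.strip()))
    # Highest score first; the sort is stable, so file order breaks ties.
    scored.sort(key=lambda item: -item[0])
    return [(name, lineno, line) for _, name, lineno, line in scored[:n_results]]


def search_with_fallback(
    vector_search: Callable[[str, int], Any],  # hypothetical vector search callable
    standards_dir: Path,
    query: str,
    n_results: int = 5,
) -> Tuple[str, Any]:
    """Try vector search first; fall back to a keyword scan if it raises."""
    try:
        return "vector", vector_search(query, n_results)
    except Exception:
        return "grep_fallback", grep_search(standards_dir, query, n_results)
```

Returning the retrieval method alongside the results makes it easy to record which path served a query, which is what the retrieval_method trace attribute reports.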

Key Features:

90%+ retrieval accuracy on standard queries
<100ms average latency
Metadata filtering (phase, tags, file path)
LRU cache with configurable TTL (5-minute default)
Automatic index rebuilding
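
To make the caching feature concrete, here is a small sketch of an LRU cache with a time-to-live. It is an illustration only: the class name `TTLCache`, its parameters, and the injectable `clock` are assumptions, and the real engine may cache differently. The 300-second default mirrors the 5-minute TTL listed above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """LRU cache whose entries expire ttl_seconds after they were stored."""

    def __init__(
        self,
        max_size: int = 128,
        ttl_seconds: float = 300.0,  # 5-minute default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # Entry is too old: drop it and report a miss.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
```

A search function could key the cache on the tuple (query, n_results, filters) so that repeated questions within the TTL skip the vector lookup.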

Example Query:

from mcp_servers.rag_engine import RAGEngine

engine = RAGEngine(index_path, standards_path)

# Search with semantic meaning
result = engine.search(
    query="git safety rules forbidden operations",
    n_results=5,
    filters={"phase": 8}  # Only Phase 8 content
)

Workflow Engine (`workflow_engine.py`)

Purpose: Phase-gated workflow execution with checkpoint validation.

Workflows Supported:

test_generation_v3: 8-phase TDD test generation workflow
production_code_v2: Production code generation with quality gates

Phase Gating:

Phase 1 → Evidence → Phase 2 → Evidence → Phase 3 → ...

Cannot advance to Phase N+1 without completing Phase N evidence requirements.

Checkpoint Validation:

Each phase defines required evidence (e.g., “test file must exist”, “coverage must be 90%+”). The workflow engine validates evidence before allowing progression.
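
A checkpoint validator along these lines would enforce that rule. This is a hedged sketch: the `PHASE_REQUIREMENTS` table, the `coverage_percent` field, and the return shape are hypothetical and are not taken from workflow_engine.py. Only the Phase 1 evidence keys appear elsewhere on this page.

```python
from typing import Any, Callable, Dict, List

# Hypothetical requirement table: phase number -> evidence field -> check.
PHASE_REQUIREMENTS: Dict[int, Dict[str, Callable[[Any], bool]]] = {
    1: {
        "test_file_created": lambda v: v is True,
        "framework_decision": lambda v: isinstance(v, str) and bool(v.strip()),
    },
    2: {
        "coverage_percent": lambda v: isinstance(v, (int, float)) and v >= 90,
    },
}


def validate_evidence(phase: int, evidence: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checkpoint passes."""
    problems = []
    for field, check in PHASE_REQUIREMENTS.get(phase, {}).items():
        if field not in evidence:
            problems.append(f"missing evidence: {field}")
        elif not check(evidence[field]):
            problems.append(f"invalid evidence: {field}={evidence[field]!r}")
    return problems


def complete_phase(current_phase: int, phase: int, evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Advance only when the submitted phase is current and its evidence validates."""
    if phase != current_phase:
        problems = [f"phase {phase} is not the current phase"]
    else:
        problems = validate_evidence(phase, evidence)
    if problems:
        return {"advanced": False, "current_phase": current_phase, "problems": problems}
    return {"advanced": True, "current_phase": current_phase + 1, "problems": []}
```

Because the gate lives in server code rather than in prompt text, an assistant that submits incomplete evidence simply receives the list of problems and stays on the same phase.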

Example:

from mcp_servers.workflow_engine import WorkflowEngine

engine = WorkflowEngine(state_manager, rag_engine)

# Start workflow
state = engine.start_workflow(
    workflow_type="test_generation_v3",
    target_file="tests/unit/test_new_feature.py"
)

# Complete phase with evidence
result = engine.complete_phase(
    session_id=state.session_id,
    phase=1,
    evidence={
        "test_file_created": True,
        "framework_decision": "pytest with fixtures"
    }
)

State Manager (`state_manager.py`)

Purpose: Workflow state persistence and session lifecycle management.

Features:

JSON-based state persistence in .agent-os/workflow_sessions/
Session expiration (30-day default)
Automatic garbage collection of expired sessions
State validation and integrity checking
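
The persistence and expiration behavior could look something like the sketch below. The function names, the `updated_at` timestamp field, and the one-file-per-session layout are assumptions for illustration; the 30-day lifetime is the default listed above.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, List, Optional

SESSION_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day default


def save_state(sessions_dir: Path, state: Dict[str, Any]) -> Path:
    """Write one JSON file per session, named after its session_id."""
    sessions_dir.mkdir(parents=True, exist_ok=True)
    path = sessions_dir / f"{state['session_id']}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def load_state(sessions_dir: Path, session_id: str, now: Optional[float] = None) -> Optional[Dict[str, Any]]:
    """Return the stored state, or None if it is missing or expired."""
    path = sessions_dir / f"{session_id}.json"
    if not path.exists():
        return None
    state = json.loads(path.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    # `updated_at` is a hypothetical epoch-seconds field.
    if now - state["updated_at"] > SESSION_TTL_SECONDS:
        return None
    return state


def collect_garbage(sessions_dir: Path, now: Optional[float] = None) -> List[str]:
    """Delete expired session files and return the removed session IDs."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(sessions_dir.glob("*.json")):
        state = json.loads(path.read_text(encoding="utf-8"))
        if now - state.get("updated_at", 0) > SESSION_TTL_SECONDS:
            path.unlink()
            removed.append(path.stem)
    return removed
```

Plain JSON files keep sessions easy to inspect by hand, which helps when resuming or debugging a workflow.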

Chunker (`chunker.py`)

Purpose: Markdown document chunking for RAG indexing.

Chunking Strategy:

Size: 100-500 tokens per chunk (optimal for semantic search)
Structure: Respects markdown headers (keeps sections together)
Metadata: Extracts phase numbers, tags, and section titles
Overlap: Maintains context continuity between chunks
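
As a rough illustration of this strategy, the sketch below splits markdown at headers, windows oversized sections with overlap, and records the section title and phase number. It is not the code in chunker.py: `chunk_markdown` is a hypothetical name, whitespace-separated words stand in for tokens, and the 100-token lower bound is not enforced.

```python
import re
from typing import Dict, List, Optional, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
PHASE_RE = re.compile(r"\bphase\s+(\d+)\b", re.IGNORECASE)


def chunk_markdown(text: str, max_tokens: int = 500, overlap: int = 50) -> List[Dict[str, object]]:
    """Split markdown into header-aligned chunks of at most max_tokens words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    # Pass 1: group body lines under the nearest preceding header.
    sections: List[Tuple[str, List[str]]] = []
    title: str = ""
    body: List[str] = []
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            if body:
                sections.append((title, body))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if body:
        sections.append((title, body))

    # Pass 2: window each section, repeating `overlap` words between windows.
    chunks: List[Dict[str, object]] = []
    step = max_tokens - overlap
    for title, body in sections:
        words = " ".join(body).split()
        if not words:
            continue
        phase_match = PHASE_RE.search(title)
        phase: Optional[int] = int(phase_match.group(1)) if phase_match else None
        for start in range(0, len(words), step):
            window = words[start:start + max_tokens]
            chunks.append({"section": title, "phase": phase, "text": " ".join(window)})
            if start + max_tokens >= len(words):
                break
    return chunks
```

Storing the phase number as chunk metadata is what allows a later search to filter on it instead of relying on semantic similarity alone.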

Getting Started

Prerequisites

Cursor IDE with MCP support
Python 3.11+ with python-sdk virtual environment
Agent OS standards in .agent-os/standards/

Building the RAG Index

Before using the MCP server, build the vector index:

cd /Users/josh/src/github.com/honeyhiveai/python-sdk

# Activate project venv
source python-sdk/bin/activate

# Install MCP server dependencies
pip install -r .agent-os/mcp_servers/requirements.txt

# Build the index
python .agent-os/scripts/build_rag_index.py

Output:

🏗️  Building RAG index from Agent OS standards...
📁 Standards path: .agent-os/standards
💾 Index path: .agent-os/rag_index

📄 Processing 47 markdown files...
✅ Created 342 chunks
🎯 90.2% retrieval accuracy on test queries
⚡ Average query time: 87ms

✅ Index built successfully!

Enabling in Cursor

The MCP server is already configured in .cursor/mcp.json:

{
  "mcpServers": {
    "agent-os-rag": {
      "command": "/Users/josh/src/github.com/honeyhiveai/python-sdk/python-sdk/bin/python",
      "args": [
        "/Users/josh/src/github.com/honeyhiveai/python-sdk/.agent-os/run_mcp_server.py"
      ],
      "env": {
        "HONEYHIVE_ENABLED": "true"
      },
      "autoApprove": [
        "search_standards",
        "get_current_phase",
        "get_workflow_state"
      ]
    }
  }
}

To Enable:

Open Cursor Settings → MCP
Locate agent-os-rag server
Enable the server
Reload Cursor window

Using the MCP Tools

The MCP server provides 5 tools for AI assistants:

1. search_standards

Semantic search over Agent OS standards with filtering.

Example:

User: "What are our git safety rules?"

AI uses: mcp_agent-os-rag_search_standards(
  query="git safety rules forbidden operations",
  n_results=5
)

Returns: Relevant chunks from git-safety-rules.md

Filters:

phase: Filter by workflow phase number (1-8)
tags: Filter by metadata tags

2. start_workflow

Initialize a phase-gated workflow session.

Example:

User: "Generate tests for config/dsl/compiler.py"

AI uses: mcp_agent-os-rag_start_workflow(
  workflow_type="test_generation_v3",
  target_file="tests/unit/config/dsl/test_compiler.py"
)

Returns: Phase 1 requirements and session ID

3. get_current_phase

Retrieve current phase requirements and artifacts from previous phases.

4. complete_phase

Submit evidence and attempt to advance to next phase.

Example:

AI uses: mcp_agent-os-rag_complete_phase(
  session_id="abc-123",
  phase=1,
  evidence={
    "test_file_created": True,
    "framework_decision": "pytest"
  }
)

Returns: Phase 2 requirements if evidence validates

5. get_workflow_state

Query complete workflow state for debugging/resume capability.

Development

Running MCP Server Tests

MCP server tests have separate dependencies from the main SDK and are excluded from the main test suite:

# Activate venv with MCP dependencies
source python-sdk/bin/activate
pip install -r .agent-os/mcp_servers/requirements.txt

# Run MCP server tests only
pytest tests/unit/mcp_servers/ -v

Test Coverage:

28 comprehensive unit tests
10.0/10 Pylint score
Full type annotations (MyPy clean)
Tests for all 4 core components

Why Separate Tests?

The MCP server is an independent component with its own dependency tree:

MCP Dependencies (not in main SDK):

lancedb>=0.3.0 - Vector database
sentence-transformers>=2.0.0 - Local embeddings
watchdog>=3.0.0 - File watching
mcp>=1.0.0 - Model Context Protocol

Rationale:

✅ No dependency bloat in main SDK
✅ Faster main SDK tests (no vector DB initialization)
✅ Clear separation between SDK and tooling
✅ Independent versioning for MCP components

Adding New Tools

To add a new MCP tool:

Define the tool function in agent_os_rag.py
Add a tracing decorator for observability (the example below uses @tool_trace)
Register with MCP server in create_server()
Add to autoApprove in .cursor/mcp.json (if safe)
Write tests in tests/unit/mcp_servers/

Example:

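from typing import Sequence

import mcp.types as types
from honeyhive import enrich_span

# `server`, `tool_trace`, and `do_something` are assumed to be defined
# elsewhere in agent_os_rag.py.
@tool_trace
@server.call_tool()
async def new_tool(query: str) -> Sequence[types.TextContent]:
    """New tool description."""
    # Enrich span with input
    enrich_span({"query": query})

    # Tool logic here
    result = do_something(query)

    # Enrich span with output
    enrich_span({"result": result})

    # TextContent expects a string payload
    return [types.TextContent(type="text", text=str(result))]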

Hot Reload

The MCP server includes a file watcher that automatically rebuilds the RAG index when standards change:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class AgentOSFileWatcher(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith('.md'):
            # Debounce and rebuild index
            self._schedule_rebuild()

In Development:

Edit any .agent-os/standards/*.md file
Index automatically rebuilds in background
New content available in ~2-3 seconds
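
The "debounce and rebuild" step in the watcher can be sketched with a timer that restarts on every change, so that a burst of saves triggers a single rebuild. The `Debouncer` class below is a hypothetical helper written for illustration; it is not taken from the server code and uses only the standard library.

```python
import threading
from typing import Callable, Optional


class Debouncer:
    """Run callback once, `delay` seconds after the most recent trigger."""

    def __init__(self, callback: Callable[[], None], delay: float = 0.5) -> None:
        self._callback = callback
        self._delay = delay
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def trigger(self) -> None:
        with self._lock:
            # Cancel the pending run so only the latest trigger fires.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._callback)
            self._timer.daemon = True
            self._timer.start()
```

Under this sketch, a handler's on_modified would call trigger() for each changed .md file, with the index rebuild function passed in as the callback.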

Observability (Dogfooding HoneyHive)

Every MCP tool operation is traced with HoneyHive instrumentation, demonstrating dogfooding of our own product.

Instrumentation Pattern

All tools use the @trace decorator with span enrichment:

import os
from typing import Optional

from honeyhive import HoneyHiveTracer, trace, enrich_span
from honeyhive.models import EventType

# Initialize tracer once
tracer = HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    project="your-project-here",
    source="agent-os-mcp-server",
    verbose=True
)

# Wrap tool with tracing
@trace(tracer=tracer, event_type=EventType.tool)
async def search_standards(query: str, n_results: int, filters: Optional[dict] = None):
    # Enrich span with inputs
    enrich_span({
        "query": query,
        "n_results": n_results,
        "filters": filters
    })

    # Execute RAG search
    result = rag_engine.search(query, n_results, filters)

    # Enrich span with outputs
    enrich_span({
        "chunks_returned": len(result.chunks),
        "retrieval_method": result.retrieval_method,
        "query_time_ms": result.query_time_ms
    })

    return result

Viewing Traces

Navigate to HoneyHive dashboard
Select project: your-project-here
Filter by source: agent-os-mcp-server

Trace Attributes:

query: Semantic search query
n_results: Number of chunks requested
filters: Metadata filters applied
chunks_returned: Actual chunks returned
retrieval_method: “vector” or “grep_fallback”
query_time_ms: RAG query latency
session_id: Workflow session ID (for workflow tools)
phase: Current phase number

Span Enrichment Examples

Search Tool:

{
  "query": "git safety rules forbidden operations",
  "n_results": 5,
  "filters": null,
  "chunks_returned": 3,
  "retrieval_method": "vector",
  "query_time_ms": 87,
  "total_tokens": 840
}

Workflow Tool:

{
  "session_id": "abc-123-def-456",
  "workflow_type": "test_generation_v3",
  "target_file": "tests/unit/test_feature.py",
  "current_phase": 2,
  "phase_content_tokens": 1200
}

Troubleshooting

Import Errors

Problem: ModuleNotFoundError: No module named 'lancedb'

Solution: Install MCP server dependencies:

pip install -r .agent-os/mcp_servers/requirements.txt

Why: The MCP server has separate dependencies from the main SDK.

Index Rebuild Issues

Problem: RAG index not updating after standards changes.

Solutions:

Manual Rebuild:

python .agent-os/scripts/build_rag_index.py

Check File Watcher: Look for errors in MCP server logs (Cursor DevTools).

Clear Index:

rm -rf .agent-os/rag_index
python .agent-os/scripts/build_rag_index.py

Credential Loading

Problem: HoneyHive traces not appearing in dashboard.

Cause: The MCP server is not loading credentials from .env.

Solution: Verify .env has correct format:

export HH_API_KEY="your-key-here"
export HH_PROJECT="your-project-here"

How Credentials Load:

.cursor/mcp.json → Launches run_mcp_server.py
run_mcp_server.py → Parses .env and loads into os.environ
agent_os_rag.py → Reads from os.getenv()
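
The parsing step in that chain might look like the sketch below, which accepts the `export KEY="value"` format shown above. The function names are hypothetical and the actual logic in run_mcp_server.py may differ. In particular, letting variables that are already set take precedence over the file is a design assumption of this sketch.

```python
import os
from pathlib import Path
from typing import Dict, MutableMapping


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse KEY=value lines, tolerating an `export ` prefix and quotes."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        value = value.strip()
        # Strip one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path: Path, environ: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Copy parsed values into environ without overwriting existing keys."""
    values = parse_env_file(path)
    for key, value in values.items():
        environ.setdefault(key, value)
    return values
```

If traces are missing, running a parser like this against the project's .env is a quick way to confirm that both HH_API_KEY and HH_PROJECT are actually being read.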

Debug:

Check MCP server logs in Cursor DevTools for:

DEBUG: HH_API_KEY=SET
DEBUG: HONEYHIVE_PROJECT=your-project-here
🍯 HoneyHive tracing enabled for dogfooding

No Traces Appearing

Problem: MCP server running but no traces in HoneyHive.

Checklist:

✅ HONEYHIVE_ENABLED="true" in .cursor/mcp.json env
✅ Valid HH_API_KEY and HH_PROJECT in .env
✅ Tracer initialized successfully (check logs)
✅ Using correct project in HoneyHive dashboard

Debugging:

Enable verbose logging in agent_os_rag.py:

tracer = HoneyHiveTracer.init(
    verbose=True  # Already enabled
)

