🌱

A Starting Point

Where The Journey Begins

The Journey Begins

Musings

The nature of a journey suggests a meandering approach: let each step reveal a path, understand what exists and what does not, and highlight the gap between them. Curiosity about what is absent supplies the direction, with an encouraging tug.

The Virtuous Circle

  1. Capabilities become known - What can be done
  2. Limitations are named - What cannot
  3. The gap is highlighted - What is missing
  4. The next step is identified - Curiosity’s encouraging tug
  5. Pause and observe - Return to step 1

The Virtuous Circle needs a starting point, and creating one requires that it first be named and described.

The Naming

AIlumina = AI + lumina (Latin: light, illumination)

The name combines “AI” with the Latin word for light.

The Describing

AIlumina: a multi-provider, real-time conversational agent. A working conversational AI with flexible, natural interaction patterns. It is purely reactive: it responds but does not remember, converses but does not learn. It has no continuity of self.

Backend Architecture:

  • Support for multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq)
  • Direct HTTP transport layer - no SDK dependencies, full protocol control
  • WebSocket streaming for real-time communication
  • Type-safe implementation
  • Configuration-driven agent definitions

Frontend Interaction:

  • Four natural communication modalities:
    • Type + Read (text in, text out)
    • Speak + Read (voice in, text out)
    • Type + Listen (text in, voice out)
    • Speak + Listen (voice in, voice out)
  • Independent Speech Recognition and Text-to-Speech toggle controls
  • State machine for deterministic conversation flow
  • Coordinated SR/TTS synchronization (AI doesn’t hear itself, seamless transitions)
  • Browser-native APIs (Web Speech, Speech Synthesis) - zero backend configuration

Fundamental Absences:

  • No temporal continuity - The system resets between sessions, forgetting everything
  • No persistent memory - Cannot build knowledge over time or remember past interactions
  • No persistent identity - No sense of “I” that persists across conversations
  • No self-observation capability - Cannot reflect on its own operations or behavior
  • No deterministic operations - Pure probabilistic token generation (System 1 thinking only)
  • No tools or functions - Cannot perform reliable, verifiable operations

Implementation

The starting point, once named and described, must be built.

1. Multi-Provider Architecture - Supporting 6+ AI Providers

The Question: How does a single system support multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq) without duplicating code?

The Answer: Abstract base class defining common interface, with provider-specific implementations extending it.

The Architecture:

  1. Base Provider Interface

    • Abstract class defines common contract
    • Properties: agent_name, service_provider, model_name, system_prompt
    • Abstract method: makeApiCall() - each provider implements differently
    • Common functionality in base, provider-specific in extensions
  2. Provider Implementations

    • AnthropicProvider extends BaseServiceProvider
    • OpenAIProvider extends BaseServiceProvider
    • GoogleProvider extends BaseServiceProvider
    • Each handles its own message format, API specifics
  3. Service Factory Pattern

    • Looks up agent configuration
    • Determines service_provider
    • Instantiates appropriate provider class
    • Returns unified ServiceProvider interface
  4. Configuration-Driven Selection

    • agents.json specifies service_provider per agent
    • Different agents can use different providers
    • Switch providers by changing configuration
    • No code changes needed
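
A minimal TypeScript sketch of this pattern. The class and field names (BaseServiceProvider, AnthropicProvider, ServiceFactory, makeApiCall, agent_name, service_provider, model_name, system_prompt) are taken from the description above; the exact signatures and constructor shapes are assumptions for illustration.

// Any object that can push text back to the client (the WebSocket in practice).
interface OutSocket { send(data: string): void }

// Common contract loaded from agents.json (fields from the description above).
interface AgentConfig {
  agent_name: string;
  service_provider: string;   // e.g. "ANTHROPIC", "OPENAI", "GOOGLE"
  model_name: string;
  system_prompt: string;
  do_stream?: boolean;
}

abstract class BaseServiceProvider {
  constructor(protected readonly config: AgentConfig) {}

  // Each provider translates the shared message format into its own API call.
  abstract makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void>;
}

class AnthropicProvider extends BaseServiceProvider {
  async makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void> {
    // Anthropic-specific message formatting and transport would live here.
  }
}

class OpenAIProvider extends BaseServiceProvider {
  async makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void> {
    // OpenAI-specific message formatting and transport would live here.
  }
}

// The factory maps the configured service_provider onto a concrete class.
class ServiceFactory {
  static create(config: AgentConfig): BaseServiceProvider {
    switch (config.service_provider) {
      case "ANTHROPIC": return new AnthropicProvider(config);
      case "OPENAI":    return new OpenAIProvider(config);
      default: throw new Error(`Unknown service_provider: ${config.service_provider}`);
    }
  }
}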

The Flow:

WebSocket connection
  ↓ agent type from URL
  ↓ look up in agents.json
Agent configuration retrieved
  ↓ service_provider: "ANTHROPIC"
ServiceFactory creates provider
  ↓ new AnthropicProvider(config)
Provider ready to handle requests

Benefits:

  • Flexibility: Easy to add new providers
  • Consistency: Same interface regardless of provider
  • Configuration: Switch providers without code changes
  • Independence: Providers don’t know about each other

2. WebSocket Real-Time Streaming - Instant Response Delivery

The Question: How does the system deliver AI responses in real-time as they’re generated?

The Answer: WebSocket bidirectional communication with streaming support.

The Architecture:

  1. WebSocket Connection Establishment

    • Client connects to /ws/{agent_type}
    • Server validates agent type exists
    • Loads agent configuration
    • Creates provider instance
    • Maintains persistent connection
  2. Message Flow

    • Client sends: user input + message history via WebSocket
    • Server validates and filters messages
    • Calls provider’s makeApiCall() with streaming=true
    • Provider streams chunks back through WebSocket
    • Client receives and displays incrementally
  3. Streaming vs Non-Streaming

    • Controlled by agent’s do_stream configuration flag
    • Streaming: Sends chunks as generated, better UX
    • Non-streaming: Waits for complete response, simpler handling
    • Same WebSocket connection supports both modes
  4. Error Handling

    • Validation errors sent immediately, connection closed
    • Provider errors caught, user-friendly message sent
    • Connection state managed properly on errors
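
Since the backend runs on Bun and TypeScript, a minimal sketch of this connection flow could look like the following, using Bun's built-in WebSocket server. The route shape (/ws/{agent_type}) and the do_stream flag come from the description above; the message payload shape and the imported helpers (getAgentConfig, ServiceFactory) are assumptions for illustration.

import { getAgentConfig } from "./agentConfig";     // assumed helper (see section 4)
import { ServiceFactory } from "./serviceFactory";  // assumed module (see section 1)

// Minimal sketch: upgrade /ws/{agent_type} to a WebSocket and stream replies over it.
Bun.serve({
  port: 3000,
  fetch(req, server) {
    const url = new URL(req.url);
    const match = url.pathname.match(/^\/ws\/(.+)$/);
    if (!match) return new Response("Not found", { status: 404 });

    // Attach the agent type to the socket so message handlers can see it.
    if (server.upgrade(req, { data: { agentType: match[1] } })) return;
    return new Response("Upgrade failed", { status: 400 });
  },
  websocket: {
    async message(ws, raw) {
      const { userInput, history } = JSON.parse(String(raw));
      const config = getAgentConfig((ws.data as { agentType: string }).agentType);
      const provider = ServiceFactory.create(config);

      // The provider streams chunks back through the same socket.
      await provider.makeApiCall(
        ws,
        [...history, { role: "user", content: userInput }],
        config.do_stream ?? true,
      );
    },
  },
});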

The Flow:

User types message
  ↓ Client sends via WebSocket
Server receives
  ↓ filters message history
  ↓ calls provider.makeApiCall(ws, streaming=true)
Provider streams response
  ↓ sends chunks through WebSocket
Client receives chunks
  ↓ displays incrementally
Response complete
  ↓ waiting for next message

Benefits:

  • Real-time: User sees response as it generates
  • Bidirectional: Single connection for all communication
  • Efficient: No polling, instant updates
  • Flexible: Supports both streaming and batch modes

3. Direct HTTP Transport - Full Protocol Control

The Question: How does the system communicate with AI provider APIs without dependency on official SDKs?

The Answer: Custom HTTP transport layer using native fetch(), implementing provider protocols directly.

The Architecture:

  1. No SDK Dependencies

    • Native fetch() for HTTP requests
    • No @anthropic-ai/sdk, openai, @google/generative-ai packages
    • Direct implementation of provider API protocols
    • Full control over request/response handling
  2. Provider-Specific Transports

    • Each provider has dedicated transport class
    • Handles provider-specific message formatting
    • Manages provider-specific headers, authentication
    • Returns standardized TransportResult interface
  3. Shared Result Interface

    • All transports return: { type: 'streaming' | 'non_streaming', data, raw }
    • Provider differences hidden behind common interface
    • Consumers don’t need provider-specific knowledge
  4. Independent Implementations

    • No base transport class - each fully independent
    • Reduces coupling, easier to modify per provider
    • Provider API changes affect only that transport
    • Clean separation of concerns
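
A minimal sketch of one such transport, calling Anthropic's Messages endpoint directly with native fetch(). The TransportResult shape and the class name AnthropicAPITransport come from the description above; the request body fields follow Anthropic's published HTTP protocol, and the error handling is a simplified assumption.

// Standardized result every transport returns (shape taken from the description above).
interface TransportResult {
  type: "streaming" | "non_streaming";
  data: unknown;        // parsed JSON body, or a ReadableStream of SSE chunks
  raw: Response;        // underlying fetch Response, kept for inspection
}

class AnthropicAPITransport {
  constructor(private readonly apiKey: string) {}

  async send(
    model: string,
    system: string,
    messages: { role: "user" | "assistant"; content: string }[],
    stream: boolean,
  ): Promise<TransportResult> {
    // Direct call to Anthropic's Messages endpoint - no SDK involved.
    const response = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": this.apiKey,
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({ model, system, messages, max_tokens: 1024, stream }),
    });

    if (!response.ok) {
      throw new Error(`Anthropic API error ${response.status}: ${await response.text()}`);
    }

    return stream
      ? { type: "streaming", data: response.body, raw: response }
      : { type: "non_streaming", data: await response.json(), raw: response };
  }
}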

The Flow:

Provider needs to make API call
  ↓ selects transport (AnthropicAPITransport)
  ↓ formats messages to provider schema
Transport sends HTTP request
  ↓ fetch(provider_url, { provider_specific_format })
Provider API responds
  ↓ streaming or batch
Transport parses response
  ↓ converts to standard TransportResult
Provider returns to caller

Benefits:

  • Full control: Complete visibility into requests/responses
  • Minimal dependencies: Only native APIs, smaller bundle
  • Debugging: Easy to inspect exact API communication
  • Flexibility: Can customize behavior per provider

Trade-off: Custom maintenance vs SDK convenience

4. JSON-Based Agent Configuration - Declarative Agent Definitions

The Question: How are agents configured without hardcoding in source code?

The Answer: JSON configuration file (agents.json) defining all agent properties declaratively.

The Architecture:

  1. Configuration Structure

    • Single agents.json file
    • Each agent: key, agent_name, service_provider, model_name, system_prompt
    • Optional: do_stream, available_functions, custom_settings, mcp_servers
    • At baseline: available_functions is an empty array
  2. Configuration Loading

    • AgentConfigManager reads agents.json on startup
    • Provides getAgentConfig(name) lookup
    • Validates required fields
    • Caches in memory for fast access
  3. Runtime Selection

    • WebSocket URL includes agent type: /ws/AIlumina
    • System looks up “AIlumina” in loaded configurations
    • Instantiates with those specific settings
    • Multiple agents, different configurations, same codebase
  4. Configuration-Driven Behavior

    • System prompt shapes agent personality
    • service_provider selects which AI backend
    • model_name chooses specific model
    • do_stream controls streaming behavior
    • available_functions: empty at baseline (no tools yet)
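
A sketch of the configuration shape and loader. The field names come from the description above; the example entry and the AgentConfigManager internals are illustrative assumptions.

import { readFileSync } from "fs";

// Shape of one entry in agents.json (fields taken from the description above).
interface AgentConfig {
  agent_name: string;
  service_provider: "ANTHROPIC" | "OPENAI" | "GOOGLE" | "OLLAMA" | "LMSTUDIO" | "GROQ";
  model_name: string;
  system_prompt: string;
  do_stream?: boolean;
  available_functions?: string[];   // empty array at baseline - no tools yet
}

// Illustrative agents.json entry:
// {
//   "AIlumina": {
//     "agent_name": "AIlumina",
//     "service_provider": "ANTHROPIC",
//     "model_name": "claude-3-5-sonnet-20241022",
//     "system_prompt": "You are AIlumina...",
//     "do_stream": true,
//     "available_functions": []
//   }
// }

class AgentConfigManager {
  private readonly agents: Record<string, AgentConfig>;

  constructor(path = "agents.json") {
    // Read once at startup and cache in memory for fast lookups.
    this.agents = JSON.parse(readFileSync(path, "utf-8"));
  }

  getAgentConfig(name: string): AgentConfig {
    const config = this.agents[name];
    if (!config) throw new Error(`Unknown agent: ${name}`);
    return config;
  }
}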

The Flow:

Server startup
  ↓ reads agents.json
  ↓ loads all agent configurations
Client connects to /ws/AIlumina
  ↓ looks up "AIlumina" in config
Configuration found
  ↓ service_provider: ANTHROPIC
  ↓ model: claude-3-5-sonnet-20241022
  ↓ system_prompt: "You are AIlumina..."
Agent created with those settings

Benefits:

  • Declarative: Agent behavior defined in data, not code
  • Flexible: Change agents without code changes
  • Multiple agents: Single codebase, many configurations
  • Extensible: Ready for future capabilities (available_functions)

5. Natural Interaction Modalities - Flexible Communication

The Question: How does the system support natural, flexible interaction rather than forcing users into a single mode?

The Answer: Four interaction modalities through independent Speech Recognition and Text-to-Speech controls.

The Architecture:

  1. Four Natural Modalities

    • Type + Read: Text in, text out (traditional chat)
    • Speak + Read: Voice in, text out (dictation mode)
    • Type + Listen: Text in, voice out (accessibility/multitasking)
    • Speak + Listen: Voice in, voice out (conversation mode)
  2. Independent Toggle Controls

    • Speech Recognition (SR) toggle - controls input modality
    • Text-to-Speech (TTS) toggle - controls output modality
    • Independent state: SR can be on while TTS off, or vice versa
    • Users switch modes mid-conversation as needs change
  3. Browser-Native APIs

    • Speech Recognition: Web Speech API (window.SpeechRecognition)
    • Text-to-Speech: Speech Synthesis API (window.speechSynthesis)
    • Zero backend configuration required
    • Works entirely in browser
  4. Modality Mixing

    • Start typing, switch to voice input mid-conversation
    • Read responses while working, switch to listening when hands-free
    • Natural transitions match how humans actually communicate
    • System adapts to user’s current context
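
A minimal sketch of the two browser APIs behind the toggles. SpeechRecognition is vendor-prefixed as webkitSpeechRecognition in Chromium browsers; the toggle wiring and the sendToAgent helper are illustrative assumptions, not the project's actual component code.

declare function sendToAgent(text: string): void;   // assumed helper that forwards text to the WebSocket

// Speech Recognition (input modality) - Web Speech API, vendor-prefixed in Chromium.
const SR = (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SR();
recognition.continuous = true;
recognition.onresult = (event: any) => {
  // Forward the latest transcript as user input.
  const result = event.results[event.results.length - 1];
  sendToAgent(result[0].transcript);
};

// Text-to-Speech (output modality) - Speech Synthesis API, built into modern browsers.
function speak(text: string) {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

// Independent toggles: the four modalities are just the combinations of these two flags.
let srEnabled = false;   // Speak (on) vs Type (off)
let ttsEnabled = false;  // Listen (on) vs Read (off); speak() is called only when true

function toggleSpeechRecognition() {
  srEnabled = !srEnabled;
  if (srEnabled) recognition.start();
  else recognition.stop();
}

function toggleSpeechSynthesis() {
  ttsEnabled = !ttsEnabled;
}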

The Flow:

User enables SR + TTS (Speak + Listen mode)
  ↓ speaks: "What's the weather?"
SR captures speech → text
  ↓ sends to AI
AI responds with text
  ↓ TTS synthesizes to speech
User hears response
  ↓ can toggle to Type + Read anytime

Benefits:

  • Natural: Matches human communication flexibility
  • Accessible: Multiple input/output options
  • Context-adaptive: Switch modes as situation changes
  • Zero configuration: Browser APIs, no backend setup

6. SR/TTS Synchronization - Solving the Coordination Challenge

The Question: When both Speech Recognition and Text-to-Speech are active, how do you prevent the AI from hearing itself and coordinate state transitions?

The Answer: Observer pattern with useRef state tracking to manage interdependent lifecycles without React closure issues.

The Challenges & Solutions:

  1. Feedback Prevention

    • Problem: When TTS speaks, SR hears the AI and creates a feedback loop
    • Solution: TTS observer stops SR when utterance starts
    • Implementation: Observer listens to TTS “start” event → pause SR
  2. Seamless Restart

    • Problem: SR must restart automatically after TTS completes
    • Solution: TTS observer restarts SR on utterance “end” event
    • Implementation: Observer listens to “end” → restart SR if enabled
  3. Independent State Control

    • Problem: User can toggle SR/TTS independently mid-conversation
    • Solution: Separate boolean flags in state machine context
    • Implementation: speechRecognitionEnabled, speechSynthesisEnabled tracked separately
  4. Browser SR Auto-Restart

    • Problem: The browser’s Web Speech API auto-restarts SR after roughly every 8 seconds of silence
    • Creates: visible state transitions (listening → ready → listening)
    • Solution: The UI displays stable text (“Speech recognition active…”) while a status indicator shows the actual SR state
    • User Experience: Continuity despite the underlying cycling
  5. React Stale Closures

    • Problem: useEffect closures capture stale state values
    • Impact: A TTS observer that depends on SR state breaks when that state changes
    • Effect cleanup/recreation destroys the TTS synchronization
    • Solution: useRef to track SR state without effect dependencies
    • Implementation: Observer reads current SR state from ref, not closure
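
A minimal sketch of the coordination described above, as a React hook. The useRef pattern mirrors the stated solution; the hook name, parameters, and event wiring are assumptions for illustration.

import { useRef } from "react";

// Sketch: a TTS observer pauses and resumes SR around each utterance, reading the
// SR toggle from a ref so the handlers never capture a stale closure value.
function useSpeechCoordination(
  recognition: { start(): void; stop(): void },   // the Web Speech recognizer
  srEnabled: boolean,                              // current SR toggle from state
) {
  const srEnabledRef = useRef(srEnabled);
  srEnabledRef.current = srEnabled;                // always holds the latest value

  function speak(text: string) {
    const utterance = new SpeechSynthesisUtterance(text);

    utterance.onstart = () => {
      // Feedback prevention: stop listening while the AI is speaking.
      if (srEnabledRef.current) recognition.stop();
    };
    utterance.onend = () => {
      // Seamless restart: resume listening only if the user still has SR enabled.
      if (srEnabledRef.current) recognition.start();
    };

    window.speechSynthesis.speak(utterance);
  }

  return { speak };
}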

The Architecture:

TTS utterance starts
  ↓ TTS observer detects "start" event
  ↓ checks SR ref: is it enabled?
  ↓ if yes: pause SR
AI speaks
  ↓ user hears response
TTS utterance ends
  ↓ TTS observer detects "end" event
  ↓ checks SR ref: is it enabled?
  ↓ if yes: restart SR
SR resumes listening
  ↓ ready for next user input

Benefits:

  • No feedback: AI never hears itself speak
  • Seamless UX: Automatic transitions, no manual intervention
  • Stable: Survives browser SR recycling
  • Independent: Users control SR and TTS separately
  • Robust: No React closure bugs

7. State Machine for Conversation Flow - Deterministic UI State

The Question: How does the UI manage conversation state transitions reliably without race conditions?

The Answer: XState v5 finite state machine with flat hierarchy and explicit transitions.

The Architecture:

  1. Four Core States

    • WAITING: Ready for user input
    • THINKING: Processing request (calling AI)
    • RESPONDING: Streaming/displaying AI response
    • ERROR: Error occurred, can retry
  2. Explicit State Transitions

    • WAITING → THINKING (on SUBMIT_TEXT)
    • THINKING → RESPONDING (on AI_RESPONSE_RECEIVED)
    • RESPONDING → WAITING (on AI_COMPLETE)
    • Any → ERROR (on AI_ERROR)
    • ERROR → THINKING (on retry)
  3. Context for Data

    • messages: conversation history
    • aiResponse: current response being built
    • speechRecognitionEnabled: SR toggle state
    • speechSynthesisEnabled: TTS toggle state
    • Independent boolean flags prevent coupling
  4. Toggle Actions Separate from Flow

    • TOGGLE_SPEECH_RECOGNITION action: update SR flag, don’t change state
    • TOGGLE_SPEECH_SYNTHESIS action: update TTS flag, don’t change state
    • User can toggle modalities without disrupting conversation flow
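
A minimal sketch of such a machine in XState v5. The state names, event names, and context fields come from the list above; the action details are simplified assumptions.

import { createMachine, assign } from "xstate";

// Flat four-state machine; toggle events update context without changing state.
const conversationMachine = createMachine({
  id: "conversation",
  initial: "waiting",
  context: {
    messages: [] as { role: string; content: string }[],
    aiResponse: "",
    speechRecognitionEnabled: false,
    speechSynthesisEnabled: false,
  },
  on: {
    // Modality toggles are independent of conversation flow.
    TOGGLE_SPEECH_RECOGNITION: {
      actions: assign({
        speechRecognitionEnabled: ({ context }) => !context.speechRecognitionEnabled,
      }),
    },
    TOGGLE_SPEECH_SYNTHESIS: {
      actions: assign({
        speechSynthesisEnabled: ({ context }) => !context.speechSynthesisEnabled,
      }),
    },
    AI_ERROR: ".error",   // any state → ERROR
  },
  states: {
    waiting:    { on: { SUBMIT_TEXT: "thinking" } },
    thinking:   { on: { AI_RESPONSE_RECEIVED: "responding" } },
    responding: { on: { AI_COMPLETE: "waiting" } },
    error:      { on: { RETRY: "thinking" } },
  },
});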

The Flow:

State: WAITING
  ↓ user submits message
  ↓ event: SUBMIT_TEXT
State: THINKING
  ↓ AI responds
  ↓ event: AI_RESPONSE_RECEIVED
State: RESPONDING
  ↓ response complete
  ↓ event: AI_COMPLETE
State: WAITING

Benefits:

  • Deterministic: Same event in same state always produces same transition
  • Impossible states prevented: Can’t be THINKING and WAITING simultaneously
  • Testable: State transitions can be unit tested
  • Type-safe: TypeScript ensures events and states are valid
  • Flat hierarchy: Simple to understand, no nested complexity

8. Technology Choices - TypeScript, Bun, Browser APIs

The Question: What technology choices enable rapid, deterministic development?

The Answer: TypeScript for type safety, Bun for performance, browser-native APIs for zero config.

TypeScript Over Python:

  • Type safety: Compile-time validation catches errors before runtime
  • Deterministic behavior: Type constraints enforce correctness
  • Unified language: Same language frontend and backend
  • Superior tooling: IDE support, refactoring, intellisense
  • Insight: Consciousness research requires deterministic operations—TypeScript’s type system provides guarantees dynamic typing cannot

Bun Over Node.js/npm:

  • Test execution: 8-9ms (vs 5-10 seconds with Vitest)
  • Package install: ~1 second (vs 30-60 seconds with npm)
  • Native TypeScript: Direct execution, no transpilation overhead
  • Drop-in replacement: Compatible with Node.js ecosystem
  • Development velocity: Near-instant feedback loops

Browser-Native APIs:

  • Speech Recognition: Web Speech API (zero backend setup)
  • Text-to-Speech: Speech Synthesis API (built into browsers)
  • No dependencies: No npm packages, no backend config
  • Wide support: Chrome, Edge (SR), all modern browsers (TTS)
  • Instant availability: Works out of the box

Benefits:

  • Velocity: Sub-10ms test feedback, sub-second installs
  • Reliability: Type safety catches errors early
  • Simplicity: Browser APIs eliminate backend complexity
  • Determinism: TypeScript enforces correctness at compile-time

Evidence

The implementation exists as runnable code:

Backend:

Frontend: