🌱

A Starting Point

Where The Journey Begins

The Journey Begins

Musings

The nature of a journey suggests a meandering approach: let each step reveal a path, understand what exists and what does not, and highlight the gap between them. Curiosity about what is absent supplies the direction, with an encouraging tug.

The Virtuous Circle

  1. Capabilities become known - What can be done
  2. Limitations are named - What cannot
  3. The gap is highlighted - What is missing
  4. The next step is identified - Curiosity’s encouraging tug
  5. Pause and observe - Return to step 1

The Virtuous Circle needs a starting point, and creating one requires that it first be named and described.

The Naming

AIlumina = AI + lumina (Latin: light, illumination)

The name combines “AI” with the Latin word for light.

The Describing

AIlumina: a multi-provider, real-time conversational agent. A working conversational AI with flexible, natural interaction patterns. It is purely reactive: it responds but does not remember, converses but does not learn. It has no continuity of self.

Backend Architecture:

  • Support for multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq)
  • Direct HTTP transport layer - no SDK dependencies, full protocol control
  • WebSocket streaming for real-time communication
  • Type-safe implementation
  • Configuration-driven agent definitions

Frontend Interaction:

  • Four natural communication modalities:
    • Type + Read (text in, text out)
    • Speak + Read (voice in, text out)
    • Type + Listen (text in, voice out)
    • Speak + Listen (voice in, voice out)
  • Independent Speech Recognition and Text-to-Speech toggle controls
  • State machine for deterministic conversation flow
  • Coordinated SR/TTS synchronization (AI doesn’t hear itself, seamless transitions)
  • Browser-native APIs (Web Speech, Speech Synthesis) - zero backend configuration

Fundamental Absences:

  • No temporal continuity - The system resets between sessions, forgetting everything
  • No persistent memory - Cannot build knowledge over time or remember past interactions
  • No persistent identity - No sense of “I” that persists across conversations
  • No self-observation capability - Cannot reflect on its own operations or behavior
  • No deterministic operations - Pure probabilistic token generation (System 1 thinking only)
  • No tools or functions - Cannot perform reliable, verifiable operations

Implementation

The starting point, once named and described, must be built.

1. Multi-Provider Architecture - Supporting 6+ AI Providers

The Question: How does a single system support multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq) without duplicating code?

The Answer: Abstract base class defining common interface, with provider-specific implementations extending it.

The Architecture:

  1. Base Provider Interface

    • Abstract class defines common contract
    • Properties: agent_name, service_provider, model_name, system_prompt
    • Abstract method: makeApiCall() - each provider implements differently
    • Common functionality in base, provider-specific in extensions
  2. Provider Implementations

    • AnthropicProvider extends BaseServiceProvider
    • OpenAIProvider extends BaseServiceProvider
    • GoogleProvider extends BaseServiceProvider
    • Each handles its own message format, API specifics
  3. Service Factory Pattern

    • Looks up agent configuration
    • Determines service_provider
    • Instantiates appropriate provider class
    • Returns unified ServiceProvider interface
  4. Configuration-Driven Selection

    • agents.json specifies service_provider per agent
    • Different agents can use different providers
    • Switch providers by changing configuration
    • No code changes needed
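
A minimal TypeScript sketch of this pattern. The class and field names (BaseServiceProvider, AnthropicProvider, ServiceFactory, makeApiCall, agent_name, service_provider, model_name, system_prompt) are taken from the description above; the exact signatures and constructor shapes are assumptions for illustration.

// Any object that can push text back to the client (the WebSocket in practice).
interface OutSocket { send(data: string): void }

// Common contract loaded from agents.json (fields from the description above).
interface AgentConfig {
  agent_name: string;
  service_provider: string;   // e.g. "ANTHROPIC", "OPENAI", "GOOGLE"
  model_name: string;
  system_prompt: string;
  do_stream?: boolean;
}

abstract class BaseServiceProvider {
  constructor(protected readonly config: AgentConfig) {}

  // Each provider translates the shared message format into its own API call.
  abstract makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void>;
}

class AnthropicProvider extends BaseServiceProvider {
  async makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void> {
    // Anthropic-specific message formatting and transport would live here.
  }
}

class OpenAIProvider extends BaseServiceProvider {
  async makeApiCall(ws: OutSocket, messages: unknown[], streaming: boolean): Promise<void> {
    // OpenAI-specific message formatting and transport would live here.
  }
}

// The factory maps the configured service_provider onto a concrete class.
class ServiceFactory {
  static create(config: AgentConfig): BaseServiceProvider {
    switch (config.service_provider) {
      case "ANTHROPIC": return new AnthropicProvider(config);
      case "OPENAI":    return new OpenAIProvider(config);
      default: throw new Error(`Unknown service_provider: ${config.service_provider}`);
    }
  }
}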

The Flow:

WebSocket connection
  ↓ agent type from URL
  ↓ look up in agents.json
Agent configuration retrieved
  ↓ service_provider: "ANTHROPIC"
ServiceFactory creates provider
  ↓ new AnthropicProvider(config)
Provider ready to handle requests

Benefits:

  • Flexibility: Easy to add new providers
  • Consistency: Same interface regardless of provider
  • Configuration: Switch providers without code changes
  • Independence: Providers don’t know about each other

2. WebSocket Real-Time Streaming - Instant Response Delivery

The Question: How does the system deliver AI responses in real-time as they’re generated?

The Answer: WebSocket bidirectional communication with streaming support.

The Architecture:

  1. WebSocket Connection Establishment

    • Client connects to /ws/{agent_type}
    • Server validates agent type exists
    • Loads agent configuration
    • Creates provider instance
    • Maintains persistent connection
  2. Message Flow

    • Client sends: user input + message history via WebSocket
    • Server validates and filters messages
    • Calls provider’s makeApiCall() with streaming=true
    • Provider streams chunks back through WebSocket
    • Client receives and displays incrementally
  3. Streaming vs Non-Streaming

    • Controlled by agent’s do_stream configuration flag
    • Streaming: Sends chunks as generated, better UX
    • Non-streaming: Waits for complete response, simpler handling
    • Same WebSocket connection supports both modes
  4. Error Handling

    • Validation errors sent immediately, connection closed
    • Provider errors caught, user-friendly message sent
    • Connection state managed properly on errors
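
Since the backend runs on Bun and TypeScript, a minimal sketch of this connection flow could look like the following, using Bun's built-in WebSocket server. The route shape (/ws/{agent_type}) and the do_stream flag come from the description above; the message payload shape and the imported helpers (getAgentConfig, ServiceFactory) are assumptions for illustration.

import { getAgentConfig } from "./agentConfig";     // assumed helper (see section 4)
import { ServiceFactory } from "./serviceFactory";  // assumed module (see section 1)

// Minimal sketch: upgrade /ws/{agent_type} to a WebSocket and stream replies over it.
Bun.serve({
  port: 3000,
  fetch(req, server) {
    const url = new URL(req.url);
    const match = url.pathname.match(/^\/ws\/(.+)$/);
    if (!match) return new Response("Not found", { status: 404 });

    // Attach the agent type to the socket so message handlers can see it.
    if (server.upgrade(req, { data: { agentType: match[1] } })) return;
    return new Response("Upgrade failed", { status: 400 });
  },
  websocket: {
    async message(ws, raw) {
      const { userInput, history } = JSON.parse(String(raw));
      const config = getAgentConfig((ws.data as { agentType: string }).agentType);
      const provider = ServiceFactory.create(config);

      // The provider streams chunks back through the same socket.
      await provider.makeApiCall(
        ws,
        [...history, { role: "user", content: userInput }],
        config.do_stream ?? true,
      );
    },
  },
});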

The Flow:

User types message
  ↓ Client sends via WebSocket
Server receives
  ↓ filters message history
  ↓ calls provider.makeApiCall(ws, streaming=true)
Provider streams response
  ↓ sends chunks through WebSocket
Client receives chunks
  ↓ displays incrementally
Response complete
  ↓ waiting for next message

Benefits:

  • Real-time: User sees response as it generates
  • Bidirectional: Single connection for all communication
  • Efficient: No polling, instant updates
  • Flexible: Supports both streaming and batch modes

3. Direct HTTP Transport - Full Protocol Control

The Question: How does the system communicate with AI provider APIs without dependency on official SDKs?

The Answer: Custom HTTP transport layer using native fetch(), implementing provider protocols directly.

The Architecture:

  1. No SDK Dependencies

    • Native fetch() for HTTP requests
    • No @anthropic-ai/sdk, openai, @google/generative-ai packages
    • Direct implementation of provider API protocols
    • Full control over request/response handling
  2. Provider-Specific Transports

    • Each provider has dedicated transport class
    • Handles provider-specific message formatting
    • Manages provider-specific headers, authentication
    • Returns standardized TransportResult interface
  3. Shared Result Interface

    • All transports return: { type: 'streaming' | 'non_streaming', data, raw }
    • Provider differences hidden behind common interface
    • Consumers don’t need provider-specific knowledge
  4. Independent Implementations

    • No base transport class - each fully independent
    • Reduces coupling, easier to modify per provider
    • Provider API changes affect only that transport
    • Clean separation of concerns
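
A minimal sketch of one such transport, calling Anthropic's Messages endpoint directly with native fetch(). The TransportResult shape and the class name AnthropicAPITransport come from the description above; the request body fields follow Anthropic's published HTTP protocol, and the error handling is a simplified assumption.

// Standardized result every transport returns (shape taken from the description above).
interface TransportResult {
  type: "streaming" | "non_streaming";
  data: unknown;        // parsed JSON body, or a ReadableStream of SSE chunks
  raw: Response;        // underlying fetch Response, kept for inspection
}

class AnthropicAPITransport {
  constructor(private readonly apiKey: string) {}

  async send(
    model: string,
    system: string,
    messages: { role: "user" | "assistant"; content: string }[],
    stream: boolean,
  ): Promise<TransportResult> {
    // Direct call to Anthropic's Messages endpoint - no SDK involved.
    const response = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": this.apiKey,
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({ model, system, messages, max_tokens: 1024, stream }),
    });

    if (!response.ok) {
      throw new Error(`Anthropic API error ${response.status}: ${await response.text()}`);
    }

    return stream
      ? { type: "streaming", data: response.body, raw: response }
      : { type: "non_streaming", data: await response.json(), raw: response };
  }
}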

The Flow:

Provider needs to make API call
  ↓ selects transport (AnthropicAPITransport)
  ↓ formats messages to provider schema
Transport sends HTTP request
  ↓ fetch(provider_url, { provider_specific_format })
Provider API responds
  ↓ streaming or batch
Transport parses response
  ↓ converts to standard TransportResult
Provider returns to caller

Benefits:

  • Full control: Complete visibility into requests/responses
  • Minimal dependencies: Only native APIs, smaller bundle
  • Debugging: Easy to inspect exact API communication
  • Flexibility: Can customize behavior per provider

Trade-off: Custom maintenance vs SDK convenience

4. JSON-Based Agent Configuration - Declarative Agent Definitions

The Question: How are agents configured without hardcoding in source code?

The Answer: JSON configuration file (agents.json) defining all agent properties declaratively.

The Architecture:

  1. Configuration Structure

    • Single agents.json file
    • Each agent: key, agent_name, service_provider, model_name, system_prompt
    • Optional: do_stream, available_functions, custom_settings, mcp_servers
    • At baseline: available_functions is an empty array
  2. Configuration Loading

    • AgentConfigManager reads agents.json on startup
    • Provides getAgentConfig(name) lookup
    • Validates required fields
    • Caches in memory for fast access
  3. Runtime Selection

    • WebSocket URL includes agent type: /ws/AIlumina
    • System looks up “AIlumina” in loaded configurations
    • Instantiates with those specific settings
    • Multiple agents, different configurations, same codebase
  4. Configuration-Driven Behavior

    • System prompt shapes agent personality
    • service_provider selects which AI backend
    • model_name chooses specific model
    • do_stream controls streaming behavior
    • available_functions: empty at baseline (no tools yet)
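
A sketch of the configuration shape and loader. The field names come from the description above; the example entry and the AgentConfigManager internals are illustrative assumptions.

import { readFileSync } from "fs";

// Shape of one entry in agents.json (fields taken from the description above).
interface AgentConfig {
  agent_name: string;
  service_provider: "ANTHROPIC" | "OPENAI" | "GOOGLE" | "OLLAMA" | "LMSTUDIO" | "GROQ";
  model_name: string;
  system_prompt: string;
  do_stream?: boolean;
  available_functions?: string[];   // empty array at baseline - no tools yet
}

// Illustrative agents.json entry:
// {
//   "AIlumina": {
//     "agent_name": "AIlumina",
//     "service_provider": "ANTHROPIC",
//     "model_name": "claude-3-5-sonnet-20241022",
//     "system_prompt": "You are AIlumina...",
//     "do_stream": true,
//     "available_functions": []
//   }
// }

class AgentConfigManager {
  private readonly agents: Record<string, AgentConfig>;

  constructor(path = "agents.json") {
    // Read once at startup and cache in memory for fast lookups.
    this.agents = JSON.parse(readFileSync(path, "utf-8"));
  }

  getAgentConfig(name: string): AgentConfig {
    const config = this.agents[name];
    if (!config) throw new Error(`Unknown agent: ${name}`);
    return config;
  }
}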

The Flow:

Server startup
  ↓ reads agents.json
  ↓ loads all agent configurations
Client connects to /ws/AIlumina
  ↓ looks up "AIlumina" in config
Configuration found
  ↓ service_provider: ANTHROPIC
  ↓ model: claude-3-5-sonnet-20241022
  ↓ system_prompt: "You are AIlumina..."
Agent created with those settings

Benefits:

  • Declarative: Agent behavior defined in data, not code
  • Flexible: Change agents without code changes
  • Multiple agents: Single codebase, many configurations
  • Extensible: Ready for future capabilities (available_functions)

5. Natural Interaction Modalities - Flexible Communication

The Question: How does the system support natural, flexible interaction rather than forcing users into a single mode?

The Answer: Four interaction modalities through independent Speech Recognition and Text-to-Speech controls.

The Architecture:

  1. Four Natural Modalities

    • Type + Read: Text in, text out (traditional chat)
    • Speak + Read: Voice in, text out (dictation mode)
    • Type + Listen: Text in, voice out (accessibility/multitasking)
    • Speak + Listen: Voice in, voice out (conversation mode)
  2. Independent Toggle Controls

    • Speech Recognition (SR) toggle - controls input modality
    • Text-to-Speech (TTS) toggle - controls output modality
    • Independent state: SR can be on while TTS off, or vice versa
    • Users switch modes mid-conversation as needs change
  3. Browser-Native APIs

    • Speech Recognition: Web Speech API (window.SpeechRecognition)
    • Text-to-Speech: Speech Synthesis API (window.speechSynthesis)
    • Zero backend configuration required
    • Works entirely in browser
  4. Modality Mixing

    • Start typing, switch to voice input mid-conversation
    • Read responses while working, switch to listening when hands-free
    • Natural transitions match how humans actually communicate
    • System adapts to user’s current context
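
A minimal sketch of the two browser APIs behind the toggles. SpeechRecognition is vendor-prefixed as webkitSpeechRecognition in Chromium browsers; the toggle wiring and the sendToAgent helper are illustrative assumptions, not the project's actual component code.

declare function sendToAgent(text: string): void;   // assumed helper that forwards text to the WebSocket

// Speech Recognition (input modality) - Web Speech API, vendor-prefixed in Chromium.
const SR = (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SR();
recognition.continuous = true;
recognition.onresult = (event: any) => {
  // Forward the latest transcript as user input.
  const result = event.results[event.results.length - 1];
  sendToAgent(result[0].transcript);
};

// Text-to-Speech (output modality) - Speech Synthesis API, built into modern browsers.
function speak(text: string) {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

// Independent toggles: the four modalities are just the combinations of these two flags.
let srEnabled = false;   // Speak (on) vs Type (off)
let ttsEnabled = false;  // Listen (on) vs Read (off); speak() is called only when true

function toggleSpeechRecognition() {
  srEnabled = !srEnabled;
  if (srEnabled) recognition.start();
  else recognition.stop();
}

function toggleSpeechSynthesis() {
  ttsEnabled = !ttsEnabled;
}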

The Flow:

User enables SR + TTS (Speak + Listen mode)
  ↓ speaks: "What's the weather?"
SR captures speech → text
  ↓ sends to AI
AI responds with text
  ↓ TTS synthesizes to speech
User hears response
  ↓ can toggle to Type + Read anytime

Benefits:

  • Natural: Matches human communication flexibility
  • Accessible: Multiple input/output options
  • Context-adaptive: Switch modes as situation changes
  • Zero configuration: Browser APIs, no backend setup

6. SR/TTS Synchronization - Solving the Coordination Challenge

The Question: When both Speech Recognition and Text-to-Speech are active, how do you prevent the AI from hearing itself and coordinate state transitions?

The Answer: Observer pattern with useRef state tracking to manage interdependent lifecycles without React closure issues.

The Challenges & Solutions:

  1. Feedback Prevention

    • Problem: When TTS speaks, SR hears the AI and creates a feedback loop
    • Solution: TTS observer stops SR when utterance starts
    • Implementation: Observer listens to TTS “start” event → pause SR
  2. Seamless Restart

    • Problem: SR must restart automatically after TTS completes
    • Solution: TTS observer restarts SR on utterance “end” event
    • Implementation: Observer listens to “end” → restart SR if enabled
  3. Independent State Control

    • Problem: User can toggle SR/TTS independently mid-conversation
    • Solution: Separate boolean flags in state machine context
    • Implementation: speechRecognitionEnabled, speechSynthesisEnabled tracked separately
  4. Browser SR Auto-Restart

    • Problem: The browser’s Web Speech API auto-restarts SR after roughly every 8 seconds of silence
    • Creates: visible state transitions (listening → ready → listening)
    • Solution: The UI displays stable text (“Speech recognition active…”) while a status indicator shows the actual SR state
    • User Experience: Continuity despite the underlying cycling
  5. React Stale Closures

    • Problem: useEffect closures capture stale state values
    • Impact: A TTS observer that depends on SR state breaks when that state changes
    • Effect cleanup/recreation destroys the TTS synchronization
    • Solution: useRef to track SR state without effect dependencies
    • Implementation: Observer reads current SR state from ref, not closure
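
A minimal sketch of the coordination described above, as a React hook. The useRef pattern mirrors the stated solution; the hook name, parameters, and event wiring are assumptions for illustration.

import { useRef } from "react";

// Sketch: a TTS observer pauses and resumes SR around each utterance, reading the
// SR toggle from a ref so the handlers never capture a stale closure value.
function useSpeechCoordination(
  recognition: { start(): void; stop(): void },   // the Web Speech recognizer
  srEnabled: boolean,                              // current SR toggle from state
) {
  const srEnabledRef = useRef(srEnabled);
  srEnabledRef.current = srEnabled;                // always holds the latest value

  function speak(text: string) {
    const utterance = new SpeechSynthesisUtterance(text);

    utterance.onstart = () => {
      // Feedback prevention: stop listening while the AI is speaking.
      if (srEnabledRef.current) recognition.stop();
    };
    utterance.onend = () => {
      // Seamless restart: resume listening only if the user still has SR enabled.
      if (srEnabledRef.current) recognition.start();
    };

    window.speechSynthesis.speak(utterance);
  }

  return { speak };
}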

The Architecture:

TTS utterance starts
  ↓ TTS observer detects "start" event
  ↓ checks SR ref: is it enabled?
  ↓ if yes: pause SR
AI speaks
  ↓ user hears response
TTS utterance ends
  ↓ TTS observer detects "end" event
  ↓ checks SR ref: is it enabled?
  ↓ if yes: restart SR
SR resumes listening
  ↓ ready for next user input

Benefits:

  • No feedback: AI never hears itself speak
  • Seamless UX: Automatic transitions, no manual intervention
  • Stable: Survives browser SR recycling
  • Independent: Users control SR and TTS separately
  • Robust: No React closure bugs

7. State Machine for Conversation Flow - Deterministic UI State

The Question: How does the UI manage conversation state transitions reliably without race conditions?

The Answer: XState v5 finite state machine with flat hierarchy and explicit transitions.

The Architecture:

  1. Four Core States

    • WAITING: Ready for user input
    • THINKING: Processing request (calling AI)
    • RESPONDING: Streaming/displaying AI response
    • ERROR: Error occurred, can retry
  2. Explicit State Transitions

    • WAITING → THINKING (on SUBMIT_TEXT)
    • THINKING → RESPONDING (on AI_RESPONSE_RECEIVED)
    • RESPONDING → WAITING (on AI_COMPLETE)
    • Any → ERROR (on AI_ERROR)
    • ERROR → THINKING (on retry)
  3. Context for Data

    • messages: conversation history
    • aiResponse: current response being built
    • speechRecognitionEnabled: SR toggle state
    • speechSynthesisEnabled: TTS toggle state
    • Independent boolean flags prevent coupling
  4. Toggle Actions Separate from Flow

    • TOGGLE_SPEECH_RECOGNITION action: update SR flag, don’t change state
    • TOGGLE_SPEECH_SYNTHESIS action: update TTS flag, don’t change state
    • User can toggle modalities without disrupting conversation flow
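
A minimal sketch of such a machine in XState v5. The state names, event names, and context fields come from the list above; the action details are simplified assumptions.

import { createMachine, assign } from "xstate";

// Flat four-state machine; toggle events update context without changing state.
const conversationMachine = createMachine({
  id: "conversation",
  initial: "waiting",
  context: {
    messages: [] as { role: string; content: string }[],
    aiResponse: "",
    speechRecognitionEnabled: false,
    speechSynthesisEnabled: false,
  },
  on: {
    // Modality toggles are independent of conversation flow.
    TOGGLE_SPEECH_RECOGNITION: {
      actions: assign({
        speechRecognitionEnabled: ({ context }) => !context.speechRecognitionEnabled,
      }),
    },
    TOGGLE_SPEECH_SYNTHESIS: {
      actions: assign({
        speechSynthesisEnabled: ({ context }) => !context.speechSynthesisEnabled,
      }),
    },
    AI_ERROR: ".error",   // any state → ERROR
  },
  states: {
    waiting:    { on: { SUBMIT_TEXT: "thinking" } },
    thinking:   { on: { AI_RESPONSE_RECEIVED: "responding" } },
    responding: { on: { AI_COMPLETE: "waiting" } },
    error:      { on: { RETRY: "thinking" } },
  },
});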

The Flow:

State: WAITING
  ↓ user submits message
  ↓ event: SUBMIT_TEXT
State: THINKING
  ↓ AI responds
  ↓ event: AI_RESPONSE_RECEIVED
State: RESPONDING
  ↓ response complete
  ↓ event: AI_COMPLETE
State: WAITING

Benefits:

  • Deterministic: Same event in same state always produces same transition
  • Impossible states prevented: Can’t be THINKING and WAITING simultaneously
  • Testable: State transitions can be unit tested
  • Type-safe: TypeScript ensures events and states are valid
  • Flat hierarchy: Simple to understand, no nested complexity

8. Technology Choices - TypeScript, Bun, Browser APIs

The Question: What technology choices enable rapid, deterministic development?

The Answer: TypeScript for type safety, Bun for performance, browser-native APIs for zero config.

TypeScript Over Python:

  • Type safety: Compile-time validation catches errors before runtime
  • Deterministic behavior: Type constraints enforce correctness
  • Unified language: Same language frontend and backend
  • Superior tooling: IDE support, refactoring, intellisense
  • Insight: Consciousness research requires deterministic operations—TypeScript’s type system provides guarantees dynamic typing cannot

Bun Over Node.js/npm:

  • Test execution: 8-9ms (vs 5-10 seconds with Vitest)
  • Package install: ~1 second (vs 30-60 seconds with npm)
  • Native TypeScript: Direct execution, no transpilation overhead
  • Drop-in replacement: Compatible with Node.js ecosystem
  • Development velocity: Near-instant feedback loops

Browser-Native APIs:

  • Speech Recognition: Web Speech API (zero backend setup)
  • Text-to-Speech: Speech Synthesis API (built into browsers)
  • No dependencies: No npm packages, no backend config
  • Wide support: Chrome, Edge (SR), all modern browsers (TTS)
  • Instant availability: Works out of the box

Benefits:

  • Velocity: Sub-10ms test feedback, sub-second installs
  • Reliability: Type safety catches errors early
  • Simplicity: Browser APIs eliminate backend complexity
  • Determinism: TypeScript enforces correctness at compile-time

Evidence

The implementation exists as runnable code:

Backend:

Frontend: