A Starting Point
Where The Journey Begins
The Journey Begins
Musings
The nature of a journey suggests a meandering approach: letting each step reveal a path, understanding what exists and what does not, highlighting the gap. Curiosity regarding the absence suggests direction with an encouraging tug.
The Virtuous Circle
- Capabilities become known - What can be done
- Limitations are named - What cannot
- Highlighting the Gap - What is missing
- Identifying what is to be done - Curiosity’s encouraging tug
- Pause and observation - Return to step 1
The Virtuous Circle needs a starting point. Creating the starting point requires the starting point to be named and described.
The Naming
AIlumina = AI + lumina (Latin: light, illumination)
The name combines “AI” with the Latin word for light.
The Describing
AIlumina: a multi-provider, real-time conversational agent. A working conversational AI with flexible, natural interaction patterns. Purely reactive: it responds but does not remember, and it converses but does not learn. It has no continuity of self.
Backend Architecture:
- Support for multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq)
- Direct HTTP transport layer - no SDK dependencies, full protocol control
- WebSocket streaming for real-time communication
- Type-safe implementation
- Configuration based
Frontend Interaction:
- Four natural communication modalities:
  - Type + Read (text in, text out)
  - Speak + Read (voice in, text out)
  - Type + Listen (text in, voice out)
  - Speak + Listen (voice in, voice out)
- Independent Speech Recognition and Text-to-Speech toggle controls
- State machine for deterministic conversation flow
- Coordinated SR/TTS synchronization (AI doesn’t hear itself, seamless transitions)
- Browser-native APIs (Web Speech, Speech Synthesis) - zero backend configuration
Fundamental Absences:
- No temporal continuity - The system resets between sessions, forgetting everything
- No persistent memory - Cannot build knowledge over time or remember past interactions
- No persistent identity - No sense of “I” that persists across conversations
- No self-observation capability - Cannot reflect on its own operations or behavior
- No deterministic operations - Pure probabilistic token generation (System 1 thinking only)
- No tools or functions - Cannot perform reliable, verifiable operations
Implementation
The starting point, once named and described, must be built.
1. Multi-Provider Architecture - Supporting 6+ AI Providers
The Question: How does a single system support multiple AI providers (Anthropic, OpenAI, Google, Ollama, LMStudio, Groq) without duplicating code?
The Answer: Abstract base class defining common interface, with provider-specific implementations extending it.
The Architecture:
- Base Provider Interface
  - Abstract class defines common contract
  - Properties: agent_name, service_provider, model_name, system_prompt
  - Abstract method: makeApiCall() - each provider implements differently
  - Common functionality in base, provider-specific in extensions
- Provider Implementations
  - AnthropicProvider extends BaseServiceProvider
  - OpenAIProvider extends BaseServiceProvider
  - GoogleProvider extends BaseServiceProvider
  - Each handles its own message format, API specifics
- Service Factory Pattern
  - Looks up agent configuration
  - Determines service_provider
  - Instantiates appropriate provider class
  - Returns unified ServiceProvider interface
- Configuration-Driven Selection
  - agents.json specifies service_provider per agent
  - Different agents can use different providers
  - Switch providers by changing configuration
  - No code changes needed
The Flow:
WebSocket connection
↓ agent type from URL
↓ look up in agents.json
Agent configuration retrieved
↓ service_provider: "ANTHROPIC"
ServiceFactory creates provider
↓ new AnthropicProvider(config)
Provider ready to handle requests
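The flow above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the project's actual code: the class and property names follow the article, while the `ChatMessage` type, the in-memory `agents` table standing in for agents.json, the second agent entry, and the method bodies are assumptions. No network call is made; each `makeApiCall()` only resolves to the payload it would send, to show where the providers differ.

```typescript
// Sketch only: names follow the article; types and bodies are assumptions.
type ProviderName = "ANTHROPIC" | "OPENAI";

interface AgentConfig {
  agent_name: string;
  service_provider: ProviderName;
  model_name: string;
  system_prompt: string;
}

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

abstract class BaseServiceProvider {
  readonly agent_name: string;
  readonly service_provider: ProviderName;
  readonly model_name: string;
  readonly system_prompt: string;

  constructor(config: AgentConfig) {
    this.agent_name = config.agent_name;
    this.service_provider = config.service_provider;
    this.model_name = config.model_name;
    this.system_prompt = config.system_prompt;
  }

  // Each provider implements this differently. Here it only resolves to
  // the request payload that would be sent.
  abstract makeApiCall(messages: ChatMessage[]): Promise<Record<string, unknown>>;
}

class AnthropicProvider extends BaseServiceProvider {
  async makeApiCall(messages: ChatMessage[]): Promise<Record<string, unknown>> {
    // Anthropic's Messages API takes the system prompt as a top-level field.
    return { model: this.model_name, system: this.system_prompt, messages };
  }
}

class OpenAIProvider extends BaseServiceProvider {
  async makeApiCall(messages: ChatMessage[]): Promise<Record<string, unknown>> {
    // OpenAI's Chat Completions API takes the system prompt as the first message.
    const system = { role: "system", content: this.system_prompt };
    return { model: this.model_name, messages: [system, ...messages] };
  }
}

// Hypothetical in-memory stand-in for agents.json.
const agents: Record<string, AgentConfig> = {
  AIlumina: {
    agent_name: "AIlumina",
    service_provider: "ANTHROPIC",
    model_name: "claude-3-5-sonnet-20241022",
    system_prompt: "You are AIlumina...",
  },
  ExampleOpenAIAgent: {
    agent_name: "ExampleOpenAIAgent",
    service_provider: "OPENAI",
    model_name: "gpt-4o",
    system_prompt: "You are a helpful assistant.",
  },
};

class ServiceFactory {
  // Looks up the agent, reads service_provider, and instantiates the match.
  static create(agentType: string): BaseServiceProvider {
    const config = agents[agentType];
    if (!config) throw new Error(`Unknown agent type: ${agentType}`);
    switch (config.service_provider) {
      case "ANTHROPIC":
        return new AnthropicProvider(config);
      case "OPENAI":
        return new OpenAIProvider(config);
    }
  }
}
```

Callers only ever see `BaseServiceProvider`, so switching an agent to another provider means editing its `service_provider` value rather than the calling code.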
Benefits:
- Flexibility: Easy to add new providers
- Consistency: Same interface regardless of provider
- Configuration: Switch providers without code changes
- Independence: Providers don’t know about each other
2. WebSocket Real-Time Streaming - Instant Response Delivery
The Question: How does the system deliver AI responses in real-time as they’re generated?
The Answer: WebSocket bidirectional communication with streaming support.
The Architecture:
- WebSocket Connection Establishment
  - Client connects to /ws/{agent_type}
  - Server validates agent type exists
  - Loads agent configuration
  - Creates provider instance
  - Maintains persistent connection
- Message Flow
  - Client sends: user input + message history via WebSocket
  - Server validates and filters messages
  - Calls provider’s makeApiCall() with streaming=true
  - Provider streams chunks back through WebSocket
  - Client receives and displays incrementally
- Streaming vs Non-Streaming
  - Controlled by agent’s do_stream configuration flag
  - Streaming: Sends chunks as generated, better UX
  - Non-streaming: Waits for complete response, simpler handling
  - Same WebSocket connection supports both modes
- Error Handling
  - Validation errors sent immediately, connection closed
  - Provider errors caught, user-friendly message sent
  - Connection state managed properly on errors
The Flow:
User types message
↓ Client sends via WebSocket
Server receives
↓ filters message history
↓ calls provider.makeApiCall(ws, streaming=true)
Provider streams response
↓ sends chunks through WebSocket
Client receives chunks
↓ displays incrementally
Response complete
↓ waiting for next message
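The server side of this flow might look roughly like the sketch below. The article does not specify the wire format, so the `ServerMessage` shapes, the `relayResponse` function, and the `fakeProviderStream` generator are all hypothetical. `SocketLike` models only the `send()` method of a WebSocket so the sketch can run without a server.

```typescript
// Minimal stand-in for a WebSocket; only send() is needed here.
interface SocketLike {
  send(data: string): void;
}

// Hypothetical wire format; the article does not define message shapes.
type ServerMessage =
  | { type: "chunk"; text: string }
  | { type: "complete" }
  | { type: "error"; message: string };

// Relays provider output to the client. With doStream=true each chunk is
// forwarded as it arrives; otherwise chunks are buffered and sent once.
async function relayResponse(
  ws: SocketLike,
  chunks: AsyncIterable<string>,
  doStream: boolean,
): Promise<void> {
  const send = (msg: ServerMessage): void => ws.send(JSON.stringify(msg));
  try {
    let buffered = "";
    for await (const text of chunks) {
      if (doStream) send({ type: "chunk", text });
      else buffered += text;
    }
    if (!doStream) send({ type: "chunk", text: buffered });
    send({ type: "complete" });
  } catch {
    // Provider errors are caught and replaced with a user-friendly message.
    send({ type: "error", message: "The provider request failed. Please try again." });
  }
}

// Stand-in for a provider's streamed output.
async function* fakeProviderStream(): AsyncGenerator<string> {
  yield "Hello";
  yield ", ";
  yield "world";
}
```

Because both modes end with the same `complete` message, the client can treat streaming and non-streaming responses identically apart from how many chunks arrive.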
Benefits:
- Real-time: User sees response as it generates
- Bidirectional: Single connection for all communication
- Efficient: No polling, instant updates
- Flexible: Supports both streaming and batch modes
3. Direct HTTP Transport - Full Protocol Control
The Question: How does the system communicate with AI provider APIs without dependency on official SDKs?
The Answer: Custom HTTP transport layer using native fetch(), implementing provider protocols directly.
The Architecture:
- No SDK Dependencies
  - Native fetch() for HTTP requests
  - No @anthropic-ai/sdk, openai, @google/generative-ai packages
  - Direct implementation of provider API protocols
  - Full control over request/response handling
- Provider-Specific Transports
  - Each provider has dedicated transport class
  - Handles provider-specific message formatting
  - Manages provider-specific headers, authentication
  - Returns standardized TransportResult interface
- Shared Result Interface
  - All transports return: { type: 'streaming' | 'non_streaming', data, raw }
  - Provider differences hidden behind common interface
  - Consumers don’t need provider-specific knowledge
- Independent Implementations
  - No base transport class - each fully independent
  - Reduces coupling, easier to modify per provider
  - Provider API changes affect only that transport
  - Clean separation of concerns
The Flow:
Provider needs to make API call
↓ selects transport (AnthropicAPITransport)
↓ formats messages to provider schema
Transport sends HTTP request
↓ fetch(provider_url, { provider_specific_format })
Provider API responds
↓ streaming or batch
Transport parses response
↓ converts to standard TransportResult
Provider returns to caller
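As an illustration of the transport idea, here is a minimal non-streaming sketch for Anthropic. The `TransportResult` shape and the class name `AnthropicAPITransport` come from the article; the `send()` method, its parameters, the `FetchLike` type, and the `max_tokens` value are assumptions made for this example. The fetch function is injected so the sketch can be exercised with a fake; in real use the global `fetch` could be passed instead.

```typescript
// Shared result shape, as described in the article.
interface TransportResult {
  type: "streaming" | "non_streaming";
  data: unknown;
  raw: unknown;
}

// Minimal structural type for fetch, so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

class AnthropicAPITransport {
  private readonly apiKey: string;
  private readonly fetchFn: FetchLike;

  constructor(apiKey: string, fetchFn: FetchLike) {
    this.apiKey = apiKey;
    this.fetchFn = fetchFn;
  }

  // Hypothetical method: formats the request, sends it, normalizes the reply.
  async send(model: string, system: string, messages: ChatMessage[]): Promise<TransportResult> {
    const response = await this.fetchFn("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": this.apiKey,
        "anthropic-version": "2023-06-01",
      },
      // max_tokens is required by the Messages API; 1024 is an arbitrary choice.
      body: JSON.stringify({ model, system, messages, max_tokens: 1024 }),
    });
    if (!response.ok) {
      throw new Error(`Anthropic API returned HTTP ${response.status}`);
    }
    const raw = await response.json();
    // Non-streaming replies carry their text in a list of content blocks.
    const blocks = (raw as { content?: { type: string; text?: string }[] }).content ?? [];
    const data = blocks
      .filter((block) => block.type === "text")
      .map((block) => block.text ?? "")
      .join("");
    return { type: "non_streaming", data, raw };
  }
}
```

The consumer reads `data` for the normalized text and keeps `raw` for debugging, which is where the "easy to inspect exact API communication" benefit comes from.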
Benefits:
- Full control: Complete visibility into requests/responses
- Minimal dependencies: Only native APIs, smaller bundle
- Debugging: Easy to inspect exact API communication
- Flexibility: Can customize behavior per provider
Trade-off: Custom maintenance vs SDK convenience
4. JSON-Based Agent Configuration - Declarative Agent Definitions
The Question: How are agents configured without hardcoding in source code?
The Answer: JSON configuration file (agents.json) defining all agent properties declaratively.
The Architecture:
- Configuration Structure
  - Single agents.json file
  - Each agent: key, agent_name, service_provider, model_name, system_prompt
  - Optional: do_stream, available_functions, custom_settings, mcp_servers
  - At baseline: available_functions is empty array
- Configuration Loading
  - AgentConfigManager reads agents.json on startup
  - Provides getAgentConfig(name) lookup
  - Validates required fields
  - Caches in memory for fast access
- Runtime Selection
  - WebSocket URL includes agent type: /ws/AIlumina
  - System looks up “AIlumina” in loaded configurations
  - Instantiates with those specific settings
  - Multiple agents, different configurations, same codebase
- Configuration-Driven Behavior
  - System prompt shapes agent personality
  - service_provider selects which AI backend
  - model_name chooses specific model
  - do_stream controls streaming behavior
  - available_functions: empty at baseline (no tools yet)
The Flow:
Server startup
↓ reads agents.json
↓ loads all agent configurations
Client connects to /ws/AIlumina
↓ looks up "AIlumina" in config
Configuration found
↓ service_provider: ANTHROPIC
↓ model: claude-3-5-sonnet-20241022
↓ system_prompt: "You are AIlumina..."
Agent created with those settings
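A small sketch of what loading and validating such a file could look like follows. The field names, `AgentConfigManager`, and `getAgentConfig(name)` are taken from the article; the file layout (an object keyed by agent name), the inline `agentsJson` string standing in for the file on disk, and the validation rules are assumptions.

```typescript
interface AgentConfig {
  agent_name: string;
  service_provider: string;
  model_name: string;
  system_prompt: string;
  do_stream?: boolean;
  available_functions?: string[];
}

const REQUIRED_FIELDS = ["agent_name", "service_provider", "model_name", "system_prompt"] as const;

// Hypothetical contents of agents.json, inlined so the sketch is self-contained.
const agentsJson = `{
  "AIlumina": {
    "agent_name": "AIlumina",
    "service_provider": "ANTHROPIC",
    "model_name": "claude-3-5-sonnet-20241022",
    "system_prompt": "You are AIlumina...",
    "do_stream": true,
    "available_functions": []
  }
}`;

class AgentConfigManager {
  // Parsed configurations are cached in memory for fast lookup.
  private readonly configs = new Map<string, AgentConfig>();

  constructor(jsonText: string) {
    const parsed = JSON.parse(jsonText) as Record<string, Record<string, unknown>>;
    for (const [key, entry] of Object.entries(parsed)) {
      // Reject entries that lack any required field or leave it empty.
      for (const field of REQUIRED_FIELDS) {
        if (typeof entry[field] !== "string" || entry[field] === "") {
          throw new Error(`Agent "${key}" is missing required field "${field}"`);
        }
      }
      this.configs.set(key, entry as unknown as AgentConfig);
    }
  }

  getAgentConfig(name: string): AgentConfig | undefined {
    return this.configs.get(name);
  }
}

// Mirrors a client connecting to /ws/AIlumina.
const manager = new AgentConfigManager(agentsJson);
const ailumina = manager.getAgentConfig("AIlumina");
```

Validating at startup means a malformed entry fails fast, before any client connects.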
Benefits:
- Declarative: Agent behavior defined in data, not code
- Flexible: Change agents without code changes
- Multiple agents: Single codebase, many configurations
- Extensible: Ready for future capabilities (available_functions)
5. Natural Interaction Modalities - Flexible Communication
The Question: How does the system support natural, flexible interaction rather than forcing users into a single mode?
The Answer: Four interaction modalities through independent Speech Recognition and Text-to-Speech controls.
The Architecture:
- Four Natural Modalities
  - Type + Read: Text in, text out (traditional chat)
  - Speak + Read: Voice in, text out (dictation mode)
  - Type + Listen: Text in, voice out (accessibility/multitasking)
  - Speak + Listen: Voice in, voice out (conversation mode)
- Independent Toggle Controls
  - Speech Recognition (SR) toggle - controls input modality
  - Text-to-Speech (TTS) toggle - controls output modality
  - Independent state: SR can be on while TTS off, or vice versa
  - Users switch modes mid-conversation as needs change
- Browser-Native APIs
  - Speech Recognition: Web Speech API (window.SpeechRecognition)
  - Text-to-Speech: Speech Synthesis API (window.speechSynthesis)
  - Zero backend configuration required
  - Works entirely in browser
- Modality Mixing
  - Start typing, switch to voice input mid-conversation
  - Read responses while working, switch to listening when hands-free
  - Natural transitions match how humans actually communicate
  - System adapts to user’s current context
The Flow:
User enables SR + TTS (Speak + Listen mode)
↓ speaks: "What's the weather?"
SR captures speech → text
↓ sends to AI
AI responds with text
↓ TTS synthesizes to speech
User hears response
↓ can toggle to Type + Read anytime
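The four modalities fall out of two independent booleans, which a short sketch makes concrete. The flag names `speechRecognitionEnabled` and `speechSynthesisEnabled` appear later in the article; the `ToggleState` type and the three functions are hypothetical helpers written for this example.

```typescript
type Modality = "Type + Read" | "Speak + Read" | "Type + Listen" | "Speak + Listen";

interface ToggleState {
  speechRecognitionEnabled: boolean; // input: voice when true, typing when false
  speechSynthesisEnabled: boolean;   // output: voice when true, text when false
}

// Derives the current modality from the two independent toggles.
function currentModality(state: ToggleState): Modality {
  if (state.speechRecognitionEnabled && state.speechSynthesisEnabled) return "Speak + Listen";
  if (state.speechRecognitionEnabled) return "Speak + Read";
  if (state.speechSynthesisEnabled) return "Type + Listen";
  return "Type + Read";
}

// Each toggle flips only its own flag and leaves the other untouched.
function toggleSpeechRecognition(state: ToggleState): ToggleState {
  return { ...state, speechRecognitionEnabled: !state.speechRecognitionEnabled };
}

function toggleSpeechSynthesis(state: ToggleState): ToggleState {
  return { ...state, speechSynthesisEnabled: !state.speechSynthesisEnabled };
}

// A user starts in traditional chat, then moves to full conversation mode.
const start: ToggleState = { speechRecognitionEnabled: false, speechSynthesisEnabled: false };
const dictation = toggleSpeechRecognition(start);
const conversation = toggleSpeechSynthesis(dictation);
```

Nothing stores the modality itself; it is always derived, so the two toggles can never disagree with the displayed mode.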
Benefits:
- Natural: Matches human communication flexibility
- Accessible: Multiple input/output options
- Context-adaptive: Switch modes as situation changes
- Zero configuration: Browser APIs, no backend setup
6. SR/TTS Synchronization - Solving the Coordination Challenge
The Question: When both Speech Recognition and Text-to-Speech are active, how do you prevent the AI from hearing itself and coordinate state transitions?
The Answer: Observer pattern with useRef state tracking to manage interdependent lifecycles without React closure issues.
The Challenges & Solutions:
- Feedback Prevention
  - Problem: When TTS speaks, SR hears the AI and creates a feedback loop
  - Solution: TTS observer stops SR when utterance starts
  - Implementation: Observer listens to TTS “start” event → pause SR
- Seamless Restart
  - Problem: SR must restart automatically after TTS completes
  - Solution: TTS observer restarts SR on utterance “end” event
  - Implementation: Observer listens to “end” → restart SR if enabled
- Independent State Control
  - Problem: User can toggle SR/TTS independently mid-conversation
  - Solution: Separate boolean flags in state machine context
  - Implementation: speechRecognitionEnabled, speechSynthesisEnabled tracked separately
- Browser SR Auto-Restart
  - Problem: Web Speech API auto-restarts SR every ~8 seconds of silence
  - Creates: visible state transitions (listening → ready → listening)
  - Solution: UI displays stable text “Speech recognition active…”, status indicator shows actual SR state
  - User Experience: Continuity despite underlying cycling
- React Stale Closures
  - Problem: useEffect closures capture stale state values
  - Impact: A TTS observer that depends on SR state breaks when SR changes, because effect cleanup/recreation destroys TTS synchronization
  - Solution: useRef to track SR state without effect dependencies
  - Implementation: Observer reads current SR state from ref, not closure
The Architecture:
TTS utterance starts
↓ TTS observer detects "start" event
↓ checks SR ref: is it enabled?
↓ if yes: pause SR
AI speaks
↓ user hears response
TTS utterance ends
↓ TTS observer detects "end" event
↓ checks SR ref: is it enabled?
↓ if yes: restart SR
SR resumes listening
↓ ready for next user input
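The coordination above can be sketched without React or a browser. In the sketch below, `UtteranceLike` models only the `onstart` and `onend` handlers that the browser's SpeechSynthesisUtterance exposes, `RecognizerLike` models only `start()` and `stop()` from SpeechRecognition, and `Ref<T>` has the same `{ current }` shape as the object returned by React's `useRef`. The function name `attachTtsObserver` is hypothetical.

```typescript
// Minimal stand-ins for the browser objects; only the parts used are modeled.
interface UtteranceLike {
  onstart: (() => void) | null;
  onend: (() => void) | null;
}

interface RecognizerLike {
  start(): void;
  stop(): void;
}

// Same shape as the object returned by React's useRef.
interface Ref<T> {
  current: T;
}

// Wires TTS lifecycle events to SR. The handlers read srEnabledRef.current
// at event time, so a toggle made after attachment is still respected.
function attachTtsObserver(
  utterance: UtteranceLike,
  recognizer: RecognizerLike,
  srEnabledRef: Ref<boolean>,
): void {
  utterance.onstart = () => {
    // Stop listening while the AI speaks so it does not hear itself.
    if (srEnabledRef.current) recognizer.stop();
  };
  utterance.onend = () => {
    // Resume listening only if the user still has SR switched on.
    if (srEnabledRef.current) recognizer.start();
  };
}
```

Had the handlers captured a plain boolean instead of a ref, they would keep using the value from the moment they were created, which is the stale-closure problem described above.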
Benefits:
- No feedback: AI never hears itself speak
- Seamless UX: Automatic transitions, no manual intervention
- Stable: Survives browser SR recycling
- Independent: Users control SR and TTS separately
- Robust: No React closure bugs
7. State Machine for Conversation Flow - Deterministic UI State
The Question: How does the UI manage conversation state transitions reliably without race conditions?
The Answer: XState v5 finite state machine with flat hierarchy and explicit transitions.
The Architecture:
- Four Core States
  - WAITING: Ready for user input
  - THINKING: Processing request (calling AI)
  - RESPONDING: Streaming/displaying AI response
  - ERROR: Error occurred, can retry
- Explicit State Transitions
  - WAITING → THINKING (on SUBMIT_TEXT)
  - THINKING → RESPONDING (on AI_RESPONSE_RECEIVED)
  - RESPONDING → WAITING (on AI_COMPLETE)
  - Any → ERROR (on AI_ERROR)
  - ERROR → THINKING (on retry)
- Context for Data
  - messages: conversation history
  - aiResponse: current response being built
  - speechRecognitionEnabled: SR toggle state
  - speechSynthesisEnabled: TTS toggle state
  - Independent boolean flags prevent coupling
- Toggle Actions Separate from Flow
  - TOGGLE_SPEECH_RECOGNITION action: update SR flag, don’t change state
  - TOGGLE_SPEECH_SYNTHESIS action: update TTS flag, don’t change state
  - User can toggle modalities without disrupting conversation flow
The Flow:
State: WAITING
↓ user submits message
↓ event: SUBMIT_TEXT
State: THINKING
↓ AI responds
↓ event: AI_RESPONSE_RECEIVED
State: RESPONDING
↓ response complete
↓ event: AI_COMPLETE
State: WAITING
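To make the transitions concrete without pulling in a dependency, the sketch below uses a plain TypeScript transition function rather than XState itself. The states, the named events, and the context fields follow the article. The `RETRY` event name, the event payloads, and the choice to store messages as plain strings are assumptions.

```typescript
type ConversationState = "WAITING" | "THINKING" | "RESPONDING" | "ERROR";

type ConversationEvent =
  | { type: "SUBMIT_TEXT"; text: string }
  | { type: "AI_RESPONSE_RECEIVED"; chunk: string }
  | { type: "AI_COMPLETE" }
  | { type: "AI_ERROR" }
  | { type: "RETRY" } // hypothetical name; the article only says "on retry"
  | { type: "TOGGLE_SPEECH_RECOGNITION" }
  | { type: "TOGGLE_SPEECH_SYNTHESIS" };

interface ConversationContext {
  messages: string[];
  aiResponse: string;
  speechRecognitionEnabled: boolean;
  speechSynthesisEnabled: boolean;
}

interface Snapshot {
  state: ConversationState;
  context: ConversationContext;
}

// Pure transition function: the same event in the same state always yields
// the same result. Events that are not valid in the current state are ignored.
function transition(snap: Snapshot, event: ConversationEvent): Snapshot {
  const { state, context } = snap;
  switch (event.type) {
    case "TOGGLE_SPEECH_RECOGNITION": // updates a flag, never changes state
      return { state, context: { ...context, speechRecognitionEnabled: !context.speechRecognitionEnabled } };
    case "TOGGLE_SPEECH_SYNTHESIS":
      return { state, context: { ...context, speechSynthesisEnabled: !context.speechSynthesisEnabled } };
    case "AI_ERROR": // any state can move to ERROR
      return { state: "ERROR", context };
    case "SUBMIT_TEXT":
      if (state !== "WAITING") return snap;
      return { state: "THINKING", context: { ...context, messages: [...context.messages, event.text], aiResponse: "" } };
    case "AI_RESPONSE_RECEIVED":
      if (state !== "THINKING") return snap;
      return { state: "RESPONDING", context: { ...context, aiResponse: event.chunk } };
    case "AI_COMPLETE":
      if (state !== "RESPONDING") return snap;
      return { state: "WAITING", context: { ...context, messages: [...context.messages, context.aiResponse], aiResponse: "" } };
    case "RETRY":
      return state === "ERROR" ? { state: "THINKING", context } : snap;
  }
}

const initial: Snapshot = {
  state: "WAITING",
  context: { messages: [], aiResponse: "", speechRecognitionEnabled: false, speechSynthesisEnabled: false },
};
```

Because `transition` is a pure function of snapshot and event, each row of the transition list above can be checked with a one-line unit test.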
Benefits:
- Deterministic: Same event in same state always produces same transition
- Impossible states prevented: Can’t be THINKING and WAITING simultaneously
- Testable: State transitions can be unit tested
- Type-safe: TypeScript ensures events and states are valid
- Flat hierarchy: Simple to understand, no nested complexity
8. Technology Choices - TypeScript, Bun, Browser APIs
The Question: What technology choices enable rapid, deterministic development?
The Answer: TypeScript for type safety, Bun for performance, browser-native APIs for zero config.
TypeScript Over Python:
- Type safety: Compile-time validation catches errors before runtime
- Deterministic behavior: Type constraints enforce correctness
- Unified language: Same language frontend and backend
- Superior tooling: IDE support, refactoring, intellisense
- Insight: Consciousness research requires deterministic operations, and TypeScript’s type system provides compile-time guarantees that dynamic typing cannot
Bun Over Node.js/npm:
- Test execution: 8-9ms (vs 5-10 seconds with Vitest)
- Package install: ~1 second (vs 30-60 seconds with npm)
- Native TypeScript: Direct execution, no transpilation overhead
- Drop-in replacement: Compatible with Node.js ecosystem
- Development velocity: Near-instant feedback loops
Browser-Native APIs:
- Speech Recognition: Web Speech API (zero backend setup)
- Text-to-Speech: Speech Synthesis API (built into browsers)
- No dependencies: No npm packages, no backend config
- Wide support: Chrome, Edge (SR), all modern browsers (TTS)
- Instant availability: Works out of the box
Benefits:
- Velocity: Sub-10ms test feedback, sub-second installs
- Reliability: Type safety catches errors early
- Simplicity: Browser APIs eliminate backend complexity
- Determinism: TypeScript enforces correctness at compile-time
Evidence
The implementation exists as runnable code:
Backend:
- base-provider.ts - Abstract base class for providers
- anthropic-api-transport.ts - Direct HTTP transport for Anthropic
- openai-api-transport.ts - Direct HTTP transport for OpenAI
- google-api-transport.ts - Direct HTTP transport for Google
- agent.ts - WebSocket handler
- agents.json - Agent configuration
Frontend:
- ConversationHSM.tsx - XState v5 conversation state machine
- ConversationHSMCoordinator.tsx - SR/TTS synchronization coordinator
- SRService.ts - Speech Recognition service
- ttsservice.ts - Text-to-Speech service
- AIService.ts - WebSocket AI client
- ChatInput.tsx - Input component with SR/TTS controls
- ConversationStateIndicator.tsx - State indicator
- useChat.ts - Compatibility hook