Halo - Vision Companion
Team of Lightbloom AI engineers (MSc CV, 6+ years of experience) specializing in Deep Learning, Computer Vision, Medical Imaging, and ASR/LLM systems.
Project Description
Halo is a voice-first AI assistant for visually impaired users that combines real-time speech recognition, computer vision, and natural language processing. Users speak naturally (“What’s on this menu?”, “How much money is this?”, “Where’s the door?”) and Halo responds with detailed spoken descriptions using a personalized cloned voice of a family member for comfort and familiarity.
USER PROBLEM: 285 million visually impaired people struggle with reading menus, counting cash, and navigating unfamiliar spaces. Current solutions require expensive hardware or complex apps.
CORE FLOW: User speaks → ElevenLabs STT transcribes → Camera captures scene → OpenAI GPT-4o Vision analyzes (auto-detects menu/cash/scene) → GPT-4o-mini generates response → ElevenLabs TTS speaks in the cloned familiar voice.
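The core flow above can be read as one async function per voice turn. The following is a minimal sketch, not the project's actual code: runTurn and the five service names (transcribe, captureFrame, analyzeImage, respond, speak) are hypothetical, and every service is passed in as a function so the sketch makes no claim about how the real API clients are written.

```javascript
// Hypothetical orchestrator for a single voice turn.
// All services are injected; their names and return shapes are assumptions.
async function runTurn(audioBlob, services) {
  const { transcribe, captureFrame, analyzeImage, respond, speak } = services;

  // 1. Speech to text (ElevenLabs STT in the flow above).
  const question = await transcribe(audioBlob);

  // 2. Grab the current camera frame.
  const frame = await captureFrame();

  // 3. Vision analysis. Assumed to resolve to { kind, description },
  //    where kind is "menu", "cash", or "scene".
  const analysis = await analyzeImage(frame, question);

  // 4. Turn the analysis into a short answer suitable for speech.
  const answer = await respond(question, analysis);

  // 5. Text to speech in the cloned voice.
  await speak(answer);

  return { question, kind: analysis.kind, answer };
}
```

Keeping the steps sequential mirrors the arrows in the flow. Injecting the services also lets each stage be replaced by a stub when testing without network access.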
KEY DEMOS: (1) Menu Reading - reads items, prices, dietary notes (2) Cash Counting - identifies bills, calculates total (3) Scene Navigation - describes obstacles and directions.
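For the cash counting demo, the "calculates total" step is plain arithmetic once the bills are identified. The sketch below assumes the vision step yields a list of { denomination, count } objects in a single currency; that shape, and the names totalCash and describeCash, are illustrative assumptions rather than the project's code.

```javascript
// Assumed input shape: [{ denomination: 20, count: 2 }, ...], whole-number bills.
function totalCash(bills) {
  return bills.reduce((sum, b) => sum + b.denomination * b.count, 0);
}

// Builds a sentence that can be handed to the TTS step.
function describeCash(bills, unit = "dollar") {
  if (bills.length === 0) return "I could not see any bills.";
  const parts = bills.map(
    (b) => `${b.count} ${b.denomination}-${unit} bill${b.count === 1 ? "" : "s"}`
  );
  return `I see ${parts.join(", ")}. Total: ${totalCash(bills)} ${unit}s.`;
}

// Example: two 20s and one 5.
// describeCash([{ denomination: 20, count: 2 }, { denomination: 5, count: 1 }])
// returns "I see 2 20-dollar bills, 1 5-dollar bill. Total: 45 dollars."
```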
JUDGING CRITERIA:
- Working Prototype: Fully functional end-to-end voice pipeline with real-time STT, TTS, vision analysis, and conversational AI.
- Technical Complexity: Multi-modal AI (voice+vision+language), real-time streaming, smart content detection, voice cloning, cross-platform (web/iOS/Android via Capacitor).
- Innovation: Familiar cloned voice reduces anxiety for blind users; zero-learning-curve natural speech interface; context-aware conversation memory.
- Real-World Impact: 285M potential users; enables independence, dignity, and safety for visually impaired individuals.
- Theme Alignment: Agentic AI with autonomous decision-making, multi-modal input/output, conversational memory, and multi-service tool integration.
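The conversation memory mentioned under Innovation could be as simple as a bounded list of past turns that is prepended to each new request. This is a sketch under that assumption: createMemory, maxTurns, and the eviction rule are hypothetical, and only the { role, content } message format is taken from the OpenAI chat API.

```javascript
// Hypothetical bounded chat history. Keeps the last maxTurns exchanges.
function createMemory(systemPrompt, maxTurns = 5) {
  const turns = []; // each entry is [userMessage, assistantMessage]

  return {
    // Record one completed exchange and drop the oldest when over the limit.
    add(userText, assistantText) {
      turns.push([
        { role: "user", content: userText },
        { role: "assistant", content: assistantText },
      ]);
      if (turns.length > maxTurns) turns.shift();
    },

    // Build the message list for the next request.
    messagesFor(userText) {
      return [
        { role: "system", content: systemPrompt },
        ...turns.flat(),
        { role: "user", content: userText },
      ];
    },
  };
}
```

Capping the history keeps request size and latency predictable, which matters for a voice interface where the user is waiting for a spoken reply.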
TECHNOLOGIES:
- APIs: ElevenLabs Scribe (STT), ElevenLabs TTS + Voice Cloning, OpenAI GPT-4o (Vision), OpenAI GPT-4o-mini (conversation)
- Stack: Vite, Vanilla JavaScript, MediaRecorder API, Canvas API, Capacitor (mobile)
- Browser: getUserMedia (mic/camera), Web Audio API
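To show how the Canvas API and GPT-4o fit together, here is a sketch of capturing a frame from a video element and wrapping it in a chat completions request body. The function names, the prompt wording, the JPEG quality, and the max_tokens value are assumptions; the canvas calls and the image_url message format are the standard browser and OpenAI APIs.

```javascript
// Draw the current video frame onto a canvas and return it as a JPEG data URL.
// In the browser, video is an HTMLVideoElement fed by getUserMedia.
function captureFrame(video, canvas, quality = 0.8) {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d").drawImage(video, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL("image/jpeg", quality);
}

// Build the JSON body for POST https://api.openai.com/v1/chat/completions.
// The instruction text is illustrative, not the project's actual prompt.
function buildVisionRequest(dataUrl, question) {
  return {
    model: "gpt-4o",
    max_tokens: 300, // assumed cap to keep spoken answers short
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: `Describe this for a blind user. Question: ${question}` },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  };
}
```

The body would be sent with fetch, a JSON Content-Type header, and an Authorization: Bearer header carrying the OpenAI key.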
SETUP:
git clone https://github.com/Munkhchimeg-Sergelen/halo-vision-companion.git
cd halo-vision-companion && npm install
Create .env with: VITE_ELEVENLABS_API_KEY, VITE_OPENAI_API_KEY, VITE_ELEVENLABS_VOICE_ID
npm run dev → Open http://localhost:5173
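Because the app depends on three .env values, a startup check can fail fast with a clear message when one is missing. The following is a sketch only: readConfig and the returned property names are hypothetical. Vite exposes VITE_-prefixed variables to client code through import.meta.env, so in the app the argument would be that object.

```javascript
// The three variables named in the setup steps.
const REQUIRED_KEYS = [
  "VITE_ELEVENLABS_API_KEY",
  "VITE_OPENAI_API_KEY",
  "VITE_ELEVENLABS_VOICE_ID",
];

// Validate an env-like object and return a typed-looking config.
function readConfig(env) {
  const missing = REQUIRED_KEYS.filter(
    (key) => !env[key] || String(env[key]).trim() === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing .env values: ${missing.join(", ")}`);
  }
  return {
    elevenLabsApiKey: env.VITE_ELEVENLABS_API_KEY,
    openAiApiKey: env.VITE_OPENAI_API_KEY,
    voiceId: env.VITE_ELEVENLABS_VOICE_ID,
  };
}

// In a Vite app this would be called as: readConfig(import.meta.env)
```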
DEMO: Allow mic/camera → Hold button and speak → Release → Hear AI response in cloned voice. Point camera at menu/cash → Click capture → Hear description.
Prior Work
All code was created during the hackathon. We started from a basic boilerplate repository (halo-vision-companion) that provided only the file structure and empty module stubs with TODO comments; it contained no functional code.
During the hackathon, we built from scratch:
- Complete ElevenLabs STT integration (Speech-to-Text using Scribe API)
- Complete ElevenLabs TTS integration (Text-to-Speech with voice cloning)
- Microphone capture using MediaRecorder API
- OpenAI GPT-4o Vision integration for scene/menu/cash analysis
- OpenAI GPT-4o-mini conversational orchestrator
- Menu reader logic (parsing items, prices, categories)
- Cash detector logic (identifying bills, counting totals)
- Accessible web UI
- Mobile deployment setup (Capacitor for iOS/Android)
No pre-existing AI models, trained data, or functional code was used. All API integrations, prompts, and application logic were written during the 3-hour hackathon session.
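As an illustration of the menu reader logic listed above (parsing items, prices, categories), here is one way such a parser might look. It assumes the vision step returns plain text with one entry per line, where a line ending in a price is an item and any other non-empty line is a category heading. That input format and the name parseMenu are assumptions, not the project's implementation.

```javascript
// Parse menu text into [{ category, name, price }].
// Assumed format: "Name .... $12.50" for items, bare text for categories.
function parseMenu(text) {
  const items = [];
  let category = "Uncategorized";
  // Name, optional dots/dashes/spaces, optional "$", then a price at line end.
  const itemPattern = /^(.*?)[\s.\-]*\$?(\d+(?:\.\d{1,2})?)$/;

  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;

    const match = line.match(itemPattern);
    if (match && match[1] !== "") {
      items.push({ category, name: match[1], price: Number(match[2]) });
    } else {
      category = line; // no trailing price: treat as a category heading
    }
  }
  return items;
}
```

A structured JSON response from the model would avoid this kind of text parsing entirely; the line-based version is shown only because it is easy to follow.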