1. Core Technical Principles
Vision-Language-Action (VLA) Fusion Engine
Leading products (e.g., CocoMate Pro) integrate a VLA architecture similar to Figure AI's Helix, fusing 12MP RGB cameras, 6-axis IMU sensors, and end-to-end speech models (Speech2Speech) into a tight perception-action loop. These systems process natural language commands (e.g., "Pick up the red block") and visual input simultaneously, generating continuous motor control signals for 10+ articulation points (fingers, wrists, torso) with around 0.3s response latency, versus the 1.2s typical of traditional rule-based toy robots.
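To make the loop concrete, a minimal sketch of one perception-action cycle is shown below; the function names, command handling, and dummy sensor data are illustrative assumptions, not any vendor's actual API:
    import time
    from dataclasses import dataclass

    @dataclass
    class MotorCommand:
        joint: str          # e.g. "right_wrist"
        angle_deg: float    # target joint angle

    def vla_policy(frame, imu_reading, utterance):
        """Stand-in for the VLA model: maps fused vision/motion/language inputs
        to a short sequence of motor commands (illustrative logic only)."""
        if "pick up" in utterance.lower():
            return [MotorCommand("right_shoulder", 35.0),
                    MotorCommand("right_fingers", 60.0)]
        return []

    def control_step(frame, imu_reading, utterance, send_command, budget_s=0.3):
        """Run one perception-action cycle and check it against the ~0.3s budget."""
        start = time.monotonic()
        for cmd in vla_policy(frame, imu_reading, utterance):
            send_command(cmd)
        return time.monotonic() - start <= budget_s

    # One cycle with dummy sensor data and a print-based actuator.
    within_budget = control_step(frame=b"", imu_reading=(0.0,) * 6,
                                 utterance="Pick up the red block", send_command=print)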
Emotion-Aware Interaction System
Equipped with multi-modal sentiment analysis (combining voice tone detection, facial expression recognition, and touch intensity sensing), the figures build dynamic "emotion maps" to mimic human-like responses. For example:
A child’s frustrated tone triggers soft LED eye dimming + reassuring phrases (“Let’s try again together”);
Gentle hugging (detected via pressure sensors) activates warmth simulation (37℃ chest heating) + purring vibrations.
Machine learning models are trained on 50,000+ human interaction samples to reduce “robot-like” rigidity.
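A simplified picture of how such an "emotion map" could translate fused signals into a response is sketched below; the thresholds, labels, and response table are invented for illustration:
    def estimate_emotion(voice_tone, facial_expression, touch_pressure):
        """Fuse three modalities into a coarse emotion label (illustrative rules only)."""
        if voice_tone == "frustrated" or facial_expression == "frown":
            return "frustrated"
        if touch_pressure > 0.6:          # normalized 0-1 pressure reading
            return "affectionate"
        return "neutral"

    RESPONSES = {
        "frustrated":   {"leds": "dim_soft", "speech": "Let's try again together."},
        "affectionate": {"chest_heater_c": 37, "vibration": "purr"},
        "neutral":      {"speech": "I'm here if you need me."},
    }

    print(RESPONSES[estimate_emotion("frustrated", "neutral", 0.1)])   # comfort response
    print(RESPONSES[estimate_emotion("calm", "smile", 0.8)])           # warmth + purring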
Modular IP Adaptation Architecture
High-end variants (e.g., IP-Mate X) adopt NFC-enabled interchangeable character shells compatible with licensed IPs (e.g., Ultraman, Hello Kitty). The VLA core automatically adjusts its interaction style to match the IP persona: a superhero figure uses energetic speech patterns and bold gestures, while a fantasy character adopts whimsical movements and character-specific speech.
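The shell-to-persona mapping can be pictured roughly as follows; the NFC tag IDs and persona fields are hypothetical:
    from dataclasses import dataclass

    @dataclass
    class Persona:
        speech_style: str    # style hint passed to the dialogue model
        gesture_set: str     # name of the motion library to load

    # Hypothetical mapping from NFC tag IDs embedded in each licensed shell.
    PERSONAS = {
        "nfc:0xA1": Persona(speech_style="energetic, heroic", gesture_set="bold"),
        "nfc:0xB2": Persona(speech_style="whimsical, playful", gesture_set="gentle"),
    }

    def on_shell_attached(tag_id):
        """Select the interaction persona when a new character shell is detected."""
        return PERSONAS.get(tag_id, Persona("friendly, neutral", "standard"))

    print(on_shell_attached("nfc:0xA1"))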
2. Core Functions and Application Scenarios
Contextual Object Interaction: Uses VLA models to identify 1,000+ common household items (toys, stationery, snacks) and execute targeted actions, e.g., "Stack the blocks into a tower" or "Fetch my keychain". Well suited to preschoolers' fine motor skill development; 2025 trials reported a 63% improvement in hand-eye coordination versus traditional toys.
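As a toy illustration of grounding a spoken command against detected objects and turning it into an action plan (the detection format, confidence threshold, and action names are assumptions):
    # Detected objects as (label, confidence, position) tuples from the vision model.
    detections = [("red block", 0.94, (0.12, 0.30)),
                  ("keychain", 0.88, (0.40, 0.05))]

    def plan_action(command, detections, min_conf=0.8):
        """Match the object named in the command to a detection and emit a pick plan."""
        for label, conf, pos in detections:
            if label in command.lower() and conf >= min_conf:
                return [("move_hand_to", pos), ("grasp", label), ("lift", 0.1)]
        return [("say", "I can't find that yet. Can you point to it?")]

    print(plan_action("Fetch my keychain", detections))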
Personalized Learning Companion: Delivers adaptive educational content via voice + physical demonstration:
Math: “Let’s count the balls—1, 2…” while pointing to objects;
Coding: Teaches basic logic via modular block assembly (e.g., “Press the blue button to make me turn left”).
Used in home learning and kindergarten classrooms; a Singapore survey reported 82% parent satisfaction.
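A minimal sketch of the adaptive selection step, i.e., choosing the next activity from recent answers, is given below; the difficulty levels and thresholds are invented for illustration:
    ACTIVITIES = {
        1: "Let's count the balls together: 1, 2, 3!",
        2: "Can you count backwards from 5?",
        3: "Press the blue button twice, then the red one. What happens?",
    }

    def next_activity(recent_correct, level):
        """Step difficulty up or down based on the share of recent correct answers."""
        rate = sum(recent_correct) / max(len(recent_correct), 1)
        if rate >= 0.8 and level < max(ACTIVITIES):
            level += 1
        elif rate < 0.4 and level > 1:
            level -= 1
        return level, ACTIVITIES[level]

    print(next_activity([1, 1, 1, 0, 1], level=1))   # steps up to level 2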
Emotionally Intelligent Companionship: Detects user mood via voice prosody (e.g., slow speech = sadness) and responds with contextually appropriate behaviors:
Plays calming music + pats user’s hand if stress is detected;
Celebrates achievements with confetti projection + victory chants.
Reported to help alleviate loneliness among Gen Z users: 79% of teen users said they felt "emotionally supported".
Collaborative Multi-Figure Play: Enables 2+ units to execute coordinated tasks via Wi-Fi 6E (e.g., “Build a bridge together”) using distributed VLA processing. AR markers on each figure sync movements, creating immersive role-play scenarios (e.g., space exploration, farm chores) for group activities.
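The coordination can be pictured as one unit splitting a shared task into per-figure roles and broadcasting synchronized steps; in the sketch below the message format and role names are assumptions, and the Wi-Fi transport is omitted:
    import json

    BRIDGE_ROLES = ["hold_left_side", "hold_right_side", "place_deck"]

    def assign_roles(figure_ids, roles=BRIDGE_ROLES):
        """Naively hand out one sub-role per figure for a shared task."""
        return {fid: roles[i % len(roles)] for i, fid in enumerate(figure_ids)}

    def broadcast(task, step, assignments):
        """Serialize one synchronized step per figure (would travel over the local network)."""
        return [json.dumps({"task": task, "step": step, "figure": fid, "role": role})
                for fid, role in assignments.items()]

    assignments = assign_roles(["unit_a", "unit_b"])
    for msg in broadcast("Build a bridge together", step=1, assignments=assignments):
        print(msg)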
Cases: Haivivi's CocoMate series achieved 98% sell-through during the 2025 "618" shopping festival, driven by its IP adaptation feature; a Tokyo nursing home reported a 47% reduction in anxiety among elderly users after introducing emotion-responsive AI dolls.
3. Representative Products in the Market
Entry-Level Interactive Toys
BubblePal VLA: 25cm tall, soft silicone body, 4-hour battery (USB-C charging). Supports 50+ voice commands, basic object recognition, and 3 expression modes. Priced at $79-99, targeting ages 3-6.
Mid-Tier Companion Figures
CocoMate Pro: 30cm tall, 6-hour battery (wireless charging), IP44 water resistance. Integrates end-to-end speech (no separate ASR-TTS pipeline latency), 8 emotion states, and Ultraman IP shells. Priced at $199-249, popular with ages 7-12.
Premium Collectible Models
LOVOT VLA Edition: 35cm tall, 8-hour battery (auxiliary solar charging), thermal skin simulation. Features advanced object manipulation (e.g., pouring water) and electronic health record (EHR) sync for elderly care. Priced at $499-599, designed for collectors and care facilities.
4. Technical Challenges and Trends
Existing Bottlenecks:
Limited object recognition in low light (accuracy drops by 40% below 50 lux) due to compact camera sensors;
Insufficient depth of emotional response: only 32% of users perceived "genuine empathy" in 2025 tests;
High hardware costs: VLA chips account for 60% of production expenses.
Development Directions:
By 2027, integrate multi-modal fusion (adding thermal imaging + tactile sensors) to enhance emotion detection accuracy to 90%+;
Adopt RISC-V low-power chips to reduce energy consumption by 50% while maintaining VLA performance;
Build open IP ecosystems (like Lego’s modular platform) for third-party character/content development.