Portable Animatronic Fish

OpenMouth AI

This project reimagines the classic 1998 Big Mouth Billy Bass for 2026: a smartphone-controlled animatronic fish that plays soundboard clips, speaks typed phrases, or delivers lightweight GPT responses, with mouth, body, and tail motion synchronized through a custom audio-to-motion tuning pipeline.

Raspberry Pi 4B · HiFiBerry DAC+ ADC · 2× DRV8833 · Phone hotspot uplink · Portable 80/20 frame

Mode 01

Soundboard

Phone buttons trigger stored WAV or MP3 clips on the Pi. This is the fastest path to a reliable demo and the best first build target.

Mode 02

Text-to-Speech

Typed phrases are converted to speech, saved as audio, analyzed for mouth motion, and played through the fish speakers.

Mode 03

Lightweight GPT

The Pi connects through the phone hotspot, calls a lightweight GPT model, converts the response to speech, and animates the fish from the generated waveform.

Architecture at a glance

The phone is both the remote control and the internet source. The Pi remains the local brain responsible for audio playback, motion extraction, and motor control.

Smartphone: Browser UI · Phone hotspot · Soundboard / text / GPT prompt

Raspberry Pi 4B: FastAPI app · Audio job queue · TTS + GPT clients

Audio Path: WAV/MP3 playback · HiFiBerry DAC+ ADC · Lightweight speakers

Motion Path: Speech-band envelope · Onset pulses · Motion JSON timeline

Gemmy Fish: Mouth motor · Body motor · Tail motor

Hardware stack

Subsystem	Selected hardware
Compute	Raspberry Pi 4B, 2GB
Audio I/O	HiFiBerry DAC+ ADC, primarily used for DAC output in this architecture
Motor control	Two DRV8833 dual H-bridge motor drivers with fault/current-protection capability
Motors	Existing Gemmy fish mouth, front body, and tail DC motors
Speakers	Two lightweight powered speakers or compact amplifier + passive speakers
Power	USB-C PD power bank for Pi/audio; separate small battery pack for fish motors
Structure	80/20 aluminum extrusion frame for portable mounting and cable management

Software stack

Layer	Downselected choice
Web server	FastAPI for phone UI routes, status, and future WebSocket control
Laptop simulation	Anaconda environment with ASCII digital twin, MP3/WAV input, and motion JSON export
Audio engine	ALSA-backed WAV playback through the HiFiBerry output
TTS	Piper as the default local TTS path; cloud TTS as an optional premium path
GPT	Lightweight cloud model accessed through phone hotspot internet
Motor API	gpiozero first; pigpio fallback if PWM timing jitter becomes visible
Startup	systemd service to boot directly into fish-control mode

Audio-to-motion pipeline

The build treats soundboard clips, TTS, and GPT speech as the same downstream object: an audio file with a motion timeline.

Input command → load or generate audio file → compute speech-band envelope → detect speech onsets, phrase starts, and strong accents → assign mouth/body/tail pulse timeline → clamp pulse widths and enforce recovery time → start audio playback and motor timeline together

Motor channel map

Two dual H-bridge DRV8833 boards provide four motor outputs. Three are used for the fish, leaving one spare channel for future lighting, a second mouth action, or an auxiliary prop. Each axis is driven as a short pulse into the original spring-return cam mechanism, not as a continuously held servo axis.

DRV8833 #1
  Channel A → mouth motor
  Channel B → front body/head motor
DRV8833 #2
  Channel A → tail motor
  Channel B → spare
Default command
  forward pulse → coast/off → spring return
  Do not command continuous hold against endstop

Mechanical protection layer

The stock Gemmy axes are not servo-controlled. Each motor is treated as a small DC actuator pushing a cam or linkage into motion, then released so the original spring can return the mechanism home.

Protection rule	Design intent
Pulse, do not hold	Command short forward pulses only. Avoid continuous drive into mechanical endstops.
Coast after each pulse	Set both DRV8833 inputs low after each command so the motor is off and the spring can return the axis.
Max-on watchdog	Every motor command is clamped in software even if a higher-level animation request is wrong.
Minimum off-time	Allow mechanical recovery before the next pulse, especially for the mouth axis during speech.
Fault-aware control	Use DRV8833 nFAULT, if exposed, as a diagnostic signal. A fault means the pulse/PWM/current limit is too aggressive.
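
The max-on watchdog and minimum off-time rules can be sketched as a small per-motor guard. This is a minimal illustration, not the project's runtime: the class name PulseGuard and the 120 ms / 80 ms limits are hypothetical placeholders for values found during bring-up.

```python
from dataclasses import dataclass, field


@dataclass
class PulseGuard:
    """Per-motor safety clamp: caps on-time and enforces a minimum off-time."""
    max_on_ms: float   # hypothetical limit, to be measured during bring-up
    min_off_ms: float  # hypothetical recovery time for the spring return
    _ready_at_ms: float = field(default=0.0, init=False)

    def request(self, now_ms, pulse_ms):
        """Return the pulse length to actually drive, or 0.0 to skip it."""
        if pulse_ms <= 0 or now_ms < self._ready_at_ms:
            return 0.0                            # still recovering: drop it
        granted = min(pulse_ms, self.max_on_ms)   # max-on watchdog
        self._ready_at_ms = now_ms + granted + self.min_off_ms
        return granted


mouth = PulseGuard(max_on_ms=120, min_off_ms=80)
print(mouth.request(now_ms=0, pulse_ms=500))    # clamped to 120
print(mouth.request(now_ms=150, pulse_ms=60))   # 0.0, inside the off-time window
print(mouth.request(now_ms=200, pulse_ms=60))   # 60, recovery has finished
```

Dropping a too-early request, rather than queueing it, keeps the guard stateless apart from one timestamp and ensures a bad animation request can never stack pulses back to back.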

DRV8833 pulse policy

The DRV8833 improves the electrical safety envelope, but it does not turn the fish into a closed-loop servo. Current protection is a backstop; safe pulse choreography is the primary protection.

Forward pulse
  IN1 = PWM
  IN2 = LOW
  wait pulse_ms
Coast / release
  IN1 = LOW
  IN2 = LOW
Initial tuning targets
  mouth: 50–120 ms, 35–55% PWM
  body: 120–300 ms, 35–55% PWM
  tail: 100–250 ms, 35–55% PWM
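
The pulse policy above could be expressed as one function that clamps a request to the initial tuning targets, drives IN1 with PWM while IN2 stays low, and always ends in coast. This is a sketch under assumptions: forward_pulse, clamp, and LIMITS are hypothetical names, and the pin setters are passed in as plain callables so the logic can be exercised on a laptop without GPIO.

```python
import time

# Initial tuning targets from the text: pulse width in ms and PWM duty as a fraction.
LIMITS = {
    "mouth": {"pulse_ms": (50, 120), "pwm": (0.35, 0.55)},
    "body": {"pulse_ms": (120, 300), "pwm": (0.35, 0.55)},
    "tail": {"pulse_ms": (100, 250), "pwm": (0.35, 0.55)},
}


def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


def forward_pulse(motor, pulse_ms, pwm, set_in1, set_in2, sleep=time.sleep):
    """Drive one forward pulse, then coast. Returns the values actually used."""
    limits = LIMITS[motor]
    pulse_ms = clamp(pulse_ms, *limits["pulse_ms"])
    pwm = clamp(pwm, *limits["pwm"])
    set_in2(0.0)            # IN2 = LOW
    set_in1(pwm)            # IN1 = PWM
    try:
        sleep(pulse_ms / 1000.0)
    finally:
        set_in1(0.0)        # coast: both inputs low, even if sleep is interrupted
        set_in2(0.0)
    return pulse_ms, pwm


# Dry run with fake pins that only record what they were told.
log = []
used = forward_pulse(
    "mouth", 500, 0.9,
    set_in1=lambda v: log.append(("IN1", v)),
    set_in2=lambda v: log.append(("IN2", v)),
    sleep=lambda s: log.append(("sleep", s)),
)
print(used)
print(log)
```

On the Pi, the two setters might assign to the value attribute of two gpiozero PWMOutputDevice objects; that wiring is an assumption and is not tested here. Note that the clamp also raises very short requests to the lower target, which may or may not be the behavior you want.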

Prototype Plan

Bring up the Pi and HiFiBerry

Install Raspberry Pi OS, configure the HiFiBerry overlay, verify audio playback through the RCA output, and confirm the speaker path.

Create the phone UI

Build a minimal FastAPI page with three buttons that trigger known local WAV files from the phone browser.

Drive one motor first

Connect only the mouth motor through one DRV8833 channel. Use short, conservative pulse tests before attempting audio-following motion.

Add envelope-following

Compute RMS amplitude from the WAV file, detect speech onsets, and map them into short mouth pulses with smoothing, thresholds, clamp limits, and mandatory off-time.

Add body and tail choreography

Use phrase starts, beat estimates, or simple timed accents to trigger the body and tail motors without making the fish look overactive.

Add TTS and GPT modes

Start with Piper or cached phrases, then add phone-hotspot GPT mode once the audio and motion systems are reliable.

Fish Risk Register

Risk	Mitigation
Motor noise corrupts audio/Pi power	Use separate motor battery, common ground, short motor wiring, and bulk capacitance near drivers.
Mouth motion lags speech	Precompute motion timelines from WAV files and start audio + motor playback from one monotonic clock.
PWM jitter	Prototype with gpiozero, then switch the motor layer to pigpio if visible jitter appears.
GPT mode needs internet	Use the phone hotspot as uplink. Keep soundboard and local TTS usable without cloud access.
DC motors are not position-controlled	Treat each motor as a pulse-driven cam mechanism. Enforce max-on time, minimum off-time, and spring-return recovery.
Endstop stall damages pinions/cams	Use DRV8833 current/fault protection as a safety net, but rely primarily on software pulse limits and conservative PWM.
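
The "one monotonic clock" mitigation can be sketched as a scheduler that measures every event against a single start time. This is an illustrative outline: run_timeline and its clock/sleep parameters are hypothetical, and starting audio playback at the same instant is left to the caller.

```python
import time


def run_timeline(events, fire, clock=time.monotonic, sleep=time.sleep):
    """Fire each event at start + event["t"], measured on one monotonic clock."""
    start = clock()                       # audio playback should start here too
    for ev in sorted(events, key=lambda e: e["t"]):
        delay = start + ev["t"] - clock()
        if delay > 0:
            sleep(delay)                  # wait until the event is due
        fire(ev)                          # late events fire immediately


class FakeClock:
    """Deterministic stand-in for time.monotonic / time.sleep."""

    def __init__(self):
        self.now = 100.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
run_timeline(
    [{"t": 0.5, "motor": "body"}, {"t": 0.25, "motor": "mouth"}],
    fire=lambda ev: print(ev["motor"], round(fake.now - 100.0, 3)),
    clock=fake,
    sleep=fake.sleep,
)
```

Because each delay is recomputed from the shared start time, a slow motor command does not push later events back; timing error does not accumulate across the clip.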

Laptop audio-to-motion development

Before the Raspberry Pi arrives, develop the motion algorithm on macOS as an offline simulator. The working development loop is: audio file → speech-band analysis → safe motor pulse timeline → ASCII fish digital twin → exported motion JSON.

Mac laptop development loop → create or load WAV/MP3 audio → run openmouth_digital_twin_ascii.py → view ASCII fish + timeline cursor → inspect mouth/body/tail trigger events → export outputs/test.motion.json → later copy JSON + audio to Raspberry Pi runtime

1. Create the Anaconda environment

conda create -n openmouth python=3.11 -y
conda activate openmouth
conda install -c conda-forge numpy scipy matplotlib librosa soundfile ffmpeg jupyterlab ipykernel -y
pip install sounddevice
python -m ipykernel install --user --name openmouth --display-name "Python (openmouth)"

2. Create the project folders

mkdir openmouth-audio-motion
cd openmouth-audio-motion
mkdir sounds outputs
# Place openmouth_digital_twin_ascii.py in this folder.

3. Generate a known-good test WAV

Use macOS text-to-speech to create a deterministic first test clip. This avoids debugging the algorithm with noisy or compressed source audio.

say "Hello, I am Open Mouth AI. I am a talking fish." -o sounds/test.aiff
ffmpeg -i sounds/test.aiff sounds/test.wav

4. Run the ASCII digital twin from Terminal

conda activate openmouth
python openmouth_digital_twin_ascii.py sounds/test.wav --preset gentle

The simulator uses macOS afplay by default for cleaner playback while Matplotlib animates. If audio clips or crackles, reduce playback gain.

python openmouth_digital_twin_ascii.py sounds/test.wav --preset gentle --playback-gain 0.35

5. Try the motion presets

Preset	Use
mouth_only	Safest first-pass algorithm mode. Only the mouth axis receives pulse events.
gentle	Conservative mouth motion with sparse body and tail accents. Best default.
animated	More lively puppet-like behavior. Useful for showmanship, but more aggressive.

python openmouth_digital_twin_ascii.py sounds/test.wav --preset mouth_only
python openmouth_digital_twin_ascii.py sounds/test.wav --preset gentle
python openmouth_digital_twin_ascii.py sounds/test.wav --preset animated

6. Export the future Pi motion file

python openmouth_digital_twin_ascii.py sounds/test.wav --preset gentle --export-json outputs/test.motion.json

The output JSON is the contract between the offline algorithm and the future Raspberry Pi motor runtime.

{ "t": 0.320, "motor": "mouth", "pulse_ms": 82, "pwm": 0.43, "reason": "speech_onset" }
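
Since the JSON is a contract, it helps to check each event before it reaches the motors. The following is a minimal sketch of such a check. The function validate_event is hypothetical (it is not the library's validate_timeline), and the accepted ranges are assumptions inferred from the sample event above.

```python
import json

MOTORS = {"mouth", "body", "tail"}
FIELDS = ("t", "motor", "pulse_ms", "pwm", "reason")


def _is_number(value):
    """True for int/float, but not for bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_event(ev):
    """Return a list of problems found in one motion event (empty if OK)."""
    problems = [f"missing field: {key}" for key in FIELDS if key not in ev]
    if problems:
        return problems
    if ev["motor"] not in MOTORS:
        problems.append(f"unknown motor: {ev['motor']}")
    if not _is_number(ev["t"]) or ev["t"] < 0:
        problems.append("t must be a non-negative number of seconds")
    if not _is_number(ev["pulse_ms"]) or ev["pulse_ms"] <= 0:
        problems.append("pulse_ms must be a positive number")
    if not _is_number(ev["pwm"]) or not 0.0 < ev["pwm"] <= 1.0:
        problems.append("pwm must be in (0, 1]")
    return problems


sample = json.loads(
    '{ "t": 0.320, "motor": "mouth", "pulse_ms": 82, "pwm": 0.43, "reason": "speech_onset" }'
)
print(validate_event(sample))  # [] means the event passed every check
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a file in a single pass.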

7. Run inside Jupyter when tuning parameters

Use Jupyter for inspecting variables, event counts, and thresholds. Use Terminal for the most reliable synchronized audio + animation.

conda activate openmouth
jupyter lab

Inside a notebook, select Python (openmouth) as the kernel, then either run the script directly:

%run openmouth_digital_twin_ascii.py sounds/test.wav --preset gentle

Or import the analyzer for interactive tuning:

from openmouth_digital_twin_ascii import analyze_audio, PRESETS, DigitalTwin

audio_path = "sounds/test.wav"
preset = PRESETS["gentle"]
audio, sr, times, env_norm, threshold, events = analyze_audio(audio_path, preset)

print(f"sample rate: {sr}")
print(f"event count: {len(events)}")
events[:5]

8. Parameters to tune first

Symptom	Adjustment
Mouth flaps too often	Increase threshold percentile, onset delta, or mouth minimum gap.
Mouth misses syllables	Lower threshold percentile or onset delta.
Motion looks twitchy	Increase envelope smoothing or mouth minimum gap.
Motion looks sluggish	Decrease smoothing or mouth minimum gap.
Pulse events seem too aggressive	Lower pulse width and PWM ranges before testing hardware.

Design philosophy

The first milestone is not conversational AI. The first milestone is a robust, portable, phone-controlled fish that can play one audio clip and flap its mouth convincingly. Once that is stable, TTS and GPT become input modes rather than architectural risks.

Exact bring-up instructions

The goal of the first bring-up milestone is intentionally narrow:

Phone browser → FastAPI UI running on Raspberry Pi → button press → WAV playback through HiFiBerry DAC → speakers verified

1. Flash Raspberry Pi OS

Install Raspberry Pi Imager

Download Raspberry Pi Imager on your laptop and insert a 32–64 GB microSD card.

Select OS

Choose Raspberry Pi OS Lite (64-bit). Lite is preferred because this project does not need a desktop environment.

Preconfigure Wi-Fi and SSH

In the advanced settings menu, enable SSH, configure your Wi-Fi network or phone hotspot SSID/password, and set a hostname such as openmouth.

Flash the SD card

Write the image, insert the card into the Pi, connect Ethernet or Wi-Fi, and power the Pi from the USB-C PD battery.

2. Connect to the Raspberry Pi

ssh pi@openmouth.local
# or
ssh pi@<Pi_IP_Address>

Update the system:

sudo apt update
sudo apt upgrade -y

3. Install the HiFiBerry DAC+ ADC

Power OFF the Pi

Never attach the HiFiBerry board while powered.

Attach the HiFiBerry board

Mount the DAC+ ADC onto the Pi GPIO header carefully and verify all pins align correctly.

Connect speakers

Use RCA or RCA-to-3.5 mm cables into powered speakers.

4. Enable the HiFiBerry overlay

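Edit the Raspberry Pi boot config. On Raspberry Pi OS Bookworm and later the file is /boot/firmware/config.txt; older releases use /boot/config.txt:

sudo nano /boot/firmware/config.txt
# older releases: sudo nano /boot/config.txt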

Add these lines near the bottom:

dtoverlay=hifiberry-dacplusadc
# OR for the Pro variant:
# dtoverlay=hifiberry-dacplusadcpro

Disable onboard audio:

dtparam=audio=off

Reboot:

sudo reboot

5. Verify the HiFiBerry audio device

After reboot:

aplay -l
arecord -l

You should see a HiFiBerry audio device listed.

6. Install Python dependencies

sudo apt install -y python3-pip python3-venv git
mkdir ~/openmouth
cd ~/openmouth
python3 -m venv venv
source venv/bin/activate
pip install fastapi uvicorn sounddevice numpy scipy gpiozero

7. Create a minimal FastAPI app

Create app.py:

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
import subprocess

app = FastAPI()

HTML = """
<html>
<body style='font-family:sans-serif;padding:40px;'>
<h1>OpenMouth AI</h1>
<button onclick="fetch('/play')">Play Test Audio</button>
</body>
</html>
"""

@app.get("/", response_class=HTMLResponse)
def root():
    return HTML

@app.get("/play")
def play():
    # Start aplay without blocking the request; the path is relative to the working directory.
    subprocess.Popen(["aplay", "sounds/test.wav"])
    return {"status": "playing"}

8. Add a test WAV file

mkdir sounds

Place a small WAV file inside:

sounds/test.wav

9. Run the FastAPI server

source venv/bin/activate
uvicorn app:app --host 0.0.0.0 --port 5000

10. Verify the phone UI

Connect phone and Pi to same network

Usually your phone hotspot network.

Open browser

Navigate to http://openmouth.local:5000 or the Pi IP address.

Press Play Test Audio

You should hear audio through the speakers connected to the HiFiBerry board.

11. First DRV8833 motor bring-up

After audio and phone UI are stable, test only one motor channel first. The goal is to find the minimum pulse that creates visible motion without audible buzzing at the hard stop.

Motor test order
1. Disconnect all fish motors except the mouth motor.
2. Use a separate low-voltage motor supply if available.
3. Command 40 ms at 35–40% PWM.
4. Increase pulse duration in small steps: 60 ms, 80 ms, 100 ms.
5. Stop increasing once the mechanism clearly actuates.
6. Set software max-on below the first pulse that causes hard-stop buzz.
7. Repeat for body and tail only after mouth behavior is safe.

Use coast mode as the default release state:

DRV8833 channel convention
Forward pulse: IN1 = PWM,  IN2 = LOW
Reverse pulse: IN1 = LOW,  IN2 = PWM    # usually unused
Coast/off:     IN1 = LOW,  IN2 = LOW
Brake:         IN1 = HIGH, IN2 = HIGH   # avoid initially

12. Troubleshooting checklist

Problem	Likely cause
No sound	Wrong audio output selected, onboard audio not disabled, or powered speakers not enabled.
FastAPI inaccessible from phone	Phone and Pi not on same network, firewall issue, or wrong IP.
Audio stutters	Weak power supply or USB battery incapable of stable Pi 4 current delivery.
HiFiBerry not detected	Overlay typo, improper seating on GPIO header, or reboot not performed.
Wrong audio device	HDMI audio still selected instead of HiFiBerry ALSA device.

13. First success milestone

At this point, the system should support:

Smartphone browser → FastAPI button → WAV playback → HiFiBerry → powered speakers.

Do not connect motors until this milestone is reliable. The DRV8833 bring-up in step 11 assumes stable audio and a working phone UI.

Python pipeline — modular architecture & song tuning

The offline laptop prototype is built as a proper Python library (openmouth/) with a strict boundary between Pi-safe signal processing and Jupyter-only visualisation tools. Five Jupyter notebooks walk through the full workflow from raw audio to validated motion JSON.

Library modules

openmouth/ is Pi-safe (numpy/scipy/librosa only). The ASCII twin imports it without circular dependency.

audio.py: load_audio · bandpass_speech (300–3400 Hz) · RMS envelope · smooth · normalize

onset.py: envelope_crossings → mouth times · librosa onsets → body/tail times

motion.py: build_timeline · Preset dataclass · PRESETS dict · safety clamps

exporter.py: validate_timeline · export_json · .motion.json contract

twin.py: ASCII fish render · big-O mouth indicator · afplay sync · animate()

Jupyter notebook workflow

Notebook	Purpose	Key output
01_audio_analysis	Inspect the raw signal chain — waveform, bandpass filter, RMS envelope stages	Visual understanding of why speech-band filtering matters
02_onset_tuning	Interactive ipywidgets sliders for threshold, smoothing, gap, onset_delta	Tuned parameter set for a specific audio file
03_motion_timeline	Stacked timeline plot — waveform, envelope, per-motor event bars (width = pulse_ms, opacity = PWM)	Visual QA before export
04_export_and_validate	Safety validation then export to .motion.json	Pi-ready motion file + event density chart
05_ascii_twin	ASCII fish animates in Jupyter cell in sync with afplay audio	Real-time sanity check of the full motor timeline

Built-in preset library

Six presets ship with the library. Each is a Preset dataclass instance registered in PRESETS. batch_tune.py selects one automatically by matching filename keywords.

Preset	threshold	smooth	mouth_gap	Best for
hiphop	0.54	4	0.13 s	Default — rap, hip-hop, compressed pop vocals
pop	0.45	7	0.10 s	Sustained pop vocals (All Star, I Will Survive, This Love)
jpop	0.32	5	0.10 s	Dynamic J-pop / anime (Cruel Angel's Thesis)
ballad	0.22	9	0.12 s	Slow ballads and acoustic tracks with wide dynamics
edm	0.50	3	0.08 s	Electronic / dance — very fast transient response
speech	0.18	6	0.09 s	Speech-heavy tracks, podcasts, McDonald's jingle

from openmouth.motion import PRESETS, Preset

# Use a built-in preset
preset = PRESETS["hiphop"]

# Keyword auto-selection (same logic as batch_tune.py)
# cruelangel*.mp3 → jpop | mcdonalds*.mp3 → speech | default → hiphop

pop_song preset

All Star — Smash Mouth

A maximally compressed pop track. The speech-band envelope stays above 0.20 for ~80% of the clip — the default gentle threshold of 0.18 barely crosses upward, producing only ~0.5 mouth events/second. Raising to 0.45 puts it in the zone where the envelope fluctuates with syllable energy.

Parameter	Value	Rationale
threshold	0.45	Envelope stays high; need a higher crossing point
smoothing_frames	7	Standard — keeps envelope clean
body_min_gap_s	0.50 s	Body bobs at ~104 BPM beat rate
tail_min_gap_s	1.00 s	One sweep per bar
onset_delta	0.07	Standard onset sensitivity

Duration : 60.0 s | 235 events total
mouth : 137 (2.28/s) pulse 60–118 ms
body  : 64 (1.07/s) pulse 141–237 ms
tail  : 34 (0.57/s) pulse 120–199 ms
✅ No safety warnings

jpop preset

Cruel Angel's Thesis

A traditionally mastered J-pop track with real dynamic range. The envelope only exceeds 0.45 for ~31% of the clip, so pop_song starves the vocal sections. The first ~20 s is the instrumental organ intro — fewer events there is correct and expected.

Parameter	Value	Rationale
threshold	0.32	Peak crossing zone for this song's dynamic range
smoothing_frames	5	Lighter — crisper syllable tracking
body_min_gap_s	0.55 s	~130 BPM; slightly sparser body bobs
tail_min_gap_s	1.00 s	One sweep per bar
onset_delta	0.06	More sensitive to J-pop percussion transients

Duration : 60.0 s | 196 events total
mouth : 102 (1.70/s) pulse 61–108 ms
body  : 61 (1.02/s) pulse 141–228 ms
tail  : 33 (0.55/s) pulse 126–190 ms
✅ No safety warnings

Audio analysis — signal chain & event timelines

Top row: raw waveform amplitude. Bottom row: normalized speech-band envelope (orange fill), threshold (red dashed), mouth events (orange lines), body events (purple, bottom 40%), tail events (teal, bottom 25%).

Why different thresholds?

Each curve shows the percentage of time the normalized envelope exceeds a given threshold. All Star (orange) stays high everywhere — a direct result of heavy pop compression. Cruel Angel's Thesis (purple) has a steeper drop-off, reflecting its traditional dynamic range. The shaded bands mark each song's tuned threshold, sitting at the inflection point where crossings are most meaningful.

Event density per 10-second window

Mouth, body, and tail events distributed across each 60-second clip. The low count in the first two windows of Cruel Angel's Thesis is intentional — those are the instrumental intro bars before the vocalist enters. Both songs maintain consistent density across their vocal sections.
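
Counting events per window is simple enough to reproduce outside the notebooks. The helper below is a sketch, assuming events are dicts with the "t" and "motor" fields of the motion JSON; the name density is hypothetical.

```python
import math


def density(events, window_s=10.0, duration_s=60.0):
    """Count events per motor in consecutive windows of window_s seconds."""
    n_windows = math.ceil(duration_s / window_s)
    counts = {}
    for ev in events:
        # An event exactly at the end of the clip is counted in the last window.
        idx = min(int(ev["t"] // window_s), n_windows - 1)
        counts.setdefault(ev["motor"], [0] * n_windows)[idx] += 1
    return counts


events = [
    {"t": 1.0, "motor": "mouth"},
    {"t": 9.99, "motor": "mouth"},
    {"t": 10.0, "motor": "mouth"},
    {"t": 25.0, "motor": "body"},
]
print(density(events, window_s=10.0, duration_s=30.0))
# {'mouth': [2, 1, 0], 'body': [0, 0, 1]}
```

A run of near-empty windows at the start of a clip is the signature of an instrumental intro; the same pattern in the middle of a vocal section usually means the threshold is too high.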

Tuning guide — adapting a new song

The key diagnostic is the envelope distribution: check what percentage of time the envelope exceeds various thresholds, then pick the value at the inflection point where the curve bends steeply — that is where crossings are most responsive to actual vocal energy.

Symptom	Diagnosis	Fix
Too few mouth events (<0.8/s)	Threshold too high for this song's dynamic range	Lower threshold toward the envelope's inflection point
Too many mouth events (>3.5/s)	Threshold too low — firing on background energy	Raise threshold or increase smoothing_frames
Events bunched, then silent	Uneven dynamics (intro vs chorus)	Use norm_percentile=95 for more aggressive ceiling, or clip to vocal section only
Events feel jittery / chattery	Smoothing too light for this style	Increase smoothing_frames (5 → 7 → 10)
Syllables blurring together	Smoothing too heavy	Decrease smoothing_frames; increase mouth_min_gap_s slightly
Body / tail too active	Onset detector too sensitive	Increase onset_delta (0.06 → 0.08 → 0.10)

# Quick diagnostic: run before picking a threshold
import numpy as np
from openmouth.audio import (load_audio, bandpass_speech, rms_envelope,
                             smooth_envelope, normalize_envelope)

audio, sr = load_audio("sounds/my_song.mp3", target_sr=22050)
filtered = bandpass_speech(audio, sr)
env_raw = rms_envelope(filtered, frame_length=512, hop_length=256)
env_s = smooth_envelope(env_raw, window_frames=7)
env_n = normalize_envelope(env_s, percentile=98.0)

print("Envelope distribution:")
for t in [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]:
    pct = 100 * np.mean(env_n > t)
    bar = "X" * int(pct / 2)
    print(f" > {t:.2f}: {pct:5.1f}% {bar}")

# Rule of thumb: pick the threshold where the value drops from >60% to ~30-40%.
# That inflection is where the envelope has the most meaningful crossings.

How events are triggered — mouth, body, and tail

The mouth and the body/tail motors are driven by two completely independent audio features. Understanding the difference is the key to tuning them independently.

MOUTH — RMS ENVELOPE THRESHOLD CROSSINGS

The audio is bandpass-filtered (≈80–3000 Hz, the speech band) to remove low rumble and high noise. The RMS energy is computed in short overlapping windows (~23 ms), smoothed by a moving average (smoothing_frames), and normalized so the loudest moment in the track equals 1.0.
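
The envelope stages can be illustrated in plain Python. This is a simplified stand-in for the library functions, not their implementation: frame_rms, moving_average, and peak_normalize are hypothetical names, the bandpass filter is omitted, and normalization here uses the true peak as the paragraph describes.

```python
import math


def frame_rms(samples, frame_length=512, hop_length=256):
    """RMS energy of each overlapping frame."""
    if not samples:
        return []
    last_start = max(len(samples) - frame_length, 0)
    env = []
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        env.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return env


def moving_average(env, window_frames):
    """Centered moving average; the edges average whatever frames exist."""
    half = window_frames // 2
    out = []
    for i in range(len(env)):
        chunk = env[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def peak_normalize(env):
    """Scale so the loudest frame equals 1.0."""
    peak = max(env, default=0.0)
    return [v / peak for v in env] if peak > 0 else list(env)


# One frame of 512 samples at 22 050 Hz lasts about 23.2 ms.
print(round(512 / 22050 * 1000, 1))

# Silence followed by a full-scale tone: the envelope ramps from 0 to 1.
signal = [0.0] * 512 + [1.0] * 512
print(peak_normalize(moving_average(frame_rms(signal), 1)))
```

With a hop of half the frame length, consecutive frames overlap by 50%, which is why the middle frame in the example lands between the silent and the loud value.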

The mouth fires on upward crossings of threshold — the exact frame where the envelope rises from below to above the value. A sustained loud note produces exactly one event at the moment it starts, not a continuous stream.
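
A minimal sketch of upward-crossing detection, assuming the envelope is a list of normalized frame values with one timestamp per frame. The function name upward_crossings is hypothetical and may differ from the library's envelope_crossings.

```python
def upward_crossings(times, env, threshold):
    """Times of frames where the envelope rises from below the threshold to at or above it."""
    return [
        times[i]
        for i in range(1, len(env))
        if env[i - 1] < threshold <= env[i]
    ]


# Frame indices stand in for timestamps. The loud stretch at frames 1-4
# yields one event at its start; the second rise at frame 6 yields another.
env = [0.1, 0.5, 0.9, 0.9, 0.8, 0.2, 0.6]
print(upward_crossings(list(range(len(env))), env, threshold=0.45))  # [1, 6]
```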

After all crossings are collected, a gap filter (mouth_min_gap_s) discards any crossing that arrives too soon after the previous one, preventing rapid double-fires on staccato syllables.

Each crossing becomes a MotionEvent with a randomly drawn pulse_ms (e.g. 60–115 ms) and duty cycle. On the physical fish the motor drives for exactly that many milliseconds, then power cuts and a spring returns the jaw to closed. Duration of the open state is controlled entirely by pulse_ms and the spring — there is no separate "close" command.
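
As an illustration of the random draw, the sketch below turns crossing times into events. The dataclass is a stand-in that mirrors the fields of the exported JSON; the real MotionEvent, the 0.35–0.55 duty range used here, and the helper name make_mouth_events are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class MotionEvent:
    """Stand-in event type mirroring the motion JSON fields."""
    t: float
    motor: str
    pulse_ms: int
    pwm: float
    reason: str


def make_mouth_events(crossing_times, pulse_range=(60, 115),
                      pwm_range=(0.35, 0.55), seed=None):
    """One mouth pulse per crossing, with pulse width and duty drawn at random."""
    rng = random.Random(seed)   # a fixed seed makes the timeline reproducible
    return [
        MotionEvent(
            t=t,
            motor="mouth",
            pulse_ms=rng.randint(*pulse_range),          # inclusive bounds
            pwm=round(rng.uniform(*pwm_range), 2),
            reason="speech_onset",
        )
        for t in crossing_times
    ]


for event in make_mouth_events([0.32, 0.90, 1.41], seed=7):
    print(event)
```

Seeding the generator is worth considering for exported files: the same audio and preset then always produce the same motion JSON, which makes regressions easier to spot.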

BODY & TAIL — LIBROSA ONSET DETECTION

Body and tail events are driven by a completely different signal: librosa's spectral flux onset detector, which measures how quickly the frequency content changes. It is sensitive to percussive transients — drum hits, consonant bursts, strong beat attacks — not to sustained loudness.

onset_delta sets the minimum strength a local peak must exceed its local average. Lower delta → more onsets detected → denser body/tail events. Both motors draw from the same onset pool; what differentiates them is only the gap filter applied afterward.

Body vs Tail spacing example
Given onsets at [0.1, 0.3, 0.5, 0.9, 1.1, 1.8] s:
Body (gap 0.50 s) → keeps [0.1, 0.9, 1.8]
Tail (gap 1.00 s) → keeps [0.1, 1.1]
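Both motors filter the same onset pool independently, so the tail is sparser than the body but not a strict subset of it: 1.1 s is kept for the tail and dropped for the body.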
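
The spacing example can be reproduced with a few lines of Python. This sketch assumes the gap is measured from the last kept event; gap_filter is a hypothetical name, and the small eps only guards against floating-point rounding at the boundary.

```python
def gap_filter(times, min_gap_s, eps=1e-9):
    """Keep an event only if it is at least min_gap_s after the last kept one."""
    kept = []
    for t in sorted(times):
        if not kept or t - kept[-1] >= min_gap_s - eps:
            kept.append(t)
    return kept


onsets = [0.1, 0.3, 0.5, 0.9, 1.1, 1.8]
print(gap_filter(onsets, 0.50))  # body: [0.1, 0.9, 1.8]
print(gap_filter(onsets, 1.00))  # tail: [0.1, 1.1]
```

Each call starts from the full onset pool, so the two motors are filtered independently of each other.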

Like the mouth, each event fires for a randomly drawn pulse_ms and returns to rest when power cuts. The fin position (up vs. down) is a deterministic alternation in the digital twin; on the physical fish both fins have a single motor each, so direction is not controllable — only duration and duty cycle.

KEY ARCHITECTURAL POINT

Mouth and body/tail are driven by entirely independent features of the audio. A quiet verse with active drums can produce dense body/tail events while the mouth stays closed. A sustained loud note triggers a single mouth pulse at its onset and may produce no body or tail activity at all. Tuning one never disturbs the other: adjust threshold and mouth_min_gap_s for the mouth; adjust onset_delta and the two gap parameters for body and tail.

Threshold intuition

threshold is relative to the track's loudest moment after normalization. The envelope distribution table printed by tune_song.py tells you what percentage of the clip sits above each candidate value — use it to find the inflection point.

Scenario	What you see	Threshold guidance
Heavily compressed pop (All Star)	Envelope above 0.20 for ~80% of the track	Need a high threshold (≈0.45–0.54) to catch only syllable peaks
Dynamic J-pop (Cruel Angel)	Envelope above 0.45 for only ~31% of the track	Lower threshold (≈0.30–0.35) to avoid starving vocal moments
Instrumental section firing	Mouth opens when nobody is singing	Raise threshold, or drop the `--no-vad` flag to re-enable VAD gating
Vocals barely trigger mouth	Mouth stays closed through singing	Lower threshold; also check smoothing_frames isn't washing out syllable peaks

Smoothing and gap intuition

smoothing_frames sets the moving-average window on the RMS envelope before threshold comparison. Larger windows blur rapid syllables into a single sustained lump; smaller windows track individual syllables but also track noise.

Parameter	Lower value	Higher value
`smoothing_frames`	Crisp per-syllable tracking (fast speech, J-pop)	Smooth phrase-level shapes (slow ballads, instruments)
`mouth_min_gap_s`	More events; can blur adjacent syllables together	Fewer events; sparser but cleaner open-close cycles
`onset_delta`	More body/tail events; tracks subtle transients	Fewer events; only strong beats and accents fire
`body_min_gap_s`	Body bobs at beat rate	Body bobs at bar rate
`tail_min_gap_s`	Tail moves more frequently (approaching body density)	Tail sweeps slowly, only on the strongest beats

batch_tune.py — one-command batch processor

batch_tune.py is the primary workflow for generating motion files. Run it once from openmouth-audio-motion/ to process every audio file in sounds/ and write .motion.json files to outputs/. The SoundPond web app picks up the results immediately — no server restart required.

Usage

# Tune everything with auto-selected presets
python batch_tune.py

# Force one preset for all files
python batch_tune.py --preset hiphop

# Per-file preset overrides
python batch_tune.py \
    --map McDonalds_60s.wav=speech \
    --map LionKing_60s.wav=pop

# Only process files whose JSON doesn't exist yet
python batch_tune.py --skip-existing

# Preview without writing any files
python batch_tune.py --dry-run

Auto-preset keyword rules

Keyword in filename	Preset selected
mcdonalds, speech, podcast	speech
cruelangel, jpop, anime	jpop
lionking, musical, broadway	pop
survive, thislove	pop
ballad, acoustic, slow	ballad
edm, electronic, techno	edm
(anything else)	hiphop
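
The keyword table maps naturally onto an ordered rule list. The sketch below is one way to express it, not the actual batch_tune.py code: select_preset and KEYWORD_RULES are hypothetical names, and case-insensitive substring matching on the filename stem is an assumption.

```python
from pathlib import Path

# Rules are checked top to bottom, in the same order as the table.
KEYWORD_RULES = [
    (("mcdonalds", "speech", "podcast"), "speech"),
    (("cruelangel", "jpop", "anime"), "jpop"),
    (("lionking", "musical", "broadway"), "pop"),
    (("survive", "thislove"), "pop"),
    (("ballad", "acoustic", "slow"), "ballad"),
    (("edm", "electronic", "techno"), "edm"),
]


def select_preset(filename, default="hiphop"):
    """Return the first preset whose keyword appears in the filename stem."""
    stem = Path(filename).stem.lower()
    for keywords, preset in KEYWORD_RULES:
        if any(keyword in stem for keyword in keywords):
            return preset
    return default


for name in ["cruelangel_60s.wav", "McDonalds_60s.wav", "LionKing_60s.wav", "Gold Digger.mp3"]:
    print(name, "->", select_preset(name))
```

Substring matching is forgiving but can misfire on unrelated names that happen to contain a keyword, which is where an explicit per-file override is the safer route.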

Pipeline steps (per file)

load_audio: librosa · 22 050 Hz mono

bandpass: speech 80–3400 Hz + bass 60–300 Hz

RMS envelope: 512-frame windows · smooth · normalize

envelope_crossings: mouth times (upward threshold crossings)

detect_speech_onsets: body + tail times (librosa onset detector)

build_timeline: pure pulse events sorted by time

build_timeline — simplified event model

All events are pure discrete pulses. The timeline builder applies no hold/sustain classification, no tempo scaling, no anticipation offsets, and no phrase-boundary suppression — features that were found to cause gear stalling and unnatural motion. Each crossing or onset maps directly to one MotionEvent.

Motor	Source	Mode	Spacing enforced by
Mouth	Upward envelope crossings of `threshold`	pulse only	`mouth_min_gap_s`
Body	Broadband librosa onset detector	pulse only	`body_min_gap_s`
Tail	Same broadband onset pool (sparser gap)	pulse only	`tail_min_gap_s`

Sample batch_tune.py output

Batch tuning 9 file(s)
Sounds : /…/openmouth-audio-motion/sounds
Outputs : /…/openmouth-audio-motion/outputs

→ All Star.mp3 [pop] ✓ 205.8s 104.3bpm mouth:147 body:183 tail: 94 phrases:0 [8.2s]
→ cruelangel_60s.wav [jpop] ✓ 60.0s 130.6bpm mouth:132 body:156 tail: 78 phrases:0 [2.1s]
→ McDonalds_60s.wav [speech] ✓ 60.0s 95.2bpm mouth:119 body: 88 tail: 44 phrases:0 [2.3s]

──────────────────────────────────────────────────────────────────────────────
File                 Preset   Dur     BPM     Scale   Mth   Bdy   Tl   Time
──────────────────────────────────────────────────────────────────────────────
All Star.mp3         pop      205.8   104.3   ×1.00   147   183   94   8.2s
cruelangel_60s.wav   jpop     60.0    130.6   ×1.00   132   156   78   2.1s
McDonalds_60s.wav    speech   60.0    95.2    ×1.00   119   88    44   2.3s
──────────────────────────────────────────────────────────────────────────────
9/9 succeeded

🎬 Digital Twin Case Studies

The fish digital twin is rendered in a Jupyter notebook, which makes each tuning iteration quick to inspect. Here are two real songs rendered through the full pipeline: audio analysis → motion timeline → image-frame animation → MP4. Each clip shows 30 seconds of the digital twin animating with all 8 body states active.

The 8 animation states

Every animation frame is one of these 8 images. The pipeline picks the correct image at each timestamp based on which motors are currently active. Front fin and tail fin each have two states (flat / flapped); mouth has two states (closed / open) — giving 2 × 2 × 2 = 8 combinations.

State	Mouth	Front fin	Tail fin	mouth front tail
Idle	closed	flat	flat	0 0 0
Body accent	closed	flap	flat	0 1 0
Tail sweep	closed	flat	flap	0 0 1
Body + tail	closed	flap	flap	0 1 1
Mouth only	open	flat	flat	1 0 0
Mouth + body	open	flap	flat	1 1 0
Mouth + tail	open	flat	flap	1 0 1
Full expression	open	flap	flap	1 1 1

Key: orange border = mouth open · blue pill = front fin active · green pill = tail fin active
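
Picking the frame for a given timestamp amounts to checking which motors have a pulse in progress. The sketch below assumes a motor counts as active only during its pulse window and that the front fin corresponds to the body motor; the real renderer may hold a pose longer, and active_state is a hypothetical name.

```python
# (mouth, front, tail) → state name, following the 8 states listed above.
STATE_NAMES = {
    (0, 0, 0): "Idle",
    (0, 1, 0): "Body accent",
    (0, 0, 1): "Tail sweep",
    (0, 1, 1): "Body + tail",
    (1, 0, 0): "Mouth only",
    (1, 1, 0): "Mouth + body",
    (1, 0, 1): "Mouth + tail",
    (1, 1, 1): "Full expression",
}


def active_state(events, t):
    """Return the (mouth, front, tail) flags for time t in seconds."""
    active = {"mouth": 0, "body": 0, "tail": 0}
    for ev in events:
        if ev["t"] <= t < ev["t"] + ev["pulse_ms"] / 1000.0:
            active[ev["motor"]] = 1
    return (active["mouth"], active["body"], active["tail"])


events = [
    {"t": 0.30, "motor": "mouth", "pulse_ms": 100},
    {"t": 0.35, "motor": "tail", "pulse_ms": 200},
]
for t in (0.0, 0.36, 0.5):
    print(t, STATE_NAMES[active_state(events, t)])
```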

"All Star" — Smash Mouth

hiphop preset

Clip: 00:10 – 00:40 · Preset: hiphop · Mode: full (all 8 states)

[Figure: event density per 5 s window for mouth, body, and tail; top axis = track time]

PRESET PARAMETERS

Smoothing frames	4
Threshold	0.54
Mouth min gap	0.13 s
Body min gap	0.50 s
Tail min gap	1.00 s
Onset delta	0.07

TUNING RATIONALE

All Star is maximally compressed: the RMS envelope stays above 0.20 for ~80% of the track. A high threshold of 0.54 sits at the inflection point where the envelope actually fluctuates with lyric energy, preventing the mouth from firing constantly on flat sections. Body bobs at beat rate (~2/s), tail sweeps at bar rate (~1/s).

"Cruel Angel's Thesis" — Yoko Takahashi

jpop preset

Clip: 00:25 – 00:55 · Preset: jpop · Mode: full (all 8 states)

[Figure: event density per 5 s window for mouth, body, and tail; top axis = track time]

PRESET PARAMETERS

Smoothing frames	5
Threshold	0.32
Mouth min gap	0.10 s
Body min gap	0.55 s
Tail min gap	1.00 s
Onset delta	0.06

TUNING RATIONALE

J-pop has real dynamic range unlike maximally-compressed Western pop — the envelope only exceeds 0.45 for ~31% of the track. Using pop_song's threshold of 0.45 would starve the vocals. A lower threshold of 0.32 with lighter smoothing (5 frames) preserves crisp Japanese syllable articulation. Body bobs at ~130 BPM beat structure.

Pipeline Comparison

Same 60-second clip processed through multiple presets — event counts per motor illustrate how preset parameters shape animation density.

Preset	Song character	Mouth	Body	Tail	Total / 60s
`hiphop`	Hip-hop / rap, sustained	80	97	61	238
`jpop`	J-pop / anime, dynamic range	102	61	33	196
`pop_song`	Western pop, compressed	307	197	140	644
`animated`	High-energy / expressive	196	205	205	606

Event counts from the full Cruel Angel's Thesis track (252s) across presets. Higher isn't always better — the goal is matching the song's rhythmic character, not maximizing raw count.

SoundPond — Web Controller

FastAPI server running on the Raspberry Pi at http://openbass.local:8000. Any device on the same Wi-Fi can open the page. SoundPond is a SoundCloud-inspired interface that auto-discovers every audio file in sounds/ and displays waveform previews with motion-sync status.

🎵

Auto-discovery

Scans sounds/ on startup. Drop a file in, restart, it appears. WAV · MP3 · FLAC · AIFF · M4A.

〰️

Live waveforms

200-bar RMS waveform for every track — hero scrubber + per-card mini waveform. Decoded in the background.

⚡

Motion sync

⚡ badge on tracks with a matching .motion.json. Play fires audio + all three motors in perfect sync.

🗣️

Live TTS

Type anything and the fish speaks it via Piper neural TTS, with full mouth sync through the same pipeline.

⚙️

DOF controls

Enable / disable mouth, body, or tail independently. Event divider (÷1 ÷2 ÷3) and carrier frequency per motor.

🎛️

Live PWM tuning

Drive PWM, pulse ms, coast ms sliders per motor. Changes apply to the very next event — no JSON regen needed.
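
The 200-bar RMS waveform listed under Live waveforms could be computed roughly as follows. This is a sketch under the assumption that the decoded audio is a flat list of mono float samples; the function name `rms_bars` is illustrative and the project's actual decoder may differ.

```python
import math


def rms_bars(samples, n_bars=200):
    """Split samples into n_bars near-equal chunks and return each chunk's RMS."""
    if not samples or n_bars <= 0:
        return []
    n_bars = min(n_bars, len(samples))  # never create empty chunks
    total = len(samples)
    bars = []
    for i in range(n_bars):
        start = i * total // n_bars
        end = (i + 1) * total // n_bars
        chunk = samples[start:end]
        bars.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return bars


# Hypothetical input: 800 samples of a full-scale square wave.
demo = rms_bars([1.0, -1.0] * 400)
print(len(demo), demo[0])
```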

[Interface mockup: SoundPond at http://openbass.local:8000. Track list: All Star ⚡ (3:25), Cruel Angel ⚡ (1:00), Ave Maria (--:--), Gold Digger ⚡ (1:00), McDonald's ⚡ (1:00). Now Playing panel: All Star, Audio + Motor sync ⚡, hiphop preset, with Active DOFs toggles (Mouth, Body, Tail), Event Divider (÷1, ÷2, ÷3) for Body and Tail, PWM Carrier (125 Hz, 250 Hz, 1 kHz, 2 kHz), PWM Tuning sliders for Mouth (Drive PWM 96%, Pulse ms 120, Coast ms), and a Live TTS box with a Speak button.]

REST API Endpoints

Method	Path	Description
GET	/sounds	List all discovered tracks with label, filename, and `has_motion` flag
GET	/waveform/{stem}	200-bar RMS waveform + duration for any audio file. Cached after first decode. Decodes WAV with the standard library, falling back to librosa or ffmpeg.
GET	/play?file={name}	Play audio + fire motor timeline if `.motion.json` exists. Kills any current playback first.
GET	/status	Returns `{"playing", "elapsed", "duration"}` — polled every second to drive the waveform scrubber.
GET	/stop	Kill all playback and reset motor state immediately.
GET	/tts?text={phrase}	Synthesize speech via Piper TTS, compute motion timeline on the fly, play with mouth sync.
GET	/set_pwm	Live-update any motor parameter (drive_pwm, pulse_ms, coast_ms, freq, divider). No restart required.
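
A client only needs to build the right URL for each endpoint. The sketch below uses the host, paths, and query names from the table above; the helper name `endpoint_url` and the sample phrase are illustrative. The actual network call is left as a comment, since it needs the Pi to be reachable.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://openbass.local:8000"  # host and port as given above


def endpoint_url(path, **params):
    """Build a full URL for one of the endpoints in the table above."""
    query = urlencode(params, quote_via=quote)  # spaces become %20
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


play_url = endpoint_url("/play", file="All Star.mp3")
tts_url = endpoint_url("/tts", text="Hello, I am a fish")
status_url = endpoint_url("/status")
print(play_url)
print(tts_url)
print(status_url)
# To call the Pi for real: urllib.request.urlopen(play_url, timeout=5)
```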

🎙 Recording Studio — `/record`

Segment-based punch-in recorder. Play audio on the Pi, hold a key to fire each DOF, release to save the event. Each take is stored as a time-stamped segment; overlapping takes use a newer-wins compile strategy. Bake to a .motion.json when satisfied.

[Interface mockup: Recording Studio at http://openbass.local:8000/record. Key bindings: J Mouth · K Body · L Tail · Space Start. The page shows a track selector (All Star.mp3), a motor selector (Mouth, Body, Tail), an Audio transport (Play, Pause, Stop, Stop All), a Record transport (Rec, Pause, Stop), a Reaction offset slider (150 ms), a Segments list per motor (for example Mouth: 0:00 – 0:31 with 12 events plus an active take from 0:31 with 4 events; Body: 0:00 – 3:25 with 38 events; Tail: no segments), and a Playback on Pi panel with Auto body & tail, Play on Pi, Stop, and Bake to JSON.]

📼

Segment punch-in

Each ⏺ Rec → ⏸ Pause cycle saves a named segment with its exact time range. Later takes auto-replace only the covered window — the rest of the song is untouched.

⏱

Reaction offset

Human reaction time is ~150 ms. The slider shifts every event timestamp backward by that amount so the fish movement lines up with the beat you heard, not the beat you reacted to.

🤖

Auto body & tail

Check this to pull body and tail events straight from the existing .motion.json during Pi Playback. Focus entirely on mouth — the fish moves naturally on its own.

⬇

Bake to JSON

Compiles all segments with newer-wins logic and patches the final {stem}.motion.json. The soundboard picks it up instantly — no restart needed.

🎙 Tuning Custom Songs

The auto-generated motion pipeline is a strong starting point, but for songs where the beat is irregular, the lyrics are dense, or you simply want a specific performance character, the Recording Studio lets you hand-craft every movement. This page explains the full workflow from audio file to finished .motion.json.

Three-phase workflow

Each phase builds on the last. You can stop after any phase and still get a working performance.

1️⃣

Auto-generate

Run generate_motion.py to produce a baseline .motion.json from the audio envelope. Body and tail follow the beat; mouth follows speech-band energy.

2️⃣

Tune PWM live

On the Soundboard, play the track and adjust Drive PWM, pulse ms, hold PWM, close PWM, and PWM carrier frequency until each DOF moves exactly as wanted.

3️⃣

Record & punch-in

Open the Recording Studio, enable Auto body & tail, and focus on mouth only. Use Space / J to record holds in real time. Punch in over any phrase that needs a redo.

Step-by-step guide

Drop the audio file

Copy your WAV, MP3, FLAC, AIFF, or M4A file into openmouth-audio-motion/sounds/. Restart the server (or let it auto-reload if you started it with --reload). Once the server is back up, the track appears in both the Soundboard and Recording Studio selectors.

Run the auto-generator

From the project root: python generate_motion.py "Track Name". This writes outputs/Track Name.motion.json. The Soundboard will now show the ⚡ badge and fire all three DOFs when you play the track. This is your baseline — every subsequent step refines it.

Tune PWM parameters on the Soundboard

Play the track on the Soundboard and watch the fish. Adjust these sliders until the mechanics feel right:

Parameter	What it controls	Start point
Mouth Drive PWM	Opening speed & force. Too low → mouth stalls. Too high → impact noise.	96%
Mouth Drive ms	How long the drive pulse lasts before switching to hold. Longer = wider open.	120 ms
Mouth Hold PWM	Minimum duty cycle to keep mouth open against the spring. Just above stall point.	40%
Mouth Close PWM	Reverse pulse strength for the active-close phase.	85%
Body Kick PWM	Initial burst to overcome static friction before travel phase.	71%
Body Travel PWM	Sustained travel duty cycle after the kick.	47%
Body Hold PWM	Stall current to hold body at end-of-travel against spring.	26%
PWM Carrier (body/tail)	Carrier frequency. Lower = more torque ripple but cooler motor. Higher = smoother but more heat.	125 Hz

💡 Tip: Changes apply to the very next motor event — no JSON regeneration needed. Tune live while the fish is moving.

Open the Recording Studio

Click 🎙 Record in the Soundboard nav. Select your track from the dropdown. The waveform loads automatically. The three motor key bindings are shown in the top-right corner: J Mouth · K Body · L Tail.

Enable Auto body & tail

In the Playback on Pi panel, check 🤖 Auto body & tail (ignore segments). This tells Pi Playback to pull body and tail events from the existing .motion.json instead of any manually recorded segments. You can then focus 100% on mouth quality without managing body and tail timing simultaneously.

When to turn this off: Once mouth is finalized, uncheck Auto body & tail, select Body or Tail as the active motor, and record those DOFs with the same punch-in workflow. Each motor's segments compile independently.

Set the reaction offset

Human reaction time between hearing a beat and pressing a key is typically 120–200 ms. The ⏱ Reaction offset slider (default 150 ms) automatically shifts every recorded event timestamp backward by that amount. You press the key when you hear the beat; the system records it as if you pressed it on the beat.

Calibrate by recording a single obvious beat, playing it back, and listening for the lag. Increase the offset if the fish lags the beat; decrease if it leads.
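
The offset itself is a simple subtraction. Here is a minimal sketch, assuming recorded key-press times are in seconds. Clamping at 0 s and rounding to milliseconds are assumptions made for the example, and `apply_reaction_offset` is an illustrative name.

```python
def apply_reaction_offset(event_times_s, offset_ms=150):
    """Shift recorded timestamps earlier by offset_ms, clamping at 0 s."""
    offset_s = offset_ms / 1000.0
    return [max(0.0, round(t - offset_s, 3)) for t in event_times_s]


pressed = [0.10, 1.15, 2.40]  # hypothetical key-press times in seconds
aligned = apply_reaction_offset(pressed)
print(aligned)
```

A larger offset moves every event earlier, which is the correction to apply when the fish lags the beat on playback.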

Record your first segment

Press ▶ Play (Audio row) to start the song on the Pi speaker. Press Space or ⏺ Rec when you're ready to start capturing. Hold J whenever you want the mouth open — release to close. Press Space again to save the segment.

Each key-hold records a single event. Short taps (< 200 ms) produce pulse events; holds of 200 ms or longer produce hold-mode events that keep the mouth open for the full duration.
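
The tap-versus-hold rule can be sketched as below. The 200 ms boundary and the `mode` and `hold_ms` fields come from this page; the `t` and `motor` keys and the function name `key_event` are assumptions made for illustration.

```python
HOLD_MIN_MS = 200  # a key held this long or longer becomes a hold event


def key_event(press_s, release_s, motor="mouth"):
    """Turn one key press/release pair into a pulse or hold event dict."""
    duration_ms = round((release_s - press_s) * 1000)
    mode = "hold" if duration_ms >= HOLD_MIN_MS else "pulse"
    event = {"t": press_s, "motor": motor, "mode": mode}
    if mode == "hold":
        event["hold_ms"] = duration_ms  # keep the mouth open for the full hold
    return event


tap = key_event(1.00, 1.12)    # 120 ms tap
held = key_event(2.00, 2.45)   # 450 ms hold
print(tap)
print(held)
```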

Punch in over any section

Click the timeline canvas to seek to any position, or use ⏮ Rewind. Press ▶ Play to resume, then Space to start recording again. The new segment will cover only its own time range — all other existing events are preserved. This is the newer-wins compile strategy: later recordings always win within their window.

Workflow pattern: Record the whole song roughly → punch in chorus → punch in verses → punch in any awkward transitions. Each pass is non-destructive outside its own time range.

Preview with Pi Playback

Press ▶ Play on Pi in the Playback panel. This compiles all mouth segments (and pulls auto body/tail from the motion file), pushes the event list to the server, then plays audio + motors in full sync. This is the same path as the Soundboard — what you see is what you ship.

Bake to JSON

When the performance is final, press ⬇ Bake to JSON. This compiles all segments and patches them into outputs/{stem}.motion.json using newer-wins logic. The Soundboard immediately uses the new file — no server restart required. The raw segment files are kept, so you can always re-bake with different settings.

🧠 Concepts to know

Segment

A named recording take with a t_start, t_end, and list of motor events. Stored in {stem}.{motor}.segments.json alongside the audio file.

Newer-wins compile

Segments are sorted by recorded_at. Each later segment erases all events in its time range from earlier segments, then inserts its own. The result is a clean, flat, sorted event list.

Hold vs Pulse events

A key held ≥ 200 ms becomes a hold event — the motor opens and holds until hold_ms elapses. A shorter tap becomes a pulse — a single drive–coast cycle. Both are stored the same way; the dispatcher picks the right motion sequence at play time.

Motion override

Pi Playback pushes the compiled event list into an in-memory _motion_override dict keyed by stem. /play checks this before opening the .motion.json file. Bake writes the override back to disk permanently.
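
The newer-wins compile described above can be sketched in a few lines. The segment fields `recorded_at`, `t_start`, and `t_end` are named on this page; the `events` list, the per-event `t` key, and the half-open window `[t_start, t_end)` are assumptions made for the example.

```python
def compile_newer_wins(segments):
    """Flatten segments into one sorted event list; later takes win."""
    compiled = []
    for seg in sorted(segments, key=lambda s: s["recorded_at"]):
        # Erase earlier events that fall inside this segment's window.
        compiled = [
            e for e in compiled
            if not (seg["t_start"] <= e["t"] < seg["t_end"])
        ]
        compiled.extend(seg["events"])
    return sorted(compiled, key=lambda e: e["t"])


# Hypothetical takes: a rough full pass, then a punch-in over 10 s to 15 s.
segments = [
    {"recorded_at": 1, "t_start": 0.0, "t_end": 31.0,
     "events": [{"t": 5.0}, {"t": 12.0}, {"t": 30.0}]},
    {"recorded_at": 2, "t_start": 10.0, "t_end": 15.0,
     "events": [{"t": 11.0}, {"t": 14.0}]},
]
flat = compile_newer_wins(segments)
print([e["t"] for e in flat])
```

The event at 12.0 s from the first take is dropped because the later punch-in covers that window, while the events at 5.0 s and 30.0 s survive untouched.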

✅ Tips for great recordings

🎧

Use headphones while recording so the speaker audio doesn't bleed into your reaction time calibration. Wired is better — Bluetooth adds 50–200 ms of its own latency.

🔁

Record sections, not the whole song. Tackle the chorus first (it repeats), then verses. Keep segments short — a 30-second segment is easier to redo than a 3-minute one.

👄

Mouth on vowels, not consonants. Open on the vowel onset of each word; close during the consonant gap. This is how the original fish firmware worked and it looks the most natural.

🐟

Watch the fish, not the screen. After a rough pass, sit across the room, press Pi Playback, and judge it from the audience perspective. That's what matters.

⚙️

Retune before you re-record. If the mouth doesn't look right during playback, it's usually a PWM tuning issue, not a timing issue. Go back to the Soundboard sliders first.

Mechanical Motion & PWM

Wire colour guide, pin mapping, tuned PWM parameters, and the mechanical physics behind spring-return stall control. All values are live-tunable in the web controller.

Wiring Diagram

Motor	DOF	Motor Wire	DRV8833 Terminal	Logic Wire	Logic Signal	Signal Wire	Arduino (Legacy)	RPi GPIO	RPi Pin
#1 Body	Front −	Yellow	#2, Out 3	Brown	INT 4	Yellow	D3	25	P22
#1 Body	Front +	White	#2, Out 4	Red	INT 3	Orange	D11	24	P18
#2 Mouth	Mouth −	Black	#1, Out 3	Brown	INT 4	Grey	D5	27 ⚠	P13
#2 Mouth	Mouth +	Red	#1, Out 4	Red	INT 3	Purple	D6	17	P11
#3 Tail	Tail −	Black	#1, Out 1	Brown	INT 2	Blue	D9	23	P16
#3 Tail	Tail +	Orange	#1, Out 2	Red	INT 1	Green	D10	22	P15
GND		GND (shared)						GND	P6

⚠ GPIO 18 — HiFiBerry I2S Conflict

The Arduino schematic assigned Mouth − to Arduino D5 → GPIO 18 (Pin 12). GPIO 18 is the I2S BCLK clock line claimed by the HiFiBerry DAC+ADC driver. Connecting a motor signal here kills audio and prevents PWM from working. On the Pi, Mouth − is remapped to GPIO 27 (Pin 13). Only the PWM signal wire moves — the DRV8833 power rails stay unchanged. GPIOs 18, 19, 20, 21 must never be used for motor control.
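
A small sanity check along these lines could catch an accidental I2S pin assignment before any motor code runs. The BCM pin numbers come from the wiring table above; the names `MOTOR_PINS`, `I2S_RESERVED`, and `reserved_pin_conflicts` are illustrative.

```python
# BCM GPIO numbers from the wiring table: (minus pin, plus pin).
MOTOR_PINS = {
    "body": (25, 24),
    "mouth": (27, 17),
    "tail": (23, 22),
}
I2S_RESERVED = {18, 19, 20, 21}  # claimed by the HiFiBerry I2S interface


def reserved_pin_conflicts(pin_map):
    """Return {motor: [conflicting pins]} for pins that overlap I2S."""
    return {
        motor: sorted(set(pins) & I2S_RESERVED)
        for motor, pins in pin_map.items()
        if set(pins) & I2S_RESERVED
    }


print(reserved_pin_conflicts(MOTOR_PINS))            # remapped wiring
print(reserved_pin_conflicts({"mouth": (18, 17)}))   # legacy Arduino mapping
```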

DRV8833 H-Bridge Direction Logic

pwm > 0 → forward : IN1 = 0 %, IN2 = pwm %
pwm < 0 → reverse : IN1 = |pwm| %, IN2 = 0 %
pwm = 0 → coast : IN1 = 0 %, IN2 = 0 %

Each motor uses two GPIO pins. Mouth − is on GPIO 27 (remapped from GPIO 18 which is reserved for HiFiBerry I2S BCLK).
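
The direction logic above maps one signed duty value onto the two input pins. Here is a minimal sketch of that mapping; clamping to ±100 is an assumption added for safety, and `hbridge_duty` is an illustrative name, not the project's function.

```python
def hbridge_duty(pwm):
    """Map a signed duty (-100..100) to (IN1 %, IN2 %) duty cycles."""
    pwm = max(-100.0, min(100.0, float(pwm)))  # clamp (assumption)
    if pwm > 0:
        return (0.0, pwm)      # forward: IN1 = 0 %, IN2 = pwm %
    if pwm < 0:
        return (-pwm, 0.0)     # reverse: IN1 = |pwm| %, IN2 = 0 %
    return (0.0, 0.0)          # coast: both inputs low


print(hbridge_duty(96))    # mouth open drive
print(hbridge_duty(-85))   # mouth assisted close
print(hbridge_duty(0))     # coast
```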

Tuned PWM Parameters

Values below were found empirically by testing against real tracks on the BreadVolt 5 V supply. All parameters are live-adjustable from the web controller — changes take effect on the next motor event without restarting the server or regenerating JSON. Carrier frequency is 125 Hz for all three motors (see below).

Motor	Phase	Parameter	Value	Notes
Mouth	Drive	Open PWM	96 %	Overcomes jaw cam stiction quickly
	Drive	Open duration	120 ms	Drives jaw to full-open position
	Hold	Hold PWM	40 %	Minimum duty to stall against return spring
	Hold	Hold duration	300 ms	Sustained vowel window (energy-duration gated)
	Close	Rev PWM	85 %	Assisted close — spring + motor together
	Close	Rev duration / Coast	45 ms / 5 ms	Coast gap prevents current spike on direction reversal
Body	Kick	Kick PWM	55 %	Breaks static friction on body cam
	Kick	Kick duration	45 ms	Short impulse; transitions straight to travel
	Travel	Travel PWM	26 %	Sustains motion to end-of-travel at low current
	Travel	Travel duration	160 ms
	Hold	Hold PWM	26 %	Stall against spring at end-of-travel (sustained accent)
	Hold	Hold duration	275 ms	Release to coast; spring returns body to rest
Tail	Drive	Drive PWM	90 %	Tail needs high initial force (longer lever arm)
	Drive	Drive duration	130 ms
	Hold	Hold PWM	45 %	IOI-gated: fires on long inter-onset gaps ≥ 300 ms
	Hold	Hold duration	300 ms	70 % of IOI gap, capped at 500 ms

Hold / Sustain Classification

The pipeline classifies every motor event as either a short pulse or a longer hold before writing the motion JSON. The runtime dispatches them through different code paths — pulses fire a quick open→coast→close sequence; holds drive to end-of-travel, stall at a reduced duty cycle against the spring, then release. The hold PWM is the key tuned value: too low and the mechanism drifts back mid-hold; too high and current (and heat) accumulates unnecessarily.

Mouth & Body — Energy-Duration Method

At each onset, the algorithm walks forward through the normalised RMS envelope. It measures how long the envelope stays above a floor threshold (mouth: same as detection threshold; body: 50 % of that). If the duration exceeds the hold_threshold_ms (mouth: 300 ms, body: 275 ms), the event is classified as a hold and hold_ms is set to that measured duration, capped at a maximum (mouth: 800 ms, body: 600 ms). Events below threshold are pulses.
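
The energy-duration walk can be sketched as follows. The thresholds and caps are the values stated above; the frame hop `frame_ms`, the function name, and the sample envelopes are assumptions, since the page does not give the pipeline's frame size.

```python
def classify_energy_duration(envelope, onset_idx, floor, frame_ms,
                             hold_threshold_ms, max_hold_ms):
    """Classify one onset as pulse or hold from how long energy stays up."""
    i = onset_idx
    while i < len(envelope) and envelope[i] > floor:
        i += 1  # walk forward while the envelope stays above the floor
    duration_ms = (i - onset_idx) * frame_ms
    if duration_ms > hold_threshold_ms:
        return {"mode": "hold", "hold_ms": min(duration_ms, max_hold_ms)}
    return {"mode": "pulse"}


# Hypothetical envelopes with an assumed 20 ms frame hop, mouth settings.
sustained = [0.1] + [0.7] * 20 + [0.1]   # 400 ms above the floor
short = [0.7] * 5 + [0.1]                # 100 ms above the floor
print(classify_energy_duration(sustained, 1, 0.54, 20, 300, 800))
print(classify_energy_duration(short, 0, 0.54, 20, 300, 800))
```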

Tail — IOI (Inter-Onset Interval) Method

Tail events are driven by beat-onset detection rather than the speech envelope. Hold classification uses the gap to the next tail event: if that gap is ≥ 300 ms (tail_hold_ioi_ms), the current event becomes a hold. The hold duration is set to 70 % of the gap (giving the spring time to return) capped at 500 ms. This keeps the tail raised through long musical phrases rather than snapping back immediately after every beat.
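
The IOI rule translates directly into code. The sketch below uses the 300 ms gate, 70 % fraction, and 500 ms cap stated above. Treating the final onset (which has no following gap) as a pulse is an assumption, as are the function name and the `t` key.

```python
def classify_tail_events(onsets_s, hold_ioi_ms=300, hold_fraction=0.70,
                         max_hold_ms=500):
    """Mark each tail onset as pulse or hold from the gap to the next onset."""
    events = []
    for i, t in enumerate(onsets_s):
        event = {"t": t, "mode": "pulse"}
        if i + 1 < len(onsets_s):
            gap_ms = (onsets_s[i + 1] - t) * 1000.0
            if gap_ms >= hold_ioi_ms:
                event["mode"] = "hold"
                event["hold_ms"] = round(min(hold_fraction * gap_ms, max_hold_ms))
        events.append(event)
    return events


# Hypothetical tail onsets in seconds: gaps of 200 ms, 500 ms, and 1300 ms.
tail_events = classify_tail_events([0.0, 0.2, 0.7, 2.0])
for ev in tail_events:
    print(ev)
```

The 500 ms gap yields a 350 ms hold, while the 1300 ms gap would give 910 ms and is capped at 500 ms.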

Runtime dispatch: The mode field in the motion JSON ("pulse" or "hold") determines which function runs at playback time. Each DOF has a _pp_* pulse function and a _*_sustain_for(hold_ms) variant. Crucially, both only stop their own motor (never _stop_all()), so concurrent DOF movements survive. The frequency divider and DOF enable/disable checks run before dispatch, so they apply equally to pulse and hold events.
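
The pre-dispatch checks might look something like this. The page does not spell out the divider's exact behaviour, so this sketch assumes ÷N means "keep every Nth event, starting with the first" and that a disabled DOF receives no events. The function name `apply_divider` is hypothetical.

```python
def apply_divider(events, divider=1, enabled=True):
    """Filter one DOF's events before dispatch (assumed semantics)."""
    if not enabled:
        return []                  # DOF switched off: nothing fires
    if divider < 1:
        raise ValueError("divider must be >= 1")
    return events[::divider]       # keep every divider-th event


tail = [{"t": float(i), "mode": "pulse"} for i in range(6)]
print([e["t"] for e in apply_divider(tail, divider=2)])
print([e["t"] for e in apply_divider(tail, divider=3)])
print(apply_divider(tail, enabled=False))
```

Because the filter runs before the mode is inspected, pulse and hold events are thinned in exactly the same way.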

Why Spring Return Beats a Hard Stop

The Gemmy fish uses spring-return cam mechanisms on all three axes, not continuous-rotation or position-controlled servos. Understanding the spring's role is why the hold strategy works safely at all.

❌ Hard Stop (mechanical endstop)

When a DC motor drives into a hard mechanical endstop, shaft velocity drops to zero. Back-EMF, which is proportional to velocity, also drops to zero. With no back-EMF to limit current, the winding resistance alone determines the current draw — often 5–10× the running current. This causes rapid thermal buildup in the motor windings, and the static load puts full torque into the gearbox, risking gear tooth shear or cam binding. There is no stable equilibrium: the motor simply dissipates power as heat until something fails.

✓ Spring Return (stall against restoring force)

The return spring provides a continuously increasing restoring force as the mechanism approaches end-of-travel. At the hold PWM setpoint, motor torque exactly balances the spring's restoring force — a stable equilibrium. Any small perturbation is self-correcting: if the mechanism slips slightly backward, the motor torque now exceeds the spring force and re-drives it forward; if it over-travels, the spring pushes back harder than the motor. Back-EMF is non-zero (the mechanism oscillates slightly), limiting current naturally. On release (PWM = 0 / coast), the spring returns the mechanism to its rest position without any reverse motor command.

Practical implication: The hold PWM must be set just above the minimum duty that prevents the mechanism from drifting back under spring load. Set it too low and the mechanism creeps back (visible as a partial hold); set it too high and unnecessary current heats the coil. For the body and mouth motors on the BreadVolt 5 V supply, 26 % and 40 % respectively were found to be stable stall points.

PWM Carrier Frequency — Why Lower is Quieter

RPi.GPIO software PWM works by toggling GPIO pins on OS-scheduler ticks. The useful range is roughly 125 Hz – 10 kHz; above ~10 kHz the OS jitter makes pulse widths unreliable. All three motors default to 125 Hz.
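
To get a feel for the timescales involved, the period and on-time at a given carrier follow directly from period = 1 / frequency. The short sketch below uses the carrier options offered in the web controller and the 40 % mouth hold duty; the helper name `pwm_timing_ms` is illustrative.

```python
def pwm_timing_ms(freq_hz, duty_pct):
    """Return (period_ms, on_time_ms) for a carrier frequency and duty cycle."""
    period_ms = 1000.0 / freq_hz
    return period_ms, period_ms * duty_pct / 100.0


for freq in (125, 250, 1000, 2000):
    period, on_time = pwm_timing_ms(freq, 40)  # 40 % = mouth hold PWM
    print(f"{freq:>5} Hz: period {period:.2f} ms, on-time {on_time:.2f} ms")
```

At 125 Hz the period is 8 ms, so scheduler jitter of a fraction of a millisecond is a small share of each pulse. At 2 kHz the period is 0.5 ms, so the same jitter is a much larger share.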

Inductance smoothing

Motor windings are inductors. At low carrier frequency the on-time is long, so the coil has time to build current smoothly. At high frequency the short pulses cause rapid current rise and fall (high di/dt), creating switching noise and magnetic ripple at the carrier frequency.

Buck converter resonance

The BreadVolt and any upstream supply contain LC output filters. High-frequency PWM switching transients couple onto the supply rail and can excite the LC filter's resonant frequency, producing audible coil whine whose pitch tracks the PWM carrier — particularly loud on the body motor due to lower winding inductance and larger current transients.

125 Hz as the sweet spot

At 125 Hz the carrier sits near the low end of the audible range (~20 Hz–20 kHz), where hearing is far less sensitive than in the 1–5 kHz region, so the carrier is much less noticeable as a tone. The long PWM period gives the motor inductance maximum time to integrate each pulse into smooth average torque. Empirically, switching noise was lowest at 125 Hz across all three motors on the 5 V battery supply; carriers from 1 kHz to 10 kHz all produced varying degrees of audible whine.

Hardware fix still recommended

A 470–1 000 µF bulk capacitor placed directly across the DRV8833 VM / GND supply pins would further reduce supply-coupled switching noise by providing local charge storage — smoothing the current spikes that the motor driver draws from the supply rail on each PWM transition. This hardware modification has not yet been applied to the prototype.

Results

Billy Bass in action — three songs, fully animated mouth, body, and tail driven by the OpenMouth pipeline. Recorded on the Raspberry Pi with the HiFiBerry DAC+ ADC for audio output.

🌟

All Star

Smash Mouth · Astro Lounge (1999)

💰

Gold Digger

Kanye West ft. Jamie Foxx · Late Registration (2005)

🤖

Cruel Angel's Thesis

Yoko Takahashi · Neon Genesis Evangelion OST (1995)
