This project reimagines the classic 1998 Big Mouth Billy Bass for 2026: a smartphone-controlled animatronic fish that plays soundboard clips, speaks typed phrases, or delivers lightweight GPT responses, with mouth, body, and tail motion synchronized through a custom audio-to-motion tuning pipeline.
Phone buttons trigger stored WAV or MP3 clips on the Pi. This is the fastest path to a reliable demo and the best first build target.
Typed phrases are converted to speech, saved as audio, analyzed for mouth motion, and played through the fish speakers.
The Pi connects through the phone hotspot, calls a lightweight GPT model, converts the response to speech, and animates the fish from the generated waveform.
The phone is both the remote control and the internet source. The Pi remains the local brain responsible for audio playback, motion extraction, and motor control.
| Subsystem | Selected hardware |
|---|---|
| Compute | Raspberry Pi 4B, 2GB |
| Audio I/O | HiFiBerry DAC+ ADC, primarily used for DAC output in this architecture |
| Motor control | Two DRV8833 dual H-bridge motor drivers with fault/current-protection capability |
| Motors | Existing Gemmy fish mouth, front body, and tail DC motors |
| Speakers | Two lightweight powered speakers or compact amplifier + passive speakers |
| Power | USB-C PD power bank for Pi/audio; separate small battery pack for fish motors |
| Structure | 80/20 aluminum extrusion frame for portable mounting and cable management |
| Layer | Downselected choice |
|---|---|
| Web server | FastAPI for phone UI routes, status, and future WebSocket control |
| Laptop simulation | Anaconda environment with ASCII digital twin, MP3/WAV input, and motion JSON export |
| Audio engine | ALSA-backed WAV playback through the HiFiBerry output |
| TTS | Piper as the default local TTS path; cloud TTS as an optional premium path |
| GPT | Lightweight cloud model accessed through phone hotspot internet |
| Motor API | gpiozero first; pigpio fallback if PWM timing jitter becomes visible |
| Startup | systemd service to boot directly into fish-control mode |
The build treats soundboard clips, TTS, and GPT speech as the same downstream object: an audio file with a motion timeline.
Two dual H-bridge DRV8833 boards provide four motor outputs. Three are used for the fish, leaving one spare channel for future lighting, a second mouth action, or an auxiliary prop. Each axis is driven as a short pulse into the original spring-return cam mechanism, not as a continuously held servo axis.
The stock Gemmy axes are not servo-controlled. Each motor is treated as a small DC actuator pushing a cam or linkage into motion, then released so the original spring can return the mechanism home.
| Protection rule | Design intent |
|---|---|
| Pulse, do not hold | Command short forward pulses only. Avoid continuous drive into mechanical endstops. |
| Coast after each pulse | Set both DRV8833 inputs low after each command so the motor is off and the spring can return the axis. |
| Max-on watchdog | Every motor command is clamped in software even if a higher-level animation request is wrong. |
| Minimum off-time | Allow mechanical recovery before the next pulse, especially for the mouth axis during speech. |
| Fault-aware control | Use DRV8833 nFAULT, if exposed, as a diagnostic signal. A fault means the pulse/PWM/current limit is too aggressive. |
The DRV8833 improves the electrical safety envelope, but it does not turn the fish into a closed-loop servo. Current protection is a backstop; safe pulse choreography is the primary protection.
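The protection rules above can be sketched as a small software guard that sits between the animation layer and the motor driver. This is a minimal illustration, not the project's runtime: the class name `AxisGuard` and the two limit values are assumptions chosen for the example, and real limits would come from bench testing.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration; tune on the real mechanism.
MAX_ON_MS = 150   # max-on watchdog: longest pulse ever allowed
MIN_OFF_MS = 80   # minimum off-time between pulses on one axis


@dataclass
class AxisGuard:
    """Remembers when an axis was last released so off-time can be enforced."""
    last_release_ms: float = float("-inf")

    def request(self, now_ms: float, pulse_ms: float) -> float:
        """Return the pulse length actually allowed (0.0 means rejected)."""
        if pulse_ms <= 0:
            return 0.0
        if now_ms - self.last_release_ms < MIN_OFF_MS:
            return 0.0                       # axis still recovering: skip
        allowed = min(pulse_ms, MAX_ON_MS)   # clamp whatever the caller asked
        self.last_release_ms = now_ms + allowed  # axis coasts after the pulse
        return allowed
```

Every motor command would pass through `request()` first, so a wrong animation request degrades into a shorter or skipped pulse rather than a stalled motor.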
Install Raspberry Pi OS, configure the HiFiBerry overlay, verify audio playback through the RCA output, and confirm the speaker path.
Build a minimal FastAPI page with three buttons that trigger known local WAV files from the phone browser.
Connect only the mouth motor through one DRV8833 channel. Use short, conservative pulse tests before attempting audio-following motion.
Compute RMS amplitude from the WAV file, detect speech onsets, and map them into short mouth pulses with smoothing, thresholds, clamp limits, and mandatory off-time.
Use phrase starts, beat estimates, or simple timed accents to trigger the body and tail motors without making the fish look overactive.
Start with Piper or cached phrases, then add phone-hotspot GPT mode once the audio and motion systems are reliable.
| Risk | Mitigation |
|---|---|
| Motor noise corrupts audio/Pi power | Use separate motor battery, common ground, short motor wiring, and bulk capacitance near drivers. |
| Mouth motion lags speech | Precompute motion timelines from WAV files and start audio + motor playback from one monotonic clock. |
| PWM jitter | Prototype with gpiozero, then switch the motor layer to pigpio if visible jitter appears. |
| GPT mode needs internet | Use the phone hotspot as uplink. Keep soundboard and local TTS usable without cloud access. |
| DC motors are not position-controlled | Treat each motor as a pulse-driven cam mechanism. Enforce max-on time, minimum off-time, and spring-return recovery. |
| Endstop stall damages pinions/cams | Use DRV8833 current/fault protection as a safety net, but rely primarily on software pulse limits and conservative PWM. |
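The "one monotonic clock" mitigation can be sketched as a scheduler that computes every deadline from a single start time instead of chaining relative sleeps. The function name `run_timeline` and the callback parameters are assumptions for illustration; the clock and sleep functions are injectable so the logic can be tested without hardware.

```python
import time
from typing import Callable, Iterable, Tuple

Event = Tuple[float, str]  # (seconds from clip start, motor name)


def run_timeline(events: Iterable[Event],
                 fire: Callable[[str], None],
                 start_audio: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Start audio, then fire each event at its offset from one shared t0."""
    t0 = clock()
    start_audio()
    for t_event, motor in sorted(events):
        delay = (t0 + t_event) - clock()
        if delay > 0:
            sleep(delay)  # wait for the absolute deadline, not a relative gap
        fire(motor)
```

Because each delay is recomputed against `t0`, a slow motor call delays only its own event and does not push every later event back.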
Before the Raspberry Pi arrives, develop the motion algorithm on macOS as an offline simulator. The working development loop is: audio file → speech-band analysis → safe motor pulse timeline → ASCII fish digital twin → exported motion JSON.
Use macOS text-to-speech to create a deterministic first test clip. This avoids debugging the algorithm with noisy or compressed source audio.
The simulator uses macOS afplay by default for cleaner playback while Matplotlib animates. If audio clips or crackles, reduce playback gain.
| Preset | Use |
|---|---|
| mouth_only | Safest first-pass algorithm mode. Only the mouth axis receives pulse events. |
| gentle | Conservative mouth motion with sparse body and tail accents. Best default. |
| animated | More lively puppet-like behavior. Useful for showmanship, but more aggressive. |
The output JSON is the contract between the offline algorithm and the future Raspberry Pi motor runtime.
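As a rough picture of what such a contract might look like, the sketch below round-trips a timeline through JSON. The field names (`t`, `motor`, `pulse_ms`, `pwm`) and the top-level keys are assumptions for illustration; the actual schema is whatever the exporter writes.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class MotionEvent:
    t: float        # seconds from clip start (assumed field name)
    motor: str      # "mouth", "body", or "tail"
    pulse_ms: int   # drive time before the axis is released
    pwm: float      # duty cycle, 0.0 to 1.0 (assumed representation)


def dump_timeline(audio_file: str, events: List[MotionEvent]) -> str:
    """Serialize a timeline; events are sorted so a runtime can stream them."""
    ordered = sorted(events, key=lambda e: e.t)
    return json.dumps({"audio": audio_file,
                       "events": [asdict(e) for e in ordered]}, indent=2)


def load_timeline(text: str) -> List[MotionEvent]:
    """Parse the JSON back into MotionEvent objects."""
    return [MotionEvent(**e) for e in json.loads(text)["events"]]
```

Keeping the file a flat, time-sorted event list means the Pi runtime needs no audio analysis libraries to play it back.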
Use Jupyter for inspecting variables, event counts, and thresholds. Use Terminal for the most reliable synchronized audio + animation.
Inside a notebook, select Python (openmouth) as the kernel, then either run the script directly:
Or import the analyzer for interactive tuning:
| Symptom | Adjustment |
|---|---|
| Mouth flaps too often | Increase threshold percentile, onset delta, or mouth minimum gap. |
| Mouth misses syllables | Lower threshold percentile or onset delta. |
| Motion looks twitchy | Increase envelope smoothing or mouth minimum gap. |
| Motion looks sluggish | Decrease smoothing or mouth minimum gap. |
| Pulse events seem too aggressive | Lower pulse width and PWM ranges before testing hardware. |
The first milestone is not conversational AI. The first milestone is a robust, portable, phone-controlled fish that can play one audio clip and flap its mouth convincingly. Once that is stable, TTS and GPT become input modes rather than architectural risks.
The goal of the first bring-up milestone is intentionally narrow:
Download Raspberry Pi Imager on your laptop and insert a 32–64 GB microSD card.
Choose Raspberry Pi OS Lite (64-bit). Lite is preferred because this project does not need a desktop environment.
In the advanced settings menu, enable SSH, configure your Wi-Fi network or phone hotspot SSID/password, and set a hostname such as openmouth.
Write the image, insert the card into the Pi, connect Ethernet or Wi-Fi, and power the Pi from the USB-C PD battery.
Update the system:
Never attach the HiFiBerry board while the Pi is powered.
Mount the DAC+ ADC onto the Pi GPIO header carefully and verify all pins align correctly.
Use RCA or RCA-to-3.5 mm cables into powered speakers.
Edit the Raspberry Pi boot config:
Add these lines near the bottom:
Disable onboard audio:
Reboot:
After reboot:
You should see a HiFiBerry audio device listed.
Create app.py:
Place a small WAV file inside:
Connect your phone to the same network as the Pi, usually your phone hotspot.
Navigate to http://openmouth.local:5000 or the Pi IP address.
You should hear audio through the speakers connected to the HiFiBerry board.
After audio and phone UI are stable, test only one motor channel first. The goal is to find the minimum pulse that creates visible motion without audible buzzing at the hard stop.
Use coast mode as the default release state:
| Problem | Likely cause |
|---|---|
| No sound | Wrong audio output selected, onboard audio not disabled, or powered speakers not enabled. |
| FastAPI inaccessible from phone | Phone and Pi not on same network, firewall issue, or wrong IP. |
| Audio stutters | Weak power supply or USB battery incapable of stable Pi 4 current delivery. |
| HiFiBerry not detected | Overlay typo, improper seating on GPIO header, or reboot not performed. |
| Wrong audio device | HDMI audio still selected instead of HiFiBerry ALSA device. |
At this point, the system should support:
Smartphone browser → FastAPI button → WAV playback → HiFiBerry → powered speakers.
Do not add motors yet. Verify audio stability and phone UI reliability first.
The offline laptop prototype is built as a proper Python library (openmouth/) with a strict boundary between Pi-safe signal processing and Jupyter-only visualisation tools. Five Jupyter notebooks walk through the full workflow from raw audio to validated motion JSON.
openmouth/ is Pi-safe (numpy/scipy/librosa only). The ASCII twin imports it without circular dependency.
| Notebook | Purpose | Key output |
|---|---|---|
| 01_audio_analysis | Inspect the raw signal chain — waveform, bandpass filter, RMS envelope stages | Visual understanding of why speech-band filtering matters |
| 02_onset_tuning | Interactive ipywidgets sliders for threshold, smoothing, gap, onset_delta | Tuned parameter set for a specific audio file |
| 03_motion_timeline | Stacked timeline plot — waveform, envelope, per-motor event bars (width = pulse_ms, opacity = PWM) | Visual QA before export |
| 04_export_and_validate | Safety validation then export to .motion.json | Pi-ready motion file + event density chart |
| 05_ascii_twin | ASCII fish animates in Jupyter cell in sync with afplay audio | Real-time sanity check of the full motor timeline |
Six presets ship with the library. Each is a Preset dataclass instance registered in PRESETS. batch_tune.py selects one automatically by matching filename keywords.
| Preset | threshold | smooth | mouth_gap | Best for |
|---|---|---|---|---|
| hiphop | 0.54 | 4 | 0.13 s | Default — rap, hip-hop, compressed pop vocals |
| pop | 0.45 | 7 | 0.10 s | Sustained pop vocals (All Star, I Will Survive, This Love) |
| jpop | 0.32 | 5 | 0.10 s | Dynamic J-pop / anime (Cruel Angel's Thesis) |
| ballad | 0.22 | 9 | 0.12 s | Slow ballads and acoustic tracks with wide dynamics |
| edm | 0.50 | 3 | 0.08 s | Electronic / dance — very fast transient response |
| speech | 0.18 | 6 | 0.09 s | Speech-heavy tracks, podcasts, McDonald's jingle |
A maximally compressed pop track. The speech-band envelope stays above 0.20 for ~80% of the clip, so at the default gentle threshold of 0.18 it rarely crosses upward, producing only ~0.5 mouth events/second. Raising the threshold to 0.45 puts it in the zone where the envelope fluctuates with syllable energy.
| Parameter | Value | Rationale |
|---|---|---|
| threshold | 0.45 | Envelope stays high; need a higher crossing point |
| smoothing_frames | 7 | Standard — keeps envelope clean |
| body_min_gap_s | 0.50 s | Body bobs at ~104 BPM beat rate |
| tail_min_gap_s | 1.00 s | One sweep per bar |
| onset_delta | 0.07 | Standard onset sensitivity |
A traditionally mastered J-pop track with real dynamic range. The envelope only exceeds 0.45 for ~31% of the clip, so pop_song starves the vocal sections. The first ~20 s is the instrumental organ intro — fewer events there is correct and expected.
| Parameter | Value | Rationale |
|---|---|---|
| threshold | 0.32 | Peak crossing zone for this song's dynamic range |
| smoothing_frames | 5 | Lighter — crisper syllable tracking |
| body_min_gap_s | 0.55 s | ~130 BPM; slightly sparser body bobs |
| tail_min_gap_s | 1.00 s | One sweep per bar |
| onset_delta | 0.06 | More sensitive to J-pop percussion transients |
Top row: raw waveform amplitude. Bottom row: normalized speech-band envelope (orange fill), threshold (red dashed), mouth events (orange lines), body events (purple, bottom 40%), tail events (teal, bottom 25%).
Each curve shows the percentage of time the normalized envelope exceeds a given threshold. All Star (orange) stays high everywhere — a direct result of heavy pop compression. Cruel Angel's Thesis (purple) has a steeper drop-off, reflecting its traditional dynamic range. The shaded bands mark each song's tuned threshold, sitting at the inflection point where crossings are most meaningful.
Mouth, body, and tail events distributed across each 60-second clip. The low count in the first two windows of Cruel Angel's Thesis is intentional — those are the instrumental intro bars before the vocalist enters. Both songs maintain consistent density across their vocal sections.
The key diagnostic is the envelope distribution: check what percentage of time the envelope exceeds various thresholds, then pick the value at the inflection point where the curve bends steeply — that is where crossings are most responsive to actual vocal energy.
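The envelope distribution check can be sketched in a few lines. This assumes the envelope is already normalized to 0..1 and held as a plain sequence of frame values; the function name `fraction_above` is illustrative and is not taken from the library.

```python
from typing import Dict, Sequence


def fraction_above(envelope: Sequence[float],
                   thresholds: Sequence[float]) -> Dict[float, float]:
    """Map each threshold to the fraction of frames strictly above it."""
    n = len(envelope)
    if n == 0:
        return {th: 0.0 for th in thresholds}
    return {th: sum(1 for v in envelope if v > th) / n for th in thresholds}


envelope = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 0.2, 0.6]
table = fraction_above(envelope, [0.2, 0.45])
```

For this toy envelope, 75% of frames sit above 0.2 and 62.5% above 0.45. Sweeping a finer grid of thresholds and looking for where the fraction drops fastest gives the inflection point described above.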
| Symptom | Diagnosis | Fix |
|---|---|---|
| Too few mouth events (<0.8/s) | Threshold too high for this song's dynamic range | Lower threshold toward the envelope's inflection point |
| Too many mouth events (>3.5/s) | Threshold too low — firing on background energy | Raise threshold or increase smoothing_frames |
| Events bunched, then silent | Uneven dynamics (intro vs chorus) | Use norm_percentile=95 for a more aggressive ceiling, or clip to the vocal section only |
| Events feel jittery / chattery | Smoothing too light for this style | Increase smoothing_frames (5 → 7 → 10) |
| Syllables blurring together | Smoothing too heavy | Decrease smoothing_frames; increase mouth_min_gap_s slightly |
| Body / tail too active | Onset detector too sensitive | Increase onset_delta (0.06 → 0.08 → 0.10) |
The mouth and the body/tail motors are driven by two completely independent audio features. Understanding the difference is the key to tuning them independently.
The audio is bandpass-filtered (≈80–3000 Hz, the speech band) to remove low rumble and high noise. The RMS energy is computed in short overlapping windows (~23 ms), smoothed by a moving average (smoothing_frames), and normalized so the loudest moment in the track equals 1.0.
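The envelope stages after the bandpass filter can be sketched in pure Python. This is an illustration only: it assumes the samples are already band-limited, the function names are hypothetical, and the library itself works on numpy arrays. For instance, a ~23 ms window corresponds to 512 samples at 22.05 kHz.

```python
import math
from typing import List, Sequence


def rms_envelope(samples: Sequence[float], win: int, hop: int) -> List[float]:
    """RMS energy of overlapping windows (hop < win gives the overlap)."""
    if not samples:
        return []
    out = []
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        frame = samples[start:start + win]
        out.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return out


def smooth(env: Sequence[float], frames: int) -> List[float]:
    """Centered moving average; edge frames average over what exists."""
    half = frames // 2
    out = []
    for i in range(len(env)):
        window = env[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(env: Sequence[float]) -> List[float]:
    """Scale so the loudest frame equals 1.0 (silence is left unchanged)."""
    peak = max(env, default=0.0)
    return [v / peak for v in env] if peak > 0 else list(env)
```

Chaining them as `normalize(smooth(rms_envelope(samples, win, hop), smoothing_frames))` yields the 0..1 envelope that the threshold is compared against.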
The mouth fires on upward crossings of threshold — the exact frame where the envelope rises from below to above the value. A sustained loud note produces exactly one event at the moment it starts, not a continuous stream.
After all crossings are collected, a gap filter (mouth_min_gap_s) discards any crossing that arrives too soon after the previous one, preventing rapid double-fires on staccato syllables.
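Upward-crossing detection followed by the gap filter can be sketched as two small functions. The names and the `frame_s` parameter (seconds per envelope frame) are assumptions for illustration.

```python
from typing import List, Sequence


def upward_crossings(env: Sequence[float], threshold: float,
                     frame_s: float) -> List[float]:
    """Times (s) where the envelope goes from <= threshold to > threshold."""
    return [i * frame_s for i in range(1, len(env))
            if env[i - 1] <= threshold < env[i]]


def gap_filter(times: Sequence[float], min_gap_s: float) -> List[float]:
    """Keep a time only if it is at least min_gap_s after the last kept one."""
    kept: List[float] = []
    for t in times:
        if not kept or t - kept[-1] >= min_gap_s:
            kept.append(t)
    return kept


env = [0.1, 0.6, 0.7, 0.2, 0.5, 0.1, 0.9]
crossings = upward_crossings(env, threshold=0.45, frame_s=0.25)
mouth_times = gap_filter(crossings, min_gap_s=0.6)
```

The step from 0.6 to 0.7 produces no event because the envelope was already above the threshold, which is the "one event per sustained note" behavior described above.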
Each crossing becomes a MotionEvent with a randomly drawn pulse_ms (e.g. 60–115 ms) and duty cycle. On the physical fish the motor drives for exactly that many milliseconds, then power cuts and a spring returns the jaw to closed. Duration of the open state is controlled entirely by pulse_ms and the spring — there is no separate "close" command.
Body and tail events are driven by a completely different signal: librosa's spectral flux onset detector, which measures how quickly the frequency content changes. It is sensitive to percussive transients — drum hits, consonant bursts, strong beat attacks — not to sustained loudness.
onset_delta sets the minimum strength a local peak must exceed its local average. Lower delta → more onsets detected → denser body/tail events. Both motors draw from the same onset pool; what differentiates them is only the gap filter applied afterward.
For example, from a shared onset pool of [0.1, 0.3, 0.5, 0.9, 1.1, 1.8] s, the body gap filter keeps [0.1, 0.9, 1.8] and the sparser tail gap filter keeps [0.1, 1.1].
Like the mouth, each event fires for a randomly drawn pulse_ms and returns to rest when power cuts. The fin position (up vs. down) is a deterministic alternation in the digital twin; on the physical fish both fins have a single motor each, so direction is not controllable, only duration and duty cycle.
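The shared-pool behavior can be checked with a short sketch that applies one minimum-gap filter per motor to the same onset list. The function name `split_onsets` is hypothetical, and the gap values of 0.5 s (body) and 1.0 s (tail) are assumed here because they match the preset tables elsewhere in this document.

```python
from typing import Dict, List, Sequence


def split_onsets(onsets: Sequence[float],
                 gaps: Dict[str, float]) -> Dict[str, List[float]]:
    """Apply one minimum-gap filter per motor to the same onset pool."""
    result: Dict[str, List[float]] = {}
    for motor, min_gap in gaps.items():
        kept: List[float] = []
        for t in sorted(onsets):
            if not kept or t - kept[-1] >= min_gap:
                kept.append(t)
        result[motor] = kept
    return result


pool = [0.1, 0.3, 0.5, 0.9, 1.1, 1.8]
events = split_onsets(pool, {"body": 0.5, "tail": 1.0})
```

With these gaps the body keeps 0.1, 0.9, and 1.8 s while the tail keeps only 0.1 and 1.1 s, so the tail is always a sparser subset of the same rhythmic material.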
Mouth and body/tail are driven by entirely independent features of the audio. A quiet verse with active drums can produce dense body/tail events with a completely closed mouth. A sustained loud note can trigger the mouth with no body or tail activity. Tuning them never conflicts: adjust threshold and mouth_min_gap_s for the mouth; adjust onset_delta and the two gap parameters for body and tail.
threshold is relative to the track's loudest moment after normalization. The envelope distribution table printed by tune_song.py tells you what percentage of the clip sits above each candidate value — use it to find the inflection point.
| Scenario | What you see | Threshold guidance |
|---|---|---|
| Heavily compressed pop (All Star) | Envelope above 0.20 for ~80% of the track | Need a high threshold (≈0.45–0.54) to catch only syllable peaks |
| Dynamic J-pop (Cruel Angel) | Envelope above 0.45 for only ~31% of the track | Lower threshold (≈0.30–0.35) to avoid starving vocal moments |
| Instrumental section firing | Mouth opens when nobody is singing | Raise threshold, or drop the --no-vad flag to re-enable VAD gating |
| Vocals barely trigger mouth | Mouth stays closed through singing | Lower threshold; also check smoothing_frames isn't washing out syllable peaks |
smoothing_frames sets the moving-average window on the RMS envelope before threshold comparison. Larger windows blur rapid syllables into a single sustained lump; smaller windows track individual syllables but also track noise.
| Parameter | Lower value | Higher value |
|---|---|---|
| smoothing_frames | Crisp per-syllable tracking (fast speech, J-pop) | Smooth phrase-level shapes (slow ballads, instruments) |
| mouth_min_gap_s | More events; can blur adjacent syllables together | Fewer events; sparser but cleaner open-close cycles |
| onset_delta | More body/tail events; tracks subtle transients | Fewer events; only strong beats and accents fire |
| body_min_gap_s | Body bobs at beat rate | Body bobs at bar rate |
| tail_min_gap_s | Tail moves more frequently (approaching body density) | Tail sweeps slowly, only on the strongest beats |
batch_tune.py is the primary workflow for generating motion files. Run it once from openmouth-audio-motion/ to process every audio file in sounds/ and write .motion.json files to outputs/. The SoundPond web app picks up the results immediately — no server restart required.
| Keyword in filename | Preset selected |
|---|---|
| mcdonalds, speech, podcast | speech |
| cruelangel, jpop, anime | jpop |
| lionking, musical, broadway | pop |
| survive, thislove | pop |
| ballad, acoustic, slow | ballad |
| edm, electronic, techno | edm |
| (anything else) | hiphop |
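The keyword table can be read as a first-match lookup. The sketch below is one plausible way to implement it, not the actual batch_tune.py logic: the normalization step (lowercasing and stripping non-alphanumeric characters) and the names `KEYWORD_PRESETS` and `pick_preset` are assumptions.

```python
import re

# Keyword rows copied from the table above; first match wins, in this order.
KEYWORD_PRESETS = [
    (("mcdonalds", "speech", "podcast"), "speech"),
    (("cruelangel", "jpop", "anime"), "jpop"),
    (("lionking", "musical", "broadway"), "pop"),
    (("survive", "thislove"), "pop"),
    (("ballad", "acoustic", "slow"), "ballad"),
    (("edm", "electronic", "techno"), "edm"),
]
DEFAULT_PRESET = "hiphop"


def pick_preset(filename: str) -> str:
    """Lowercase the name, drop non-alphanumerics, then substring-match."""
    key = re.sub(r"[^a-z0-9]", "", filename.lower())
    for keywords, preset in KEYWORD_PRESETS:
        if any(k in key for k in keywords):
            return preset
    return DEFAULT_PRESET
```

Substring matching is simple but can misfire on unrelated names that happen to contain a keyword such as "slow", so it is worth checking which preset each file received after a batch run.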
All events are pure discrete pulses. The timeline builder applies no hold/sustain classification, no tempo scaling, no anticipation offsets, and no phrase-boundary suppression — features that were found to cause gear stalling and unnatural motion. Each crossing or onset maps directly to one MotionEvent.
| Motor | Source | Mode | Spacing enforced by |
|---|---|---|---|
| Mouth | Upward envelope crossings of threshold | pulse only | mouth_min_gap_s |
| Body | Broadband librosa onset detector | pulse only | body_min_gap_s |
| Tail | Same broadband onset pool (sparser gap) | pulse only | tail_min_gap_s |
The fish digital twin is rendered in a Jupyter notebook to speed up tuning. Here are two real songs rendered through the full pipeline: audio analysis → motion timeline → image-frame animation → MP4. Each clip shows 30 seconds of the digital twin animating with all 8 body states active.
Every animation frame is one of these 8 images. The pipeline picks the correct image at each timestamp based on which motors are currently active. Front fin and tail fin each have two states (flat / flapped); mouth has two states (closed / open) — giving 2 × 2 × 2 = 8 combinations.
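Frame selection can be sketched as a lookup from the set of currently active motors to one of the eight images. The bit ordering, the event tuple layout, and both function names are assumptions for illustration; the real pipeline may index its images differently.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[float, str, float]  # (start time s, motor name, pulse_ms)


def active_motors(events: Iterable[Event], t: float) -> Set[str]:
    """Motors whose pulse window [start, start + pulse_ms) contains time t."""
    return {motor for start, motor, pulse_ms in events
            if start <= t < start + pulse_ms / 1000.0}


def frame_index(active: Set[str]) -> int:
    """Pack three on/off states into 0..7 (bit 0 mouth, bit 1 body, bit 2 tail)."""
    return (int("mouth" in active)
            | int("body" in active) << 1
            | int("tail" in active) << 2)


events = [(1.0, "mouth", 100), (1.05, "tail", 200)]
idx = frame_index(active_motors(events, 1.06))
```

At 1.06 s both the mouth and tail pulses are in progress, so the index is 5 (mouth bit plus tail bit); with nothing active the index is 0, the rest pose.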
Clip: 00:10 – 00:40 · Preset: hiphop · Mode: full (all 8 states)
| Parameter | Value |
|---|---|
| Smoothing frames | 4 |
| Threshold | 0.54 |
| Mouth min gap | 0.13 s |
| Body min gap | 0.50 s |
| Tail min gap | 1.00 s |
| Onset delta | 0.07 |
All Star is maximally compressed: the RMS envelope stays above 0.20 for ~80% of the track. A high threshold of 0.54 sits at the inflection point where the envelope actually fluctuates with lyric energy, preventing the mouth from firing constantly on flat sections. Body bobs at beat rate (~2/s), tail sweeps at bar rate (~1/s).
Clip: 00:25 – 00:55 · Preset: jpop · Mode: full (all 8 states)
| Parameter | Value |
|---|---|
| Smoothing frames | 5 |
| Threshold | 0.32 |
| Mouth min gap | 0.10 s |
| Body min gap | 0.55 s |
| Tail min gap | 1.00 s |
| Onset delta | 0.06 |
J-pop has real dynamic range unlike maximally-compressed Western pop — the envelope only exceeds 0.45 for ~31% of the track. Using pop_song's threshold of 0.45 would starve the vocals. A lower threshold of 0.32 with lighter smoothing (5 frames) preserves crisp Japanese syllable articulation. Body bobs at ~130 BPM beat structure.
Same 60-second clip processed through multiple presets — event counts per motor illustrate how preset parameters shape animation density.
| Preset | Song character | Mouth | Body | Tail | Total / 60s |
|---|---|---|---|---|---|
| hiphop | Hip-hop / rap, sustained | 80 | 97 | 61 | 238 |
| jpop | J-pop / anime, dynamic range | 102 | 61 | 33 | 196 |
| pop_song | Western pop, compressed | 307 | 197 | 140 | 644 |
| animated | High-energy / expressive | 196 | 205 | 205 | 606 |
Event counts from the full Cruel Angel's Thesis track (252s) across presets. Higher isn't always better — the goal is matching the song's rhythmic character, not maximizing raw count.
FastAPI server running on the Raspberry Pi at http://openbass.local:8000.
Any device on the same Wi-Fi can open the page. SoundPond is a SoundCloud-inspired interface
that auto-discovers every audio file in sounds/ and displays waveform previews with motion-sync status.
Audio files in sounds/ are discovered on startup. Drop a file in, restart, it appears. Supported formats: WAV · MP3 · FLAC · AIFF · M4A.
When a track has a .motion.json, Play fires audio + all three motors in sync.
| Method | Path | Description |
|---|---|---|
| GET | /sounds | List all discovered tracks with label, filename, and has_motion flag |
| GET | /waveform/{stem} | 200-bar RMS waveform + duration for any audio file. Cached after first decode. Supports WAV (stdlib), librosa, or ffmpeg fallback. |
| GET | /play?file={name} | Play audio + fire motor timeline if .motion.json exists. Kills any current playback first. |
| GET | /status | Returns {"playing", "elapsed", "duration"} — polled every second to drive the waveform scrubber. |
| GET | /stop | Kill all playback and reset motor state immediately. |
| GET | /tts?text={phrase} | Synthesize speech via Piper TTS, compute motion timeline on the fly, play with mouth sync. |
| GET | /set_pwm | Live-update any motor parameter (drive_pwm, pulse_ms, coast_ms, freq, divider). No restart required. |
/record
Segment-based punch-in recorder. Play audio on the Pi, hold a key to fire each DOF, release to save the event. Each take is stored as a time-stamped segment; overlapping takes use a newer-wins compile strategy. Bake to a .motion.json when satisfied.
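Auto body & tail pulls body and tail events from the existing .motion.json during Pi Playback. Focus entirely on mouth; the fish moves naturally on its own.
Bake to JSON writes the result to outputs/{stem}.motion.json. The soundboard picks it up instantly, with no restart needed.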
The auto-generated motion pipeline is a strong starting point, but for songs where the beat is irregular, the lyrics are dense, or you simply want a specific performance character,
the Recording Studio lets you hand-craft every movement. This page explains the full workflow from audio file to finished .motion.json.
Each phase builds on the last. You can stop after any phase and still get a working performance.
Run generate_motion.py to produce a baseline .motion.json from the audio envelope. Body and tail follow the beat; mouth follows speech-band energy.
Copy your WAV, MP3, FLAC, or M4A into openmouth-audio-motion/sounds/. Restart the server (or let it auto-reload if it was started with --reload). The track appears immediately in both the Soundboard and Recording Studio selectors.
From the project root: python generate_motion.py "Track Name". This writes outputs/Track Name.motion.json. The Soundboard will now show the ⚡ badge and fire all three DOFs when you play the track. This is your baseline — every subsequent step refines it.
Play the track on the Soundboard and watch the fish. Adjust these sliders until the mechanics feel right:
| Parameter | What it controls | Start point |
|---|---|---|
| Mouth Drive PWM | Opening speed & force. Too low → mouth stalls. Too high → impact noise. | 96% |
| Mouth Drive ms | How long the drive pulse lasts before switching to hold. Longer = wider open. | 120 ms |
| Mouth Hold PWM | Minimum duty cycle to keep mouth open against the spring. Just above stall point. | 40% |
| Mouth Close PWM | Reverse pulse strength for the active-close phase. | 85% |
| Body Kick PWM | Initial burst to overcome static friction before travel phase. | 71% |
| Body Travel PWM | Sustained travel duty cycle after the kick. | 47% |
| Body Hold PWM | Stall current to hold body at end-of-travel against spring. | 26% |
| PWM Carrier (body/tail) | Carrier frequency. Lower = more torque ripple but cooler motor. Higher = smoother but more heat. | 125 Hz |
💡 Tip: Changes apply to the very next motor event — no JSON regeneration needed. Tune live while the fish is moving.
Click 🎙 Record in the Soundboard nav. Select your track from the dropdown. The waveform loads automatically. The three motor key bindings are shown in the top-right corner: J Mouth · K Body · L Tail.
In the Playback on Pi panel, check 🤖 Auto body & tail (ignore segments). This tells Pi Playback to pull body and tail events from the existing .motion.json instead of any manually recorded segments. You can then focus 100% on mouth quality without managing body and tail timing simultaneously.
Human reaction time between hearing a beat and pressing a key is typically 120–200 ms. The ⏱ Reaction offset slider (default 150 ms) automatically shifts every recorded event timestamp backward by that amount. You press the key when you hear the beat; the system records it as if you pressed it on the beat.
Calibrate by recording a single obvious beat, playing it back, and listening for the lag. Increase the offset if the fish lags the beat; decrease if it leads.
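The reaction offset amounts to subtracting a constant from every recorded timestamp. A minimal sketch follows; the function name is hypothetical, and clamping at the start of the clip is an assumption rather than documented behavior.

```python
from typing import List, Sequence


def apply_reaction_offset(times_s: Sequence[float],
                          offset_ms: float = 150.0) -> List[float]:
    """Shift key-press times earlier by offset_ms, clamping at clip start."""
    shift = offset_ms / 1000.0
    return [max(0.0, t - shift) for t in times_s]


shifted = apply_reaction_offset([0.1, 1.0, 2.5])
```

A press at 1.0 s is stored at roughly 0.85 s, and a press earlier than the offset is pinned to 0.0 s instead of going negative.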
Press ▶ Play (Audio row) to start the song on the Pi speaker. Press Space or ⏺ Rec when you're ready to start capturing. Hold J whenever you want the mouth open — release to close. Press Space again to save the segment.
Each key-hold records a single event. Short taps produce pulse events (< 200 ms); longer holds produce hold-mode events that keep the mouth open for the full duration.
Click the timeline canvas to seek to any position, or use ⏮ Rewind. Press ▶ Play to resume, then Space to start recording again. The new segment will cover only its own time range — all other existing events are preserved. This is the newer-wins compile strategy: later recordings always win within their window.
Press ▶ Play on Pi in the Playback panel. This compiles all mouth segments (and pulls auto body/tail from the motion file), pushes the event list to the server, then plays audio + motors in full sync. This is the same path as the Soundboard — what you see is what you ship.
When the performance is final, press ⬇ Bake to JSON. This compiles all segments and patches them into outputs/{stem}.motion.json using newer-wins logic. The Soundboard immediately uses the new file — no server restart required. The raw segment files are kept, so you can always re-bake with different settings.
A named recording take with a t_start, t_end, and list of motor events. Stored in {stem}.{motor}.segments.json alongside the audio file.
Segments are sorted by recorded_at. Each later segment erases all events in its time range from earlier segments, then inserts its own. The result is a clean, flat, sorted event list.
A key held ≥ 200 ms becomes a hold event — the motor opens and holds until hold_ms elapses. A shorter tap becomes a pulse — a single drive–coast cycle. Both are stored the same way; the dispatcher picks the right motion sequence at play time.
Pi Playback pushes the compiled event list into an in-memory _motion_override dict keyed by stem. /play checks this before opening the .motion.json file. Bake writes the override back to disk permanently.
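The newer-wins compile described above can be sketched as a fold over segments in recording order. This assumes all segments belong to one motor (matching the per-motor segment files) and treats the time window as inclusive at both ends, which is a guess; the `Segment` fields beyond recorded_at, t_start, and t_end are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[float, str]  # (time in seconds, motor name)


@dataclass
class Segment:
    recorded_at: float   # ordering key: later takes win
    t_start: float
    t_end: float
    events: List[Event] = field(default_factory=list)


def compile_segments(segments: List[Segment]) -> List[Event]:
    """Later takes erase earlier events inside their window, then add their own."""
    flat: List[Event] = []
    for seg in sorted(segments, key=lambda s: s.recorded_at):
        flat = [e for e in flat if not (seg.t_start <= e[0] <= seg.t_end)]
        flat.extend(seg.events)
    return sorted(flat)
```

Re-recording a 4 s to 6 s window therefore replaces only the events inside that window and leaves the rest of an earlier full-length take intact.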
Use headphones while recording so the speaker audio doesn't bleed into your reaction time calibration. Wired is better — Bluetooth adds 50–200 ms of its own latency.
Record sections, not the whole song. Tackle the chorus first (it repeats), then verses. Keep segments short — a 30-second segment is easier to redo than a 3-minute one.
Mouth on vowels, not consonants. Open on the vowel onset of each word; close during the consonant gap. This is how the original fish firmware worked and it looks the most natural.
Watch the fish, not the screen. After a rough pass, sit across the room, press Pi Playback, and judge it from the audience perspective. That's what matters.
Retune before you re-record. If the mouth doesn't look right during playback, it's usually a PWM tuning issue, not a timing issue. Go back to the Soundboard sliders first.
Wire colour guide, pin mapping, tuned PWM parameters, and the mechanical physics behind spring-return stall control. All values are live-tunable in the web controller.
| Motor | DOF | Motor Wire | DRV8833 Terminal | Logic Wire | Logic Signal | Signal Wire | Arduino (Legacy) | RPi GPIO | RPi Pin |
|---|---|---|---|---|---|---|---|---|---|
| #1 Body | Front − | Yellow | #2, Out 3 | Brown | INT 4 | Yellow | D3 | 25 | P22 |
| | Front + | White | #2, Out 4 | Red | INT 3 | Orange | D11 | 24 | P18 |
| #2 Mouth | Mouth − | Black | #1, Out 3 | Brown | INT 4 | Grey | D5 | 27 ⚠ | P13 |
| | Mouth + | Red | #1, Out 4 | Red | INT 3 | Purple | D6 | 17 | P11 |
| #3 Tail | Tail − | Black | #1, Out 1 | Brown | INT 2 | Blue | D9 | 23 | P16 |
| | Tail + | Orange | #1, Out 2 | Red | INT 1 | Green | D10 | 22 | P15 |
| GND | GND (shared) | | | | | | | GND | P6 |
⚠ GPIO 18 — HiFiBerry I2S Conflict
The Arduino schematic assigned Mouth − to Arduino D5 → GPIO 18 (Pin 12). GPIO 18 is the I2S BCLK clock line claimed by the HiFiBerry DAC+ADC driver. Connecting a motor signal here kills audio and prevents PWM from working. On the Pi, Mouth − is remapped to GPIO 27 (Pin 13). Only the PWM signal wire moves — the DRV8833 power rails stay unchanged. GPIOs 18, 19, 20, 21 must never be used for motor control.
DRV8833 H-Bridge Direction Logic
pwm > 0 → forward : IN1 = 0 %, IN2 = pwm %
pwm < 0 → reverse : IN1 = |pwm| %, IN2 = 0 %
pwm = 0 → coast : IN1 = 0 %, IN2 = 0 %
Each motor uses two GPIO pins. Mouth − is on GPIO 27 (remapped from GPIO 18 which is reserved for HiFiBerry I2S BCLK).
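The direction logic above maps a signed duty cycle onto the two driver inputs. A minimal sketch of that mapping, independent of any GPIO library, is shown here; the function name and the clamping of out-of-range requests are assumptions.

```python
from typing import Tuple


def drv8833_inputs(pwm: float) -> Tuple[float, float]:
    """Map a signed duty (-100..100) to (IN1 %, IN2 %) per the logic above."""
    pwm = max(-100.0, min(100.0, float(pwm)))  # clamp out-of-range requests
    if pwm > 0:
        return (0.0, pwm)     # forward: IN1 low, IN2 carries the PWM
    if pwm < 0:
        return (-pwm, 0.0)    # reverse: IN1 carries the PWM, IN2 low
    return (0.0, 0.0)         # coast: both inputs low
```

Keeping this as a pure function makes it easy to unit-test on a laptop before the same values are handed to whichever PWM backend drives the pins.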
Values below were found empirically by testing against real tracks on the BreadVolt 5 V supply. All parameters are live-adjustable from the web controller — changes take effect on the next motor event without restarting the server or regenerating JSON. Carrier frequency is 125 Hz for all three motors (see below).
| Motor | Phase | Parameter | Value | Notes |
|---|---|---|---|---|
| Mouth | Drive | Open PWM | 96 % | Overcomes jaw cam stiction quickly |
| | | Open duration | 120 ms | Drives jaw to full-open position |
| | Hold | Hold PWM | 40 % | Minimum duty to stall against return spring |
| | | Hold duration | 300 ms | Sustained vowel window (energy-duration gated) |
| | Close | Rev PWM | 85 % | Assisted close: spring + motor together |
| | | Rev duration / Coast | 45 ms / 5 ms | Coast gap prevents current spike on direction reversal |
| Body | Kick | Kick PWM | 55 % | Breaks static friction on body cam |
| | | Kick duration | 45 ms | Short impulse; transitions straight to travel |
| | Travel | Travel PWM | 26 % | Sustains motion to end-of-travel at low current |
| | | Travel duration | 160 ms | |
| | Hold | Hold PWM | 26 % | Stall against spring at end-of-travel (sustained accent) |
| | | Hold duration | 275 ms | Release to coast; spring returns body to rest |
| Tail | Drive | Drive PWM | 90 % | Tail needs high initial force (longer lever arm) |
| | | Drive duration | 130 ms | |
| | Hold | Hold PWM | 45 % | IOI-gated: fires on long inter-onset gaps ≥ 300 ms |
| | | Hold duration | 300 ms | 70 % of IOI gap, capped at 500 ms |
The pipeline classifies every motor event as either a short pulse or a longer hold before writing the motion JSON. The runtime dispatches them through different code paths — pulses fire a quick open→coast→close sequence; holds drive to end-of-travel, stall at a reduced duty cycle against the spring, then release. The hold PWM is the key tuned value: too low and the mechanism drifts back mid-hold; too high and current (and heat) accumulates unnecessarily.
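As a rough illustration of the two code paths, the sketch below expands a mouth event into a list of (signed PWM %, duration ms) segments using the tuned mouth values. The function name, the dictionary keys, the sign convention (positive = open), and the assumption that a hold ends with the same coast-then-reverse close as a pulse are all hypothetical; the actual runtime may order the phases differently.

```python
# Tuned mouth values from the table; the key names are illustrative.
MOUTH = {
    "open_pwm": 96, "open_ms": 120,
    "hold_pwm": 40,
    "rev_pwm": 85, "rev_ms": 45,
    "coast_ms": 5,
}


def mouth_segments(mode, hold_ms=0, p=MOUTH):
    """Return [(signed_pwm_percent, duration_ms), ...] for one mouth event."""
    segments = [(p["open_pwm"], p["open_ms"])]    # drive the jaw open
    if mode == "hold":
        segments.append((p["hold_pwm"], hold_ms))  # stall against the spring
    segments.append((0, p["coast_ms"]))            # coast gap before reversing
    segments.append((-p["rev_pwm"], p["rev_ms"]))  # assisted close
    return segments
```

Under these assumptions a pulse lasts 120 + 5 + 45 = 170 ms, and a hold adds `hold_ms` at the reduced hold duty between the open and the coast gap.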
Mouth & Body — Energy-Duration Method
At each onset, the algorithm walks forward through the normalised RMS envelope.
It measures how long the envelope stays above a floor threshold (mouth: same as detection threshold;
body: 50 % of that). If the duration exceeds the hold_threshold_ms (mouth: 300 ms, body: 275 ms),
the event is classified as a hold and hold_ms is set to that measured duration,
capped at a maximum (mouth: 800 ms, body: 600 ms). Events below threshold are pulses.
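The energy-duration rule can be sketched as follows. This is a simplified illustration under stated assumptions: the envelope is a plain list of normalised RMS values sampled every `frame_ms` milliseconds (a hypothetical parameter), and the function name and returned dictionary shape are invented for the example.

```python
def classify_energy_event(envelope, onset_idx, frame_ms, floor,
                          hold_threshold_ms, max_hold_ms):
    """Classify one onset as a pulse or a hold from the RMS envelope."""
    # Walk forward from the onset while the envelope stays above the floor.
    i = onset_idx
    while i < len(envelope) and envelope[i] > floor:
        i += 1
    duration_ms = (i - onset_idx) * frame_ms

    # Long enough above the floor: a hold, capped at the per-motor maximum.
    if duration_ms > hold_threshold_ms:
        return {"mode": "hold", "hold_ms": min(duration_ms, max_hold_ms)}
    return {"mode": "pulse"}
```

With the mouth settings from the text (`hold_threshold_ms=300`, `max_hold_ms=800`), an envelope that stays above the floor for 400 ms yields a 400 ms hold, while one that stays above it for 1 000 ms is capped to 800 ms. For the body, the caller would pass a floor at 50 % of the detection threshold.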
Tail — IOI (Inter-Onset Interval) Method
Tail events are driven by beat-onset detection rather than the speech envelope.
Hold classification uses the gap to the next tail event: if that gap is
≥ 300 ms (tail_hold_ioi_ms), the current event becomes a hold.
The hold duration is set to 70 % of the gap (giving the spring time to return),
capped at 500 ms. This keeps the tail raised through long musical phrases rather than
snapping back immediately after every beat.
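A minimal sketch of the IOI rule, assuming tail onsets arrive as a sorted list of times in milliseconds. The function name, the `t_ms` key, and the choice to treat the final event (which has no following gap) as a pulse are assumptions, not details taken from the pipeline.

```python
def classify_tail_events(onsets_ms, tail_hold_ioi_ms=300.0,
                         hold_fraction=0.70, max_hold_ms=500.0):
    """Mark each tail onset as a pulse or a hold from the gap to the next onset."""
    events = []
    for i, t in enumerate(onsets_ms):
        event = {"t_ms": t, "mode": "pulse"}
        if i + 1 < len(onsets_ms):
            gap = onsets_ms[i + 1] - t
            if gap >= tail_hold_ioi_ms:
                # Hold for 70 % of the gap so the spring has time to return.
                event["mode"] = "hold"
                event["hold_ms"] = min(hold_fraction * gap, max_hold_ms)
        events.append(event)
    return events
```

For onsets at 0, 200, 600, and 1 600 ms, the gaps are 200, 400, and 1 000 ms: the first event stays a pulse, the second becomes a 280 ms hold, and the third becomes a hold capped at 500 ms.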
The mode field in the motion JSON ("pulse" or "hold")
determines which function runs at playback time. Each DOF has a _pp_* pulse function and a
_*_sustain_for(hold_ms) variant. Crucially, both only stop their own motor
(never _stop_all()), so concurrent DOF movements survive.
The frequency divider and DOF enable/disable checks run before dispatch,
so they apply equally to pulse and hold events.
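The dispatch order described above might look like the sketch below. Everything named here is hypothetical: the `dof` key, the `make_dispatcher` helper, and the reading of the frequency divider as "act on every Nth event per DOF" are assumptions made for illustration. The per-DOF callables stand in for the `_pp_*` and `_*_sustain_for(hold_ms)` functions.

```python
def make_dispatcher(pulse_fns, sustain_fns, enabled, divider):
    """Build a dispatch(event) function that gates, then routes by mode."""
    counters = {dof: 0 for dof in pulse_fns}

    def dispatch(event):
        dof = event["dof"]
        # Gate 1: DOF enable/disable check runs before any motor call.
        if not enabled.get(dof, False):
            return False
        # Gate 2: frequency divider (assumed: keep every Nth event).
        counters[dof] += 1
        if (counters[dof] - 1) % divider.get(dof, 1) != 0:
            return False
        # Route by mode; each callable is expected to stop only its own motor.
        if event.get("mode") == "hold":
            sustain_fns[dof](event["hold_ms"])
        else:
            pulse_fns[dof]()
        return True

    return dispatch
```

Because both gates sit in front of the mode check, a disabled or divided-out event is dropped identically whether it is a pulse or a hold.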
The Gemmy fish uses spring-return cam mechanisms on all three axes, not continuous-rotation or position-controlled servos. The spring's role explains why the hold strategy works safely at all.
❌ Hard Stop (mechanical endstop)
When a DC motor drives into a hard mechanical endstop, shaft velocity drops to zero. Back-EMF, which is proportional to velocity, also drops to zero. With no back-EMF to limit current, the winding resistance alone determines the current draw — often 5–10× the running current. This causes rapid thermal buildup in the motor windings, and the static load puts full torque into the gearbox, risking gear tooth shear or cam binding. There is no stable equilibrium: the motor simply dissipates power as heat until something fails.
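The stall-current argument follows from the steady-state DC motor relation I = (V_applied − k_e·ω) / R. The sketch below uses that simplified model (winding inductance ignored, PWM treated as an average voltage). The resistance, back-EMF constant, and running speed are illustrative placeholders, not measurements from the fish motors; only the 5 V supply comes from the build.

```python
def winding_current(v_supply, duty, resistance_ohm, k_e, omega):
    """Average winding current: (duty * V - back-EMF) / R."""
    back_emf = k_e * omega  # proportional to shaft speed
    return (duty * v_supply - back_emf) / resistance_ohm


# Hypothetical motor parameters, chosen only to show the ratio.
V_SUPPLY = 5.0     # volts (5 V supply from the build)
R_WINDING = 5.0    # ohms (assumed)
K_E = 0.01         # V*s/rad (assumed)
OMEGA_RUN = 400.0  # rad/s while running freely (assumed)

running = winding_current(V_SUPPLY, 1.0, R_WINDING, K_E, OMEGA_RUN)
stalled = winding_current(V_SUPPLY, 1.0, R_WINDING, K_E, 0.0)
```

With these placeholder numbers the running current is about 0.2 A and the stalled current is 1.0 A, a 5× increase, which is the low end of the range quoted above.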
✓ Spring Return (stall against restoring force)
The return spring provides a continuously increasing restoring force as the mechanism approaches end-of-travel. At the hold PWM setpoint, motor torque balances the spring's restoring force at a stable equilibrium. Any small perturbation is self-correcting: if the mechanism slips slightly backward, the motor torque exceeds the spring force and re-drives it forward; if it over-travels, the spring pushes back harder than the motor. Because the mechanism is nearly stationary during a hold, back-EMF is small, so average current is limited mainly by the reduced hold duty cycle (26–45 %, versus 55–96 % for the initial drive). On release (PWM = 0, coast), the spring returns the mechanism to its rest position without any reverse motor command.
RPi.GPIO software PWM works by toggling GPIO pins on OS-scheduler ticks. The useful range is roughly 125 Hz – 10 kHz; above ~10 kHz the OS jitter makes pulse widths unreliable. All three motors default to 125 Hz.
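A minimal sketch of driving one motor with RPi.GPIO software PWM at the 125 Hz default. The `MotorPWM` class is hypothetical; the GPIO module is passed in so the class can be exercised off the Pi with a stand-in. It uses only standard RPi.GPIO calls: `setup`, `PWM(pin, frequency)`, `start`, and `ChangeDutyCycle`.

```python
class MotorPWM:
    """Two software-PWM channels feeding one DRV8833 input pair."""

    def __init__(self, gpio, in1_pin, in2_pin, freq_hz=125):
        gpio.setup(in1_pin, gpio.OUT)
        gpio.setup(in2_pin, gpio.OUT)
        self._in1 = gpio.PWM(in1_pin, freq_hz)
        self._in2 = gpio.PWM(in2_pin, freq_hz)
        self._in1.start(0)  # begin at 0 % duty: motor coasting
        self._in2.start(0)

    def drive(self, in1_duty, in2_duty):
        """Set both input duty cycles in percent (0-100)."""
        self._in1.ChangeDutyCycle(in1_duty)
        self._in2.ChangeDutyCycle(in2_duty)

    def coast(self):
        """Release this motor only; other motors are untouched."""
        self.drive(0, 0)


# On the Pi (which pin is IN1 vs IN2 is an assumption):
#   import RPi.GPIO as GPIO
#   GPIO.setmode(GPIO.BCM)
#   mouth = MotorPWM(GPIO, in1_pin=17, in2_pin=27)
#   mouth.drive(0, 96)   # 96 % open drive
#   mouth.coast()
```

Passing the module in, rather than importing it at the top of the file, also keeps the laptop simulation environment free of a hard RPi.GPIO dependency.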
Motor windings are inductors. At low carrier frequency the on-time is long, so the coil has time to build current smoothly. At high frequency the short pulses cause rapid current rise and fall (high di/dt), creating switching noise and magnetic ripple at the carrier frequency.
The BreadVolt and any upstream supply contain LC output filters. High-frequency PWM switching transients couple onto the supply rail and can excite the LC filter's resonant frequency, producing audible coil whine whose pitch tracks the PWM carrier — particularly loud on the body motor due to lower winding inductance and larger current transients.
At 125 Hz the switching frequency sits at the low end of the audible range (~20 Hz–20 kHz), where human hearing is far less sensitive than in the 1–10 kHz band, so the carrier is much less noticeable as a tone. The long PWM period gives the motor inductance maximum time to integrate each pulse into smooth average torque. Empirically, switching noise was lowest at 125 Hz across all three motors on the 5 V battery supply; 1 kHz–10 kHz all produced varying degrees of audible whine.
Hardware fix still recommended
A 470–1 000 µF bulk capacitor placed directly across the DRV8833 VM / GND supply pins would further reduce supply-coupled switching noise by providing local charge storage — smoothing the current spikes that the motor driver draws from the supply rail on each PWM transition. This hardware modification has not yet been applied to the prototype.
Billy Bass in action — three songs, fully animated mouth, body, and tail driven by the OpenMouth pipeline. Recorded on the Raspberry Pi with the HiFiBerry DAC+ ADC for audio output.
Smash Mouth · Astro Lounge (1999)
Kanye West ft. Jamie Foxx · Late Registration (2005)
Yoko Takahashi · Neon Genesis Evangelion OST (1995)