The Core Problem: Parsing Natural Language Into Structured Data
The central challenge was turning free-form text like "cinema $14 and popcorn $5" into structured transaction objects with description, amount, category, type, and currency — all from a single text input.
This meant handling chained expenses, written-out numbers ("thirty dollars" → 30), currency detection from symbols and words, time references ("at 8am"), and income vs. expense classification.
Two-Tier Parsing: Speed Meets Accuracy
I didn't want users staring at a loading spinner every time they typed a character. But I also needed the accuracy of an LLM to handle edge cases. The solution was a two-tier parsing system.
Tier 1: Local Parser (Instant Feedback)
A lightweight regex-based parser runs on every keystroke via useMemo. It handles the common case — a description followed by a currency symbol and amount. It matches patterns like $5, €10, £20, detects categories from keywords (e.g., "uber" → Transport, "netflix" → Bills), and classifies income vs. expense from trigger words.
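The Tier 1 idea can be sketched as a pure function over the input string. This is an illustrative sketch, not the app's actual code: the `parseLocal` name, the keyword tables, and the supported symbols are assumptions.

```typescript
// Minimal sketch of a Tier 1 parser. The keyword tables and the
// parseLocal name are illustrative, not the app's actual code.
type ParsedTx = {
  description: string;
  amount: number;
  currency: "USD" | "EUR" | "GBP";
  category: string;
  type: "income" | "expense";
};

const CURRENCY_SYMBOLS: Record<string, ParsedTx["currency"]> = {
  "$": "USD", "€": "EUR", "£": "GBP",
};
const CATEGORY_KEYWORDS: Record<string, string> = {
  uber: "Transport", netflix: "Bills",
};
const INCOME_WORDS = ["salary", "refund", "received"];

function parseLocal(input: string): ParsedTx | null {
  // Common case only: "<description> <symbol><amount>", e.g. "cinema $14".
  const m = input.match(/^(.*?)\s*([$€£])\s*(\d+(?:\.\d{1,2})?)\s*$/);
  if (!m) return null;
  const [, rawDesc, symbol, rawAmount] = m;
  const lower = rawDesc.toLowerCase();
  const keyword = Object.keys(CATEGORY_KEYWORDS).find((k) => lower.includes(k));
  return {
    description: rawDesc.trim(),
    amount: Number(rawAmount),
    currency: CURRENCY_SYMBOLS[symbol],
    category: keyword ? CATEGORY_KEYWORDS[keyword] : "Other",
    type: INCOME_WORDS.some((w) => lower.includes(w)) ? "income" : "expense",
  };
}
```

Because the function is pure, wrapping it in `useMemo` keyed on the input string gives the per-keystroke preview for free.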
Tier 2: Gemini API (On Submit)
When the user hits submit, the input goes to a server-side API route that calls Google Gemini with a structured JSON response schema, temperature 0 for deterministic results, and server-side validation before the data reaches the client.
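The server-side validation step can be sketched as a type guard over the model's JSON. Field names mirror the transaction shape described earlier; the function itself is an assumption about how such a check might look, not the route's actual code.

```typescript
// Hypothetical server-side validation of the model's JSON output before it
// reaches the client. Field names mirror the transaction shape described above.
type Transaction = {
  description: string;
  amount: number;
  category: string;
  type: "income" | "expense";
  currency: string;
};

function validateTransaction(raw: unknown): Transaction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.description !== "string" || r.description.trim() === "") return null;
  if (typeof r.amount !== "number" || !Number.isFinite(r.amount) || r.amount <= 0) return null;
  if (r.type !== "income" && r.type !== "expense") return null;
  if (typeof r.category !== "string" || typeof r.currency !== "string") return null;
  return {
    description: r.description.trim(),
    amount: r.amount,
    category: r.category,
    type: r.type,
    currency: r.currency.toUpperCase(),
  };
}
```

Even with a response schema and temperature 0, treating the model's output as untrusted input keeps a malformed response from ever reaching the database.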
By showing the local parse immediately as a preview and replacing it with the Gemini result on submit, the app feels instant while still being accurate. If the API call fails, the local parser result is already there as the fallback.
Voice Input: 99+ Languages With Auto-Detection
I wanted voice input to work globally, not just in English. Groq's Whisper implementation (whisper-large-v3-turbo) auto-detects the spoken language, so a user can say "almuerzo diez euros" in Spanish, or the equivalent phrase in Arabic, and get the right transaction.
The pipeline has three layers:
• MediaRecorder Hook — wraps the browser's MediaRecorder API with MIME type negotiation (Chrome/Firefox use audio/webm;codecs=opus, Safari uses audio/mp4) and full permission lifecycle handling.
• Speech Recognition Hook — composes the audio recorder with the transcription API, exposing isListening, isTranscribing, transcript, startListening, and stopListening.
• Server-Side Proxy — sends the audio blob to Groq's API with the server-side API key never exposed to the client.
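The MIME negotiation in the first layer reduces to a preference-ordered lookup. A sketch, with the support check injected as a predicate (in the browser that predicate would be `MediaRecorder.isTypeSupported`; the `pickMimeType` name is mine):

```typescript
// MIME negotiation sketch. The support check is injected so the logic is
// testable outside a browser; in the hook it would be
// MediaRecorder.isTypeSupported.
const PREFERRED_MIME_TYPES = [
  "audio/webm;codecs=opus", // Chrome, Firefox
  "audio/mp4",              // Safari
];

function pickMimeType(isSupported: (t: string) => boolean): string | undefined {
  // First supported entry wins, so the order above encodes the preference.
  return PREFERRED_MIME_TYPES.find(isSupported);
}

// Browser-side usage (sketch):
// const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
// const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```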
Bridging Firebase Auth and Convex
I wanted Firebase Auth for its mature Google OAuth flow and Convex for its real-time reactive queries. The bridge lives in ConvexClientProvider — on every Convex request, a custom hook extracts the Firebase JWT and passes it along. Convex verifies it against Firebase's OIDC endpoint.
The result: ctx.auth.getUserIdentity() in any Convex function returns the Firebase UID, and every query/mutation filters by that UID to enforce data isolation at the database layer. No user can read or modify another user's data.
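On the Convex side, the OIDC verification is configured with a provider entry pointing at Firebase's token issuer. A sketch of what that config could look like, with `<project-id>` as a placeholder for the Firebase project ID:

```typescript
// convex/auth.config.ts — sketch of the OIDC provider entry Convex uses to
// verify Firebase-issued JWTs. Replace <project-id> with the Firebase
// project ID; Firebase tokens are issued by securetoken.google.com.
export default {
  providers: [
    {
      domain: "https://securetoken.google.com/<project-id>",
      applicationID: "<project-id>",
    },
  ],
};

// Inside any Convex function, the verified identity is then available:
// const identity = await ctx.auth.getUserIdentity();
// if (!identity) throw new Error("Unauthenticated");
// // identity.subject holds the Firebase UID used to scope every query.
```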
Real-Time Data With Convex
One of the biggest wins of using Convex is reactive queries. The transaction list and user balance are just hooks — useQuery(api.transactions.list). When a transaction is added or deleted, every connected client sees the update instantly — no polling, no WebSocket setup, no cache invalidation.
Rather than recomputing the balance on every page load by summing all transactions, currentBalance is cached on the user document and updated atomically within the same mutation that touches transactions. This keeps reads fast while maintaining consistency.
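The balance arithmetic itself reduces to a sign convention. A sketch with a helper name of my own choosing; in the app the same arithmetic would run inside the Convex mutation that inserts or deletes the transaction, so the transaction write and the balance patch commit atomically:

```typescript
// Sign convention for the cached balance: income adds, expense subtracts.
// Pure helper; the atomicity comes from running it inside the same Convex
// mutation that writes the transaction document.
type TxType = "income" | "expense";

function applyToBalance(
  balance: number,
  amount: number,
  type: TxType,
  remove = false // true when a transaction is being deleted
): number {
  const signed = type === "income" ? amount : -amount;
  return balance + (remove ? -signed : signed);
}
```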
Currency Conversion
When Gemini detects a non-USD currency — from symbols like € or ¥, or from words like "euros" or "rupees" — the API route converts the amount to USD before storing it using the frankfurter.app API (free, uses ECB reference rates). An in-memory cache with a 1-hour TTL avoids redundant calls. If the conversion API fails, the original amount is stored as-is — better to have approximately right data than no data.
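The caching and fallback behavior can be sketched as follows. The rate fetcher is injected so the logic stands alone; in the app it would be the frankfurter.app call, and the `toUsd` name is illustrative:

```typescript
// Sketch of an in-memory rate cache with a 1-hour TTL. The fetcher is
// injected so the logic is testable without hitting frankfurter.app.
const TTL_MS = 60 * 60 * 1000;
const rateCache = new Map<string, { rate: number; fetchedAt: number }>();

async function toUsd(
  amount: number,
  currency: string,
  fetchRate: (c: string) => Promise<number>, // e.g. a frankfurter.app request
  now: () => number = Date.now
): Promise<number> {
  if (currency === "USD") return amount;
  const hit = rateCache.get(currency);
  if (hit && now() - hit.fetchedAt < TTL_MS) return amount * hit.rate;
  try {
    const rate = await fetchRate(currency);
    rateCache.set(currency, { rate, fetchedAt: now() });
    return amount * rate;
  } catch {
    return amount; // conversion API down: store the original amount as-is
  }
}
```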
Swipe-to-Delete With Framer Motion
On mobile, transactions support swipe-to-delete and swipe-to-edit using Framer Motion's drag gestures. The key UX detail: it supports both distance-based (dragged past 72px) and velocity-based (fast flick) triggers. This means a slow deliberate swipe and a quick flick both feel natural. Behind the card, two 72px action buttons (edit and delete) are revealed with a spring animation.
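The dual trigger can be sketched as a small decision function over the drag's end state. The 72px figure matches the action-button width from the text; the velocity cutoff is an assumed illustrative value, as is the function name:

```typescript
// Commit a swipe on either a distance threshold or a velocity ("flick")
// threshold. Leftward drag reveals the buttons, so both values are negative.
const DISTANCE_THRESHOLD_PX = 72;
const VELOCITY_THRESHOLD = 500; // px/s, an assumed cutoff

function shouldRevealActions(offsetX: number, velocityX: number): boolean {
  return offsetX < -DISTANCE_THRESHOLD_PX || velocityX < -VELOCITY_THRESHOLD;
}

// With Framer Motion this would run in onDragEnd (sketch):
// <motion.div drag="x" onDragEnd={(_, info) => {
//   if (shouldRevealActions(info.offset.x, info.velocity.x)) open();
// }} />
```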
Command Processing: Beyond Just Adding Expenses
The app doesn't just add transactions — it understands intent. Typing "delete coffee" matches against recent transactions and removes the right one. "Change burger to $15" finds the burger transaction and updates only the amount, leaving other fields untouched. This is handled by a separate Gemini call with a command-specific schema that classifies intent as log, edit, or delete. Safety limits cap operations at 5 per command to prevent accidental bulk mutations.
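A sketch of what the command result and the safety cap might look like. The union shape, the `findMatches` helper, and the way the limit is applied are all illustrative assumptions, not the app's actual schema:

```typescript
// Illustrative command-schema result: the model classifies intent as
// log, edit, or delete.
type Command =
  | { intent: "log"; description: string; amount: number }
  | { intent: "edit"; match: string; fields: { amount?: number; description?: string } }
  | { intent: "delete"; match: string };

const MAX_OPERATIONS = 5;

// Resolve a fuzzy reference like "coffee" against recent transactions,
// never returning more than the safety limit.
function findMatches(
  query: string,
  recent: { id: string; description: string }[]
): { id: string; description: string }[] {
  const q = query.toLowerCase();
  return recent
    .filter((t) => t.description.toLowerCase().includes(q))
    .slice(0, MAX_OPERATIONS);
}
```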
Testing
The project has ~130 tests across 7 suites using Vitest, covering:
• LLM response validation (malformed JSON, missing fields, out-of-range values)
• Local parser edge cases (multiple currencies, ambiguous inputs)
• Date utility functions (grouping logic, relative labels)
• Currency conversion (caching, API failure fallbacks)
• Audio recorder (MIME negotiation, permission handling)
• Speech recognition (end-to-end hook behavior)
The convex-test library enables testing Convex backend functions with a mock database, so mutations and queries are tested against the real schema without needing a running Convex deployment.
Key Takeaways
• Hybrid local + cloud parsing gives you the best of both worlds — instant UX with accurate results. The local parser is the fallback, not an afterthought.
• Structured LLM output (via responseSchema) is far more reliable than parsing free-text responses. Combine it with temperature 0 and server-side validation for production-grade reliability.
• Bridging auth providers is straightforward with JWT-based verification. Firebase handles the OAuth complexity, Convex handles the data, and a thin token-passing layer connects them.
• Real-time reactivity from Convex eliminated an entire class of state management bugs. No stale caches, no optimistic updates gone wrong, no manual refetching.
• Voice input with auto-language detection opens the app to a global audience with surprisingly little code — the Whisper model does the heavy lifting.
