From Siri to Gemini: Building Voice Assistant Integrations in React Native
2026-03-01
10 min read

A practical 2026 guide to integrating Gemini-class LLMs and Siri into React Native: streaming voice, STT choices, native bridges, and privacy best practices.

The pain we solve first

Building a fast, private, and reliable voice assistant in React Native is hard: long voice pipelines, brittle native bridges, and uncertainty about what happens to audio when you call platform services like Siri or Google Assistant. This guide gives a modern, pragmatic blueprint for integrating LLM-powered assistants (think Gemini) into React Native apps — with streaming responses, speech-to-text choices, native bridges, fallback strategies, and real privacy guardrails for 2026.

Executive summary — what you'll get

By the end of this article you will have:

  • Clear wiring options: on-device STT, platform STT (Apple/Android), or cloud STT -> Gemini.
  • Working TypeScript examples for React Native (managed Expo + native module patterns) showing audio capture, WebSocket streaming, and rendering token-by-token responses.
  • Fallback patterns for offline, low-bandwidth, and platform-assistant handoff (Siri/Google Assistant).
  • Practical privacy and compliance controls when routing audio through Apple/Google tech or third-party clouds.

Why this matters in 2026

Platform assistants evolved rapidly through late 2024–2025. Notably, Apple’s move to integrate Google’s Gemini into Siri (announced publicly by early 2026) changed expectations: assistants now return rich, multimodal, and streaming outputs. For mobile apps that embed assistant features, that means two immediate demands:

  1. Low-latency streaming so partial LLM outputs arrive as the user speaks.
  2. Clear privacy boundaries because platform-level processing may route audio to third parties.

React Native teams must choose where to do STT, where to send audio, and how to handle token streams without sacrificing UX or compliance.

High-level architecture patterns

Pick one of three common architectures depending on product goals:

Option A — On-device STT + LLM in cloud (best privacy/latency balance)

  • Record audio in the app.
  • Use on-device STT (if available) to get text locally.
  • Send text to Gemini API for responses or follow-up multimodal generation.
  • Use TTS for spoken replies (local or cloud).

Option B — Platform STT (Siri/Google) + LLM cloud

  • Leverage Siri/Android voice input to reduce battery & CPU usage.
  • Platform returns text; app sends text to LLM.
  • Note privacy: platform policies determine what gets logged/retained.

Option C — Raw audio -> Cloud STT/LLM streaming

  • Stream encoded audio to your server or directly to a cloud STT + Gemini pipeline.
  • Enables full control, advanced audio features, and end-to-end encryption (if you provide it).
  • Higher cost and compliance surface area.

Key trade-offs

  • Latency: on-device STT with local TTS is typically fastest, then platform STT, then cloud streaming, though network and compute conditions can reorder these.
  • Privacy: on-device preferred; platform STT may route to vendor clouds; raw cloud audio has the largest attack surface.
  • Complexity: platform STT easiest; full streaming LLM pipeline is most complex but most flexible.

React Native patterns: code-first

Below are practical, copyable snippets. The examples use TypeScript and show an Expo-managed approach where possible, plus notes for native modules where needed.

1) Recording audio in Expo (managed) — TypeScript

Use expo-av for audio capture. Chunk and send PCM/Opus frames via WebSocket for low-latency streaming.

// recordService.ts
import { Audio } from 'expo-av'

export async function startRecording(): Promise<Audio.Recording> {
  await Audio.requestPermissionsAsync()
  await Audio.setAudioModeAsync({ allowsRecordingIOS: true, playsInSilentModeIOS: true })

  const recording = new Audio.Recording()
  await recording.prepareToRecordAsync(Audio.RecordingOptionsPresets.LOW_QUALITY)
  await recording.startAsync()

  // Expo's managed API records to a file; it doesn't expose raw frames.
  // For production streaming, use a native module that emits PCM/Opus chunks.
  return recording
}

export async function stopRecording(recording: Audio.Recording): Promise<string | null> {
  await recording.stopAndUnloadAsync()
  return recording.getURI()
}

Note: Expo-managed flow is fine for simple capture. For low-latency streaming you likely need a native module that yields raw PCM/Opus frames (iOS AVAudioEngine / Android AudioRecord).

2) Native module pattern for streaming audio frames (iOS example)

Implement a small TurboModule or React Native bridge that yields Opus/PCM frames as ArrayBuffer. Expose a TypeScript interface and deliver frames to a WebSocket.

// types.d.ts
export interface VoiceStreamModule {
  startStream(): void
  stopStream(): void
}

// Usage in RN: frames arrive as native events
import { NativeModules, NativeEventEmitter } from 'react-native'

const { VoiceStream } = NativeModules
const emitter = new NativeEventEmitter(VoiceStream)

// `ws` is an already-open WebSocket to your streaming proxy
emitter.addListener('onFrame', (payload: { chunk: string }) => {
  ws.send(JSON.stringify({ type: 'audio.chunk', data: payload.chunk }))
})

VoiceStream.startStream()

Why native? Because reliable low-latency capture needs smaller audio buffers and control over encoding (Opus recommended).

3) WebSocket streaming to a proxy server (Node.js) that forwards to Gemini

High-level idea: the app streams base64 audio chunks over a WebSocket. The server decodes and forwards to a cloud STT or directly to Gemini’s streaming ingestion (many LLM providers support streaming via gRPC/HTTP2). Keep the connection alive for token streaming back to the client.

// server/ws-proxy.js (simplified)
const WebSocket = require('ws')
const { forwardToGemini } = require('./gemini-client') // handles gRPC/HTTP2 streaming

const wss = new WebSocket.Server({ port: 8080 })

wss.on('connection', (ws) => {
  const geminiStream = forwardToGemini()

  ws.on('message', (msg) => {
    // ws delivers Buffers; stringify before parsing
    const { type, data } = JSON.parse(msg.toString())
    if (type === 'audio.chunk') {
      geminiStream.writeAudioChunk(Buffer.from(data, 'base64'))
    } else if (type === 'end') {
      geminiStream.endAudio()
    }
  })

  geminiStream.on('token', (token) => {
    ws.send(JSON.stringify({ type: 'token', data: token }))
  })

  ws.on('close', () => {
    geminiStream.cancel()
  })
})

The gemini-client implements the protocol for the provider. Many providers return incremental tokens via events; forward those as they arrive to render token-by-token UI in the app.
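That provider protocol is easy to stub while you build the rest of the pipeline. Below is a hypothetical mock with the same surface the proxy expects (writeAudioChunk, endAudio, cancel, plus a 'token' event); the real client would wrap your provider's streaming SDK, whose method names will differ:

```typescript
// gemini-client.mock.ts
// Hypothetical stand-in for a real provider client: same event shape the
// proxy expects, but it replays canned tokens when the audio leg ends.
// Useful for wiring the WebSocket path end-to-end before adding API keys.
import { EventEmitter } from 'events'

export class MockAssistantStream extends EventEmitter {
  private bytesReceived = 0

  constructor(private cannedTokens: string[]) {
    super()
  }

  writeAudioChunk(chunk: Buffer): void {
    // A real client would forward these bytes upstream; the mock just counts them.
    this.bytesReceived += chunk.length
  }

  endAudio(): void {
    // A real client waits for the model; the mock replies immediately.
    for (const token of this.cannedTokens) this.emit('token', token)
  }

  cancel(): void {
    this.removeAllListeners()
  }
}

export function forwardToGemini(tokens: string[] = ['Hello', ', ', 'world']): MockAssistantStream {
  return new MockAssistantStream(tokens)
}
```

Swapping the mock for a real client is then a one-file change on the server, and the app never notices.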

4) Rendering streaming responses in RN

// useTokenStream.ts
import { useState, useEffect } from 'react'

export function useTokenStream(wsUrl: string) {
  const [tokens, setTokens] = useState<string[]>([])

  useEffect(() => {
    const ws = new WebSocket(wsUrl)
    ws.onmessage = (e) => {
      const msg = JSON.parse(e.data)
      if (msg.type === 'token') setTokens((t) => [...t, msg.data])
    }
    return () => ws.close()
  }, [wsUrl])

  return { tokens }
}

Display tokens immediately for a snappy assistant. For voice, convert tokens to TTS or play audio chunks if the LLM returns them.
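Feeding single tokens straight to TTS sounds choppy; batching into sentences works much better. This illustrative helper (not a library API) buffers streamed tokens and flushes whole sentences to a speak callback such as expo-speech's Speech.speak:

```typescript
// sentenceBatcher.ts
// Buffers streamed LLM tokens and emits complete sentences, so the TTS
// engine gets natural prosody instead of word-by-word fragments.
export class SentenceBatcher {
  private buffer = ''

  constructor(private speak: (sentence: string) => void) {}

  push(token: string): void {
    this.buffer += token
    // Flush each sentence once punctuation is followed by whitespace,
    // which avoids splitting on things like "3.14" mid-number.
    let match: RegExpMatchArray | null
    while ((match = this.buffer.match(/^([\s\S]*?[.!?])\s+/))) {
      this.speak(match[1])
      this.buffer = this.buffer.slice(match[0].length)
    }
  }

  flush(): void {
    // Call when the stream ends to speak any trailing partial sentence.
    if (this.buffer.trim()) this.speak(this.buffer.trim())
    this.buffer = ''
  }
}
```

Wire `push` into the WebSocket token handler and call `flush` on the stream's end event.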

Handoff strategies — when to use Siri / Google Assistant vs your in-app assistant

Handoff is a high-value UX: if your app can’t complete an intent, hand off to Siri/Google Assistant in a way that preserves context.

  • Siri Shortcuts: Expose intents the system can call; supply context via NSUserActivity or SiriKit. For deep assistant queries, invoke the system assistant when the user expects device-level control (calendar, system settings).
  • Intent fallback: When network drops or your pipeline fails, offer to open system assistant with a prefilled query (via URL schemes or intent APIs).
  • Preserve privacy: warn users before sending context to platform assistants.
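These rules can be captured in a small routing policy. The tiers and context fields below are illustrative, not a standard API; the point is to make the handoff decision explicit and testable rather than scattered through UI code:

```typescript
// handoffPolicy.ts
// Decides which pipeline handles an utterance, per the fallback rules above.
export type Route = 'in-app' | 'system-assistant' | 'text-input'

export interface HandoffContext {
  online: boolean
  micPermission: boolean
  intent: 'app-task' | 'device-control' | 'unknown'
  pipelineHealthy: boolean
}

export function routeUtterance(ctx: HandoffContext): Route {
  // Device-level control (calendar, system settings) belongs to the system assistant.
  if (ctx.intent === 'device-control') return 'system-assistant'
  // No mic access: degrade straight to text input.
  if (!ctx.micPermission) return 'text-input'
  // Network or pipeline failure: offer the system assistant as a fallback.
  if (!ctx.online || !ctx.pipelineHealthy) return 'system-assistant'
  return 'in-app'
}
```

Before acting on a 'system-assistant' route, show the privacy warning described above so the user knows context will leave your app.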

Privacy and compliance — actionable rules for 2026

Voice data is sensitive. Platform changes through 2025–2026 made vendor routing visible — Apple using Gemini means audio may touch Google infrastructure even via Siri. That places a premium on transparency and controls.

Checklist you must ship

  • Explicit consent screen: Explain whether audio stays on-device, goes to Apple/Google for STT, or to third-party clouds (Gemini).
  • Granular toggles: Let users choose: on-device STT, platform STT, or cloud STT.
  • Data residency: If you send audio to servers, allow regional endpoints and deletion on request.
  • Minimum retention: Log only what you need; store transcripts not raw audio by default.
  • Encryption: TLS 1.3 for transport; at-rest encryption for stored audio/transcripts.
  • Audit & policy: Maintain an access log and human review policies for flagged content.
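One way to model the granular toggles is a small settings object whose permitted STT paths follow from consent state. The field names here are illustrative; the invariant worth testing is that nothing leaves the device without consent:

```typescript
// voicePrivacySettings.ts
// A sketch of consent-driven STT path selection for the checklist above.
export type SttPath = 'on-device' | 'platform' | 'cloud'

export interface VoicePrivacySettings {
  sttPath: SttPath
  consentToCloudRouting: boolean
  retainTranscripts: boolean
}

export function allowedSttPaths(s: VoicePrivacySettings): SttPath[] {
  // Platform STT may route audio to vendor servers, so it also requires
  // cloud-routing consent; without it, only on-device processing is allowed.
  return s.consentToCloudRouting ? ['on-device', 'platform', 'cloud'] : ['on-device']
}
```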

Specific guidance for platform routing

If you route audio to platform STT (Siri or Android), note:

  • Apple’s Siri may route processing through Gemini or Apple servers depending on context; check the latest App Store policies and user privacy labels.
  • Android SpeechRecognizer may use Google services; confirm whether audio is processed on-device for your target Android API levels.
  • Document for users and in your privacy policy exactly which vendors may process audio.

Fallbacks and resilience

Real users expect the assistant to work in noisy, offline, or low-permission situations. Build multi-tiered fallbacks:

  1. Offline fallback: Local keyword detection or a small on-device model (quantized LLM/WASM) to provide basic commands.
  2. Partial transcripts: If audio streaming fails, upload final short clip to cloud STT and resume with a textual reply.
  3. Graceful degrade to text: Offer a fast text input with context captured from available (partial) audio.
  4. Cached responses: For recurring queries, cache LLM answers to avoid re-sending audio or hitting rate limits.
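The cached-responses tier can be as simple as a TTL map keyed by a normalized transcript, so "What's my balance?" and "whats my balance" hit the same entry. A minimal sketch (the normalization rule is an assumption; tune it to your locale):

```typescript
// responseCache.ts
// TTL cache for recurring assistant queries; avoids re-sending audio
// or burning rate limits on identical questions.
export class ResponseCache {
  private entries = new Map<string, { value: string; expiresAt: number }>()

  // `now` is injectable so expiry is testable without real waiting.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  private key(transcript: string): string {
    // Normalize case and strip punctuation so near-identical phrasings collide.
    return transcript.trim().toLowerCase().replace(/[^\w\s]/g, '')
  }

  get(transcript: string): string | undefined {
    const k = this.key(transcript)
    const e = this.entries.get(k)
    if (!e) return undefined
    if (e.expiresAt <= this.now()) {
      this.entries.delete(k)
      return undefined
    }
    return e.value
  }

  set(transcript: string, value: string): void {
    this.entries.set(this.key(transcript), { value, expiresAt: this.now() + this.ttlMs })
  }
}
```

Keep TTLs short for anything time-sensitive (balances, weather) and longer for static answers.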

Performance & UX tips

  • Use Opus with ~20–40ms frames for low-latency streaming and good compression.
  • Render partial tokens immediately; show a typing indicator for continuity.
  • Limit on-device energy impact by pausing heavy tasks during active recording.
  • Measure round-trip latency (audio capture to first token) and optimize bottlenecks: network, encoding, and server forwarding.
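Measuring capture-to-first-token latency needs only a handful of timestamps recorded at each pipeline stage. A small sketch (mark names are illustrative):

```typescript
// latencyTrace.ts
// Records named timestamps and reports the gap between any two,
// e.g. mark('capture-start') at mic open and mark('first-token')
// in the WebSocket message handler.
export class LatencyTrace {
  private marks = new Map<string, number>()

  // Injectable clock keeps the class testable without real delays.
  constructor(private now: () => number = Date.now) {}

  mark(name: string): void {
    // First occurrence wins, so repeated token events don't overwrite 'first-token'.
    if (!this.marks.has(name)) this.marks.set(name, this.now())
  }

  between(start: string, end: string): number | undefined {
    const a = this.marks.get(start)
    const b = this.marks.get(end)
    return a !== undefined && b !== undefined ? b - a : undefined
  }
}
```

Log the capture-start to first-token gap per session and break it down with extra marks (encode-done, ws-sent, server-forwarded) to find the bottleneck.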

Costs, rate limits, and reliability

LLM and STT costs matter. Streaming increases compute and egress costs. Mitigate with:

  • Pre-filtering: drop low-value sessions locally.
  • Sampling: only send full audio for complex intents; use on-device intent classification for simple ones.
  • Back-pressure: limit concurrent sessions per user.
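The back-pressure rule reduces to a per-user counter checked before opening a streaming session. A minimal server-side sketch:

```typescript
// sessionLimiter.ts
// Caps concurrent streaming sessions per user; check tryAcquire before
// opening a Gemini stream and release on socket close.
export class SessionLimiter {
  private active = new Map<string, number>()

  constructor(private maxPerUser: number) {}

  tryAcquire(userId: string): boolean {
    const n = this.active.get(userId) ?? 0
    if (n >= this.maxPerUser) return false
    this.active.set(userId, n + 1)
    return true
  }

  release(userId: string): void {
    const n = this.active.get(userId) ?? 0
    if (n <= 1) this.active.delete(userId)
    else this.active.set(userId, n - 1)
  }
}
```

When `tryAcquire` fails, respond with a polite busy message or queue the request rather than silently dropping audio.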

2026 advanced strategies & predictions

Expect these trends to shape architecture decisions in the next 12–24 months:

  • Multimodal streaming: Gemini-class LLMs return images, layouts, and audio. Architect to accept and render those streams.
  • On-device micro-LLMs: Increasingly capable on-device models will handle private fallbacks and hot-word understanding without network calls.
  • Platform assistant bridges: Apple/Google will standardize bridges for third-party apps to call platform LLMs with clear privacy contracts. Plan to support both direct API calls and platform mediation.
  • Regulation will tighten: Expect more obligations around biometric/voice data under EU and US frameworks; design for data minimization now.

Case study — shipping a customer support voice assistant (example)

Context: a fintech app wants a voice assistant for quick account lookups and simple tasks. Requirements: high privacy, low latency, offline fallback for common intents.

Solution outline:

  1. Use on-device STT when available (iOS on-device speech), fall back to app-level streaming to a private cloud STT when more accuracy is needed.
  2. Send only parsed intents + minimum context to Gemini for answer generation. Never send PII unless user explicitly consents; use tokenization and hashing for identifiers.
  3. Cache common replies locally for instant offline answers (balance between correctness and currency).
  4. Provide a “Use system assistant” toggle to hand off to Siri for payments/secure operations; require biometric confirmation for critical actions.

Practical result: lowered latency by 40% on average, reduced cloud STT calls by 70%, and produced an auditable consent flow for sensitive actions.

Developer checklist — launch-ready

  • Choose STT path: on-device, platform, or cloud.
  • Implement native audio streaming for low-latency.
  • Build server proxy to Gemini (or use provider SDK) with token streaming support.
  • Expose privacy toggles and consent UI.
  • Prepare fallbacks and offline micro-LLM options.
  • Load-test streaming endpoints and measure end-to-end latency.
  • Document data flows in privacy policy and app privacy labels.

Resources & next steps

Start simple: wire local STT or platform STT to a demo Gemini prompt and add streaming visual updates. Once stable, replace text-only flows with audio streaming and TTS.

Actionable takeaways

  • Default to on-device STT for privacy; fall back to platform STT only when it reduces UX friction and you can disclose the routing.
  • Use a WebSocket + native audio frames for the responsive streaming experience users expect in 2026.
  • Offer a clear consent flow and per-path toggles whenever audio may be processed by Apple/Google or sent to Gemini.
  • Implement token streaming rendering and graceful offline fallbacks to retain usefulness when network fails.

Closing — build confidently in 2026

Integrating Gemini-class assistants into React Native apps is no longer experimental — platform bridges and cloud APIs make it practical. But the teams that ship reliable voice assistants will be the ones who treat audio as a first-class, privacy-sensitive data stream: architect for streaming, choose the STT trade-offs deliberately, and degrade gracefully.

Ready to accelerate: If you want a production-ready voice assistant starter kit (Expo + native bridges + Gemini proxy patterns + privacy UI), check our curated templates and vetted components designed for React Native teams building in 2026.

Call to action

Explore our voice assistant starter kit for React Native at reactnative.store — or contact our team for a tailored integration review. Ship faster, stay private, and make your assistant feel native.
