Capturing Both Sides of a Zoom Call Without a Bot

If you have ever sat through a video call and watched a smiling fifth participant join with a name like “MeetingHelper AI Notetaker,” you know the pattern. Cloud transcription tools usually join the call as a separate attendee, capture audio via the platform’s bot APIs, and ship it off to their servers. It works. It also requires a host’s permission, a stranger’s presence in your meeting, and a round trip through someone else’s infrastructure.

I did not want any of that for lognote. So the question became: can a Mac record both sides of a call (the mic plus whatever comes out of the speakers) without joining the call as a participant?

It can. The answer is more interesting than I expected.

What “both sides” actually requires

To transcribe a meeting end to end you need two streams:

Your microphone, which captures you.
The system audio coming out of your speakers, which captures everyone else.

The microphone has been a solved problem on macOS for years. AVCaptureSession will hand you the mic input with a standard permission prompt. The system audio side is where things get interesting.

For a long time, capturing what your Mac was about to play required a kernel extension or a virtual audio driver (BlackHole, Soundflower’s descendants, Loopback) that sat between the app and the speakers and copied the bytes as they went past. Those work, but they ask a lot of the user: installers, permissions, sometimes a reboot.

macOS 14.4 introduced a native CoreAudio process tap API. That changed the calculus.

CoreAudio process taps

A process tap is exactly what it sounds like: you ask CoreAudio for a virtual audio object that taps into the output of one or more processes (in our case, the global mix of everything that is not us) and exposes that as something you can read. No kernel extension. No driver install. Permission gated behind a standard TCC prompt.

This is the platform shift worth pausing on. For over a decade, capturing system audio on a Mac meant shipping a driver, and drivers are scary to install, ship, and support. With AudioHardwareCreateProcessTap and CATapDescription, Apple made a first-party primitive for what people had been hacking around for years. That is the difference between “you must install a kext or a driver” and “the OS will ask the user once, like it does for the microphone.”

The tap itself is not the whole story. You cannot install a realtime audio callback directly on a tap. You wrap it in an aggregate device, point that aggregate at an output device as its clock source, and install the callback on the aggregate. The aggregate is what clocks the samples for you so you can read them on a realtime thread.

There are real-world rough edges to handle here. The stubborn one: if the user changes their default output device mid-recording (plugs in headphones, switches to AirPods, the OS quietly reroutes), the aggregate keeps firing but the samples can come back as silence. The OS has told you there is audio. There is not. The fix is to watch the default-output device and rebuild the aggregate against the new clock when it changes. I won’t claim the rebuild handles every audio-routing edge case, but it covers the common ones.
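
To make the rebuild logic concrete, here is a minimal, platform-neutral sketch in Python. It is not lognote's code. The class name TapRebuilder and the build and teardown callables are hypothetical stand-ins for whatever creates and destroys the tap-plus-aggregate pair. On macOS the notification itself would come from a CoreAudio property listener on kAudioHardwarePropertyDefaultOutputDevice; that wiring is left out here.

```python
from typing import Callable, Optional


class TapRebuilder:
    """Rebuilds the capture pipeline when the default output device changes.

    `build` and `teardown` are hypothetical hooks: `build(device_id)` should
    create the aggregate clocked by that device, `teardown(pipeline)` should
    destroy a previously built one.
    """

    def __init__(
        self,
        build: Callable[[int], object],
        teardown: Callable[[object], None],
    ) -> None:
        self._build = build
        self._teardown = teardown
        self._device_id: Optional[int] = None
        self.pipeline: Optional[object] = None
        self.build_count = 0

    def on_default_output(self, device_id: int) -> None:
        """Call once at start-up and again from every change notification."""
        if device_id == self._device_id:
            return  # same device as before: keep the existing aggregate
        if self.pipeline is not None:
            self._teardown(self.pipeline)  # drop the aggregate tied to the old clock
        self.pipeline = self._build(device_id)
        self._device_id = device_id
        self.build_count += 1
```

The early return matters because change notifications can fire more than once for the same device, and tearing down a healthy aggregate would drop audio for no reason.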

The mic is the easy part

AVCaptureSession is straightforward. You find the audio AVCaptureDevice, add it as an input, attach an audio data output with a delegate, and you receive sample buffers on a serial dispatch queue. Permission is the standard microphone TCC prompt.

The interesting part is not the mic itself. It is how you make the mic and the system tap agree on time.

Two clocks, one file

Each input source has its own clock. The system tap reports presentation timestamps anchored to the aggregate device’s clock. The mic reports timestamps anchored to its own capture session. These two clocks are not the same clock, and if you write them into a file naively, they will drift, jump, or land at confusing offsets.

The approach: pick a single session anchor at the moment recording starts, and rebase every sample buffer’s presentation time against that anchor before writing it. Both streams read as offsets from t=0 in the file. They are not perfectly aligned to within a sample (a true cross-clock sync problem is harder than this), but they are aligned closely enough that when you transcribe them later, the timestamps interleave cleanly.
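
The rebasing step is small enough to sketch. The version below is an illustration in Python, not the recorder's actual code, and it assumes both streams' timestamps have already been converted to seconds on a common host timebase. SessionAnchor and rebase are names I made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionAnchor:
    """The single reference point captured when recording starts."""

    start: float  # host time in seconds at the moment recording began

    def rebase(self, pts: float) -> Optional[float]:
        """Return the buffer's offset from t=0 in the file.

        Buffers stamped before the anchor return None so the caller can
        drop them instead of writing a negative timestamp.
        """
        offset = pts - self.start
        if offset < 0:
            return None
        return offset


anchor = SessionAnchor(start=1000.0)
mic_offset = anchor.rebase(1000.25)     # a mic buffer shortly after start
system_offset = anchor.rebase(1000.5)   # a system-audio buffer a bit later
early = anchor.rebase(999.9)            # captured before the anchor: None
```

Both streams go through the same anchor, which is what puts them on one timeline. Nothing here corrects for drift between the clocks over a long recording.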

I considered, briefly, mixing the two streams into a single mono channel before writing. Don’t. The whole point of having both is that you keep them separate.

Why two tracks instead of one mix

Here is the design choice that made everything downstream easier.

You can write both audio streams into the same .m4a container as two distinct audio tracks. Track 0 is the system audio (stereo, from the tap). Track 1 is the mic (mono). They share the file. They share the rebased timeline. But they are addressable independently.

Why does that matter? Because when you hand the file to ffmpeg later, you can pull out just track 0 or just track 1 with a single map flag:

ffmpeg -i recording.m4a -map 0:a:0 -ac 1 -ar 16000 system.wav
ffmpeg -i recording.m4a -map 0:a:1 -ac 1 -ar 16000 mic.wav

Now you have two clean, single-source WAV files at 16 kHz mono. Whisper expects exactly that. You run the transcription twice (once per track), and tag every segment with which speaker it came from. Track 0 becomes “others” (everyone on the call). Track 1 becomes “me.” You interleave the segments by start time, and now you have a meeting transcript with speaker labels, without ever running a real diarization model.

Whisper does not natively do diarization. Recording two tracks does it for you. Free, in the architectural sense.
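
The label-and-interleave step can be sketched in a few lines of Python. This is a sketch under assumptions rather than the shipped code: it assumes each transcription yields a list of dicts with "start", "end", and "text" keys (times in seconds), and the function names are hypothetical.

```python
from typing import Any

Segment = dict[str, Any]


def label(segments: list[Segment], speaker: str) -> list[Segment]:
    """Return copies of the segments tagged with a speaker label."""
    return [{**seg, "speaker": speaker} for seg in segments]


def interleave(system_segments: list[Segment], mic_segments: list[Segment]) -> list[Segment]:
    """Merge both tracks into one list ordered by start time.

    Track 0 (system audio) becomes "others", track 1 (mic) becomes "me".
    sorted() is stable, so on an exact tie the "others" segment comes first.
    """
    merged = label(system_segments, "others") + label(mic_segments, "me")
    return sorted(merged, key=lambda seg: seg["start"])


def render(segments: list[Segment]) -> list[str]:
    """Format each segment as '[mm:ss] speaker: text'."""
    lines = []
    for seg in segments:
        total = int(seg["start"])
        lines.append(f"[{total // 60:02d}:{total % 60:02d}] {seg['speaker']}: {seg['text']}")
    return lines
```

Sorting on start time alone only works because both tracks already share the rebased timeline. With unaligned tracks the same merge would shuffle the conversation.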

The trade-off is that it’s a two-label diarization. Three people on the call all get pooled under “others” because they came out of the same speakers. Real per-participant labeling (separating remote attendees from each other) is a separate, harder problem, and probably involves looking at the call platform’s actual participant streams. For most meetings, “me” vs “everyone else” is the label that matters most.

Why MLX matters here

Recording is half the story. The other half is what you do with the file.

Apple’s MLX framework is the second piece of the platform shift. MLX targets the unified memory architecture of Apple Silicon directly, and mlx-whisper is a port of OpenAI’s Whisper that runs on it. Transcription happens on the Mac’s GPU without sending a byte to a server.
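
For a sense of how little code the local transcription needs, here is a hedged sketch of one per-track call using the mlx-whisper Python package. The model repo below is only an illustrative choice (this post does not depend on a specific one), and transcribe_track is a hypothetical helper name.

```python
from typing import Any

# Assumption: an illustrative model repo, not necessarily the one lognote uses.
MODEL_REPO = "mlx-community/whisper-large-v3-mlx"


def transcribe_track(wav_path: str, model_repo: str = MODEL_REPO) -> list[dict[str, Any]]:
    """Transcribe one 16 kHz mono WAV locally and return trimmed segments."""
    # Imported lazily so this file still loads on machines without MLX.
    # Requires Apple Silicon and `pip install mlx-whisper`.
    import mlx_whisper

    result = mlx_whisper.transcribe(wav_path, path_or_hf_repo=model_repo)
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```

Calling transcribe_track("system.wav") and transcribe_track("mic.wav") would give the two segment lists that get labeled and merged. The model weights are downloaded once on first use; the audio itself stays on the machine.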

Five years ago, this combination did not exist. You could not capture system audio without a kext or a third-party audio driver, and you could not run a serious transcription model on a laptop. Both of those things are now possible on the same Mac, at the same time. lognote is a thin wrapper around the consequence: if the capture is native and the transcription is local, the recording never has to leave your machine.

The file lands on disk and that is the whole story

When you hang up, lognote stops the recorder. Both tracks flush, the .m4a finalizes, and the process exits. No upload. No background sync. The file sits on your machine, next to the transcript that gets written a few minutes later.

Open the resulting .m4a in QuickTime and you can see both tracks. Mute one and you hear just the mic; mute the other and you hear just the rest of the call. Transcribe both and you get a meeting note with both sides interleaved.

If you delete the file, it is gone. The only copy was always on your Mac.

What this approach gives up

The honest limits:

Real-time streaming. A bot that joins the call can stream transcription live. lognote can’t, because the model runs after the recording finishes. Latency for locality.
Per-speaker labels for remote participants. Bots can subscribe to each participant’s individual audio stream. We are working from the mixed output of your speakers, and a mix cannot be cleanly split back into individual voices.
Cross-device anything. This is a Mac story. The architecture rests on macOS 14.4+ on Apple Silicon. iPhone, iPad, Windows, Linux are out of scope today.

What you get back is what the bot approach quietly takes away: a meeting with no extra attendees, no host approval, and no third-party copy of the conversation.

The reframe

For a long time, recording a meeting locally meant compromising on something. You installed a driver, joined the call as a bot, or shipped your audio to a server. The interesting thing about macOS 14.4 plus MLX is that the compromise stopped being load-bearing. Capture is a TCC prompt. Transcription is a GPU pass. The recording is a file on your disk.

Once that’s true, a meeting recorder doesn’t need to be a service. It can just be a tool that produces a file. That feels like the right level of complexity for something that listens to your conversations.