For a few years now, we’ve been using AI to analyze meeting notes at Hyperflow. Every call gets recorded. Every recording gets transcribed. Every transcript gets fed to AI for summaries, action items, and follow-ups. That pipeline has been running in the background for a while, and it’s been good. Useful. Reliable.

But a few weeks ago, right around the time I moved everything over to OpenClaw, I started noticing the gaps.

A client would share their screen and walk through a dashboard. They’d point at a chart and say, “this number right here, that’s what we need to fix.” The transcript would capture the words. What it wouldn’t capture: which number. Which chart. The thing they were literally pointing at on screen.

Or we’d be in a design review. Someone would pull up a mockup and say, “I don’t love the spacing on this section.” The transcript gives me the words. But the words without the visual? Useless. I’d have to go back, re-watch the recording, find the moment, screenshot it, then manually connect it to what was said.

That’s the kind of work that doesn’t feel like work. It feels like being thorough. But it’s the same thing every time: re-watch, find, screenshot, connect. Over and over. For every call with a visual component, which at Hyperflow is most of them.

So I added one thing to the pipeline. And it’s changing the game.

The Addition

I gave my AI eyes.

Instead of only reading the transcript, OpenClaw now pulls frames from the video recording. Not every frame. Key frames: moments where the screen changes significantly, where someone shares their screen, where a new document or mockup or dashboard appears.
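The selection logic doesn't have to be clever. Here's a minimal sketch of the idea (not OpenClaw's actual heuristic, just the shape of it): score consecutive frames by mean pixel difference and keep the ones that jump.

```python
def frame_diff(prev: bytes, curr: bytes) -> float:
    """Mean absolute pixel difference between two same-size grayscale
    frames, scaled to [0, 1]."""
    total = sum(abs(a - b) for a, b in zip(prev, curr))
    return total / (len(prev) * 255)

def select_key_frames(frames: list[bytes], threshold: float = 0.15) -> list[int]:
    """Keep the first frame, then any frame that differs enough from the
    last kept one. A slide change or a new screen share moves most pixels
    at once, so it scores far above the threshold; a talking head barely
    registers."""
    keep = [0]
    for i in range(1, len(frames)):
        if frame_diff(frames[keep[-1]], frames[i]) > threshold:
            keep.append(i)
    return keep
```

In practice you don't even need to write this: ffmpeg's built-in scene-change filter does the same job on the video file directly.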

Then it analyzes those frames with vision models and ties them back to the transcript. It knows what was on screen when someone said what they said. And from that, it generates visual to-dos: annotated screenshots paired with the specific action items that came out of that moment.
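The tie-back step is mostly timestamp bookkeeping. A sketch, assuming the transcript arrives as timed segments (most transcription tools give you a start time per utterance): binary-search the segment starts to find what was being said when a frame was captured.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from call start
    speaker: str
    text: str

def segment_at(segments: list[Segment], frame_time: float) -> Segment:
    """Find the transcript segment in progress when a frame was captured.

    Assumes segments are sorted by start time. bisect_right locates the
    first segment that starts *after* the frame, so the one just before
    it is the utterance the frame belongs to.
    """
    starts = [s.start for s in segments]
    i = bisect_right(starts, frame_time)
    return segments[max(i - 1, 0)]
```

With that mapping in hand, each key frame plus its segment becomes one unit of input to the vision model.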

The difference between “fix the spacing issue Sarah mentioned” and a screenshot of the exact section with an annotation saying “Sarah: reduce padding between header and chart, feels cramped” is night and day. One requires me to remember context. The other gives me the context.

The Pipeline

How it works (roughly sketched):

1. Video recording: auto-saved from the call
2. Transcript: auto-generated
3. Frame extraction: key moments pulled from the video
4. Vision analysis: Opus 4.6 and OpenAI Vision, run side by side
5. Transcript mapping: frames matched to dialogue
6. Visual to-dos: annotated screenshots plus tasks
7. Context cards: who said what, when, and why

The Setup (It Was Already Halfway There)

Here’s why this was easier than it sounds: the infrastructure was already in place.

Every call at Hyperflow runs through a shared Google Drive folder. This folder already gets populated automatically with two things after every call: the full video recording and the transcript. That’s been our system for months. Nothing new there.

So OpenClaw was already watching that folder. It was already picking up transcripts and processing them. The addition was telling it to also grab the video file, extract key frames, and run them through a vision model before generating the summary.
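The folder-side logic is about as boring as it sounds. A sketch of the matching step, assuming the recording tool writes the video and transcript with the same base name (the file names and extensions here are illustrative, not my actual setup):

```python
from pathlib import Path

def find_call_pairs(folder: Path) -> list[tuple[Path, Path]]:
    """Pair each recording with its transcript by shared base name,
    e.g. weekly-sync.mp4 alongside weekly-sync.txt. Transcripts with
    no matching video are skipped."""
    videos = {p.stem: p for p in folder.glob("*.mp4")}
    pairs = []
    for transcript in folder.glob("*.txt"):
        if transcript.stem in videos:
            pairs.append((videos[transcript.stem], transcript))
    return sorted(pairs)
```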

The source material was sitting right there, untouched. I was feeding my AI the text version of a video call and wondering why it missed the visual parts. In hindsight, that’s like giving someone a phone transcript of a movie and asking them to describe the cinematography.

Two Models, Head to Head

Right now I’m testing this with two different vision setups, running side by side.

Opus 4.6 handles the full pipeline on one path. It reads the transcript, analyzes the extracted frames, and generates the combined output. It’s good at understanding context across a long call and connecting frames to the right parts of the conversation. The summaries feel cohesive. It doesn’t lose the thread.

OpenAI’s Vision models run the same pipeline on a parallel path. Same frames, same transcript, same prompt structure. Different model doing the analysis.

I’m not ready to declare a winner. Both produce useful output. Opus tends to be better at the narrative connections (understanding why something was said in context). OpenAI’s vision is strong on the raw image analysis (identifying UI elements, reading text from screenshots). The ideal might end up being a combination: one model for frame analysis, the other for synthesis.
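For anyone wiring up the same side-by-side test: the two APIs want the image in different shapes, but both requests can be built from one frame and one prompt. A sketch using the message formats both SDKs document (Anthropic takes a base64 image block, OpenAI takes a data URL); the prompt text is my own placeholder, not the one I actually run.

```python
import base64

PROMPT = "What is on screen in this frame, and which action items does it imply?"

def anthropic_message(jpeg_bytes: bytes) -> dict:
    """Frame + prompt in the Anthropic Messages API shape."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(jpeg_bytes).decode()}},
            {"type": "text", "text": PROMPT},
        ],
    }

def openai_message(jpeg_bytes: bytes) -> dict:
    """Same frame + prompt in the OpenAI Chat Completions shape."""
    url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": PROMPT},
        ],
    }
```

Because both paths consume identical frames and an identical prompt, any difference in the output is down to the model, which is the whole point of the comparison.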

The point isn’t which model wins. The point is that both of them produce dramatically better meeting follow-ups than transcript-only analysis. The vision layer is the unlock. The specific model is a tuning decision.

What the Output Looks Like

After a call ends, here’s what I get within about 15 minutes:

Vision Report: Hyperflow Weekly Sync
Feb 6, 2026 · 2:00 PM · 42 minutes · 4 participants (SC, MR, AR, JL)
12 frames captured · 6 action items · 3 high priority · 8 key moments

Moment 01: Sarah Chen @ 14:23

"This conversion rate on the pricing page is way too low. We need to rethink the layout above the fold."

On screen (screen share): app.hyperflow.co/analytics/pricing, the Pricing Page Analytics dashboard, last 30 days. Visitors 12,847 · conversion rate 2.1% · avg. time on page 1:34 · bounce rate 64%, plus a conversions-over-time chart covering Jan 7 through Feb 4.

Annotations: the above-the-fold region is flagged at 2.1% CVR against a 4.5% target, with the CTA buried below the fold and no clear visual hierarchy above it.

Generated tasks:
- Redesign pricing page above-the-fold layout (High · Design · @ 14:23)
- Move primary CTA above the fold with clear visual hierarchy (High · Dev · @ 14:23)
- A/B test new layout against current, targeting 4.5% CVR (Medium · Product · @ 14:31)

Moment 02: Marcus Rivera @ 27:41

"The onboarding flow drops off right here at step 3. People aren't completing the profile section."

On screen (screen share): app.mixpanel.com/funnels/onboarding, the onboarding funnel for the last 7 days (2,341 users). Sign Up 100% (2,341) → Email Verified 82% (1,920) → Profile Complete 48% (922, a 34% drop) → First Action 41% (960).

Annotations: the 34% drop-off at Step 3 is flagged; Steps 1-2 are marked healthy.

Generated tasks:
- Audit Step 3 profile fields and reduce required inputs (High · Product · @ 27:41)
- Add progress indicator to onboarding flow (Medium · Design · @ 27:45)
- Test a "skip profile" option with delayed completion prompt (Medium · Engineering · @ 28:02)

Each moment from the call gets its own card. The annotated screenshot shows exactly what was on screen. The speaker’s words are tied directly to what they were looking at. And the to-dos aren’t vague. They’re specific, visual, and attributed.

Three months from now when someone asks “who requested that change?” I have the receipt. Not a note I typed. A screenshot of what they were pointing at, with their exact words attached.

Why This Matters More Than It Sounds

Here’s the thing about meeting follow-ups: everyone does them, and almost everyone does them badly.

You leave a call. You have a vague list of “things we discussed.” Maybe you typed some notes. Maybe your AI transcription tool gave you bullet points. But the connection between what was said and what was shown is gone the moment the call ends. It lives in your memory, and memory is unreliable.

I used to compensate for this by taking detailed notes during calls. Which meant I was half-present in the meeting. I was there, but I was also documenting, screenshotting, and organizing instead of listening.

Now I’m fully in the call. I don’t take notes. I don’t screenshot anything. I listen, I contribute, I pay attention. And when it’s over, OpenClaw hands me a visual record that’s more thorough than anything I could have produced manually. Because it saw everything I saw, and it didn’t get distracted.

That’s the actual shift. Not “AI does my meeting notes.” That’s been possible for a year. The shift is: AI sees what happened in the meeting the same way I do. Visually. In context. With the full picture.

Give Your AI Eyes

If you’re running any kind of AI meeting analysis, even a basic transcript summary, consider what you’re not feeding it. If your calls involve screen shares, demos, design reviews, dashboard walkthroughs, or anything visual, you’re giving your AI a partial picture and expecting a complete analysis.

The video is already there. Most recording tools save it automatically. Most of us ignore it after the call ends. But that video contains information that the transcript doesn’t. And vision models are now good enough to extract it.

You don’t need my exact setup to start. You need a recording, a vision-capable model, and the willingness to experiment. Extract some frames from your last call. Feed them alongside the transcript. See what comes back.

I started this as a small experiment three weeks ago. Now it’s the part of my workflow I’d fight hardest to keep.


Transcripts gave me words. Vision gave me context. The combination gave me meetings I don’t have to re-watch. I’ll take that trade every time.