Your Voice Is Faster Than Your Hands: Voice and AI
The Next Form Factor Starts With Voice
1. Why Is Voice Better?
1a. Hands Are Faster Than Eyes, and Speech Is Faster Than Hands
As I wrote previously, I believe that keyboards, mice, and trackpads are an absurdly outdated way for humans to communicate with computers, one that demands far too much cognitive load.
A 2016 study confirms the obvious: speech is roughly 3x faster than typing. That means we can convey what we need to a computer more efficiently by talking. In an era where AI has gotten remarkably good at speech recognition and interpretation, voice becoming the primary input method isn’t a bold prediction — it’s common sense.
But What About Output? Is Voice Efficient for That Too?
Personally, when consuming content, I still prefer text over audio. Text lets me scan by sentence or paragraph, skip what I already understand, and move at my own pace. Audio forces you to listen sequentially, word by word, which is painfully slow for anyone who processes information quickly.
This led me to two possible solutions: (1) input is voice, but output is visual, or (2) we fundamentally change how language itself works in human-computer interaction.
Think about it — in most conversations, we can already predict someone’s intent from the first few words. That’s exactly why impatient people (guilty as charged) cut others off mid-sentence. What if AI could streamline computer output in a similar way? Not generic human speech, but information designed for maximum human absorption — efficient conversation.
What Does “Efficient Voice Conversation” Look Like?
I think there are three components (with a code sketch to follow after the third):
First: Lead with the conclusion. This is something I emphasize to every founder I meet, and something I constantly work on myself. Start with the bottom line, then explain. It’s faster to understand contextually, and if the listener already has background knowledge, both sides can recognize that immediately and save time. It’s no accident that startup pitches, presentations, and simultaneous interpretation all follow this structure.
Second: Start with the big picture, then drill down based on the listener’s reaction. Rather than explaining everything from scratch assuming zero knowledge — which looks thorough but can be a massive time waste — give the high-level view first and adjust depth based on feedback.
I experience this constantly. When founders pitch me, they don’t know how much I already know about their industry, so they start from the basics. I’ll quickly signal “I’m familiar with this” and give feedback so we can move to the next topic together and save each other’s time. When I’m the one requesting meetings, I usually already have a thesis on the space, so we jump straight into Q&A and I build my understanding of the company through that dialogue.
Third: Emotionally compressed communication. “Yes,” “yes, sure,” and “will do” all signal agreement, yet each carries a completely different emotional weight: formality, warmth, efficiency, packed into a single response.
These subtle emotional contexts allow short utterances to convey enormous amounts of meaning, making communication dramatically more efficient. This means the emotional design of speech becomes critical, and understanding it becomes equally important. We’re heading toward a world where the more perceptive humans — and AIs — have the advantage. Emotional intelligence as competitive edge.
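To make this concrete, here is a toy Python sketch of an output policy combining the first two components: lead with the conclusion, then go only as deep as the listener asks. Every name in it (Reply, deliver, listener_signal, the signal words) is invented for illustration, not any real assistant’s API.

    # Toy sketch: conclusion-first, progressively deeper voice output.
    # All names here are invented for illustration; a real system would
    # wire speak() to text-to-speech and listener_signal() to speech
    # recognition plus intent detection.
    from dataclasses import dataclass

    @dataclass
    class Reply:
        conclusion: str    # the bottom line, spoken first
        levels: list[str]  # big picture first, then finer detail

    def speak(text: str) -> None:
        print(f"AI: {text}")  # stand-in for text-to-speech

    def listener_signal() -> str:
        # Stand-in for ASR + intent detection; here we just read stdin.
        return input("you> ").strip().lower()

    def deliver(reply: Reply) -> None:
        speak(reply.conclusion)              # 1. conclusion first
        for detail in reply.levels:          # 2. drill down on demand
            if listener_signal() in {"got it", "familiar"}:
                return                       # listener knows this; save time
            speak(detail)

    deliver(Reply(
        conclusion="Bottom line: we should switch vendors.",
        levels=[
            "High level: the current vendor misses its SLA about 20% of the time.",
            "Detail: the outages cluster around their Tuesday deploy window.",
        ],
    ))

The point is that “efficient conversation” isn’t just etiquette; it’s a policy a voice system could actually encode.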
This might read like Communication 101, but if AI/computer output evolves in this direction and people adapt to it, those who embrace this mode of interaction will have a serious structural advantage.
The Fascinating Difference Between English and Korean
Here’s something I find personally interesting: English naturally leads with the conclusion and starts high-level, while Korean tends toward detail-first with the conclusion at the end (this is my personal observation from years of hearing pitches and presentations in both languages).
When you think about the strengths of each: Korean’s detail-oriented structure is actually better as input for AI — giving thorough, meticulous instructions produces better outputs. English’s conclusion-first structure is better as output from AI — humans can process and understand it faster.
The implication? Being fluent in both communication styles and switching between them freely is itself a powerful weapon. A bit of a tangent, but a fascinating one.
1b. It’s Instinctive and Intuitive
I won’t argue that people shouldn’t learn keyboards and mice — we live in that reality. But the fact that so many people struggle with kiosks and self-service terminals tells you something: our current interfaces aren’t intuitive enough.
Voice is the most instinctive and intuitive form of human expression. It’s our most primal input method, and technology is only now catching up to it.
This has major implications for interface design. The best interface is one you already know how to use — one that requires zero learning. When users can start using something immediately without training, the adoption speed advantage over any other input method is overwhelming. In an era where AI can understand language well and process information quickly, this advantage only gets amplified.
1c. Your Eyes and Hands Are Free
Most existing interfaces monopolize both hands and eyes simultaneously. Keyboards need your hands. Mice need your eyes tracking the cursor. Touchscreens require you to look at what you’re pressing.
Voice is different. Voice liberates both vision and touch at the same time. Changing navigation routes while driving, switching songs while jogging, setting a timer while cooking — in all these situations, voice does what no other input method can.
That’s why I don’t think of voice interfaces as merely “convenient.” They don’t interrupt your flow. This is a crucial distinction. Voice doesn’t create the sensation of “using an app” the way screen-based interfaces do. It naturally connects you to the computer as an extension of whatever you’re already doing. That’s a paradigm shift in user experience. Voice is the interface you can layer on top of everything else you’re doing — and that’s what makes its potential enormous.
2. If Voice Leads, How Does Hardware Change?
As AI-powered voice processing improves, the natural next step is a transformation in how we use our portable computers — our phones — and therefore a revolution in form factor. The Humane AI Pin and the Rabbit R1 failed, but both reflected this trend. So does my personal favorite, Meta’s smart glasses. (A premium Meta smart glasses model is supposedly dropping this year at close to $1,000 — I’m absolutely buying them.)
So what direction does form factor evolution take? Here’s my prediction:
2a. You Need to Be Able to Wear It All Day
If having constant access to AI gives you a structural advantage over people who need multiple steps to access it, the new form factor has to be comfortable enough to wear all day. It shouldn’t block your ears (safety), it should be light, and it should look natural.
I tried open-ear earbuds — they checked every box except mic quality, which was too poor for meetings, so I returned them. The Shokz OpenRun Pro 2 Mini, on the other hand, has been perfect. Bone conduction means no ear fatigue, they’re relatively inconspicuous, and the mic is good enough for both meetings and AI conversations. I wear them literally all day. (Side note: Whoop is more comfortable than Oura.)
2b. You Need to Communicate Privately
The Shokz work great at home, but in public, people can hear you talking to AI and read your lips — a real privacy concern, especially in the U.S. I’ve thought about a mic you can cup your hand over, or a future ring-shaped form factor with a built-in microphone. Privacy-preserving voice input is a real design challenge that needs solving.
2c. A Small But Essential Screen
Voice will keep improving, but I think screens will remain necessary for some time. Sometimes you need to visually confirm something. Sometimes voice just can’t get the job done. Screens might eventually disappear entirely, but that’s probably after we get true autonomous driving, reasoning AI, and humanoid robots working alongside humans in factories — so maybe 5-10 years out?
The form factor I’m most excited about is smart glasses. My Meta glasses already have speakers that only I can hear and a camera that can observe my surroundings. For dedicated users who are comfortable wearing glasses, adding AR capability with a small screen projected onto the lens would be incredibly powerful.
Glasses are relatively heavy and change your appearance, but I’m still bullish on them as a next-gen form factor precisely because of the screen advantage. A floating holographic display would be even better, but commercial viability for that is much further out.
3. If Voice Leads, How Does Software Change?
3a. Another Great Reset
The rise of voice-first interfaces means declining dependence on screens, which means form factors change, which means the UX of every existing piece of software has to be rebuilt. And that rebuild is an opportunity — every company starts on equal footing again.
For example: if Uber is slow to adapt to voice-first UX and Lyft ships something innovative, customers could actually switch. That’s a reset that rarely happens in mature markets.
At the same time, accuracy becomes more critical than ever for AI companies. Voice-first interaction and new form factors demand precision above all else. The AI companies that win on accuracy will capture enormous value — and this gives every model company a fresh chance to compete.
3b. The Evolution Toward Conversational Content
When I think about where my screen time actually goes, it’s overwhelmingly content apps — YouTube, Netflix, Paramount+. For most utility apps, the transition to voice-first seems relatively straightforward. The real hurdle to leaving the screen behind is content consumption.
This is where I think a massive innovation is coming: the shift from one-directional to two-directional content. Not passively watching or listening, but continuously interacting with AI to shape the story, creating a holistic experience rather than consuming a fixed one. I believe this becomes the dominant content format in the voice-first era. And honestly, this might be the future for Netflix, which has arguably hit its growth ceiling with traditional passive streaming.
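As a rough sketch of what two-directional content could look like mechanically, here is a minimal Python loop where the listener’s words steer each scene. The next_scene function is a hypothetical stand-in for a call to a generative model; nothing here reflects any real product.

    # Minimal sketch of two-directional content: the audience's voice
    # shapes the story between scenes. next_scene() is a hypothetical
    # stand-in for a generative model call.
    def next_scene(story_so_far: list[str], steer: str) -> str:
        return f"Scene {len(story_so_far) + 1}: the story bends toward {steer!r}."

    def run_story() -> None:
        story = ["You wake on a ship drifting off course."]
        print(story[0])
        while True:
            steer = input("shape the story (or say 'end')> ").strip()
            if steer.lower() == "end":
                break
            scene = next_scene(story, steer)
            story.append(scene)
            print(scene)

    run_story()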
3c. Clubhouse Was Ahead of Its Time — The Voice-First Social Era Is Coming
The apps I spend the most time on might be content platforms, but the apps generating the most engagement are social networks. Here’s what’s interesting: I already send many of my messages via voice-to-text. Unfortunately, it still lacks the emotional compression I mentioned earlier — messages come out too stiff and formal. But if voice interfaces become mainstream, I think social networks will inevitably shift toward voice-first too.
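To put the missing piece in concrete terms, here is a toy sketch of what emotional compression might mean in software: the same literal assent rendered with different emotional weight. The tone labels and phrasings are invented for illustration; a real system would have to infer tone from prosody, relationship, and conversation history rather than take it as a parameter.

    # Toy sketch: the same literal "yes" rendered with different
    # emotional weight. Tone labels and phrasings are invented;
    # a real system would infer tone instead of taking it as input.
    ASSENT_BY_TONE = {
        "formal": "Yes.",
        "warm": "Yes, sure!",
        "efficient": "Will do.",
    }

    def render(transcript: str, tone: str) -> str:
        if transcript.strip().lower() in {"yes", "yeah", "ok"}:
            return ASSENT_BY_TONE.get(tone, transcript)
        return transcript  # pass through anything we can't compress

    print(render("yes", "warm"))       # Yes, sure!
    print(render("yes", "efficient"))  # Will do.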
Watching Clubhouse’s rise and fall, I saw them attempt to move from live conversation to asynchronous voice interaction. It was a fascinating experiment, but it was premature — the AI-powered voice infrastructure wasn’t ready yet. Clubhouse was ahead of its time. Now, all the ingredients are finally in place, and I believe a new voice-first social network can actually succeed.
Cut to the Chase
The evolution toward voice has been a natural trajectory driven by human instinct and efficiency — a trend that predates Amazon’s Alexa and Apple’s Siri by decades. But as the staircase model of technological progress illustrates, this direction was stuck on one step, unable to climb to the next because the technology wasn’t ready.
This generation of AI has changed that. Voice-based interfaces, like every other industry, have been given a new opportunity. I believe this interface revolution will be the foundation not just for unicorns, but for companies that genuinely change the world.