Thank goodness voice computing is finally happening. Now we can work on making it good.
The tech is here, like the free Whisper model (what an unlock that has been from OpenAI, kudos) and ElevenLabs. Plus devices, from Plaud - like an irl version of the Granola video-call transcriber - to Sandbar, a smart ring that you tell your secrets to.
Let’s not forget Apple’s recent $1.6bn acquisition of Q.ai, which will use ‘facial skin micromovements’ to detect words mouthed or spoken
– i.e. cameras in your AirPods stems that do voice without voice by staring really hard at your cheeks. Apple and AI lip-reading? I deserve a kick-back (2025), just sayin'.
While we’re at it, there should be voice for everything: why can’t I point at a lamp and say ‘on’? (2020).
At least we can play with ubiquitous transcription (2022). Like, my starting point for building mist was talking at my watch for 30 minutes (2026).
So let’s take all this as signs that voice computing is here to stay.
Eventually voice has to go two-way, right? Conversational computing? You need to be able to disambiguate, give feedback, repair, iterate, explore.
Investor Tom Hulme points out that we can speak three to four times faster than we type.
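That ratio is easy to sanity-check against commonly cited figures (my numbers, not Hulme's): conversational speech runs around 150 words per minute, average typing around 40.

```python
# Rough, commonly cited rates -- assumptions, not measurements from this post.
SPEAKING_WPM = 150  # conversational speech
TYPING_WPM = 40     # average typist

ratio = SPEAKING_WPM / TYPING_WPM
print(f"Speaking is ~{ratio:.2f}x faster than typing")  # ~3.75x
```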
And so:
Now, generative AI is making conversation the new user interface. Talking to technology requires zero training and no special skills; we have after all spent most of our lives perfecting the approach. It’s as natural as speaking to another person.
Which I agree with in part.
Yes to natural UI: You simply express what you need, and the AI does the rest.
– user interfaces will not be about menus and buttons but intent first (2025).
BUT:
Conversation using voice both ways? I’m not so sure.
Voice is asymmetric. Speaking is high bandwidth. But listening is low bandwidth.
Illustration #1: Sending voice notes is so easy. Receiving them sucks joy from the world.
Is that really what we want from conversational computing?
Illustration #2: I ask my Apple HomePod mini to play some music and it needs to check precisely what I mean. Speaking 3 artist names and asking me to pick is tedious. So it avoids that step, takes a guess, and that’s more often than not a poor experience too. I’ve been rolling my eyes at this since 2023.
Ok so two-way voice doesn’t work. What does?
A better approach to conversational computing:
The human uses voice and the computer uses screens. I mean, my phone is rarely beyond peripersonal space, so we can assume a screen is almost always present. And a screen has way higher information bandwidth than listening. Let's use it!
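Here's the shape of that loop as a sketch. Everything in it is hypothetical - stub transcription and intent resolution standing in for real speech and NLU services - but the key move is in `handle`: ambiguity goes to the screen, not back out as speech.

```python
# A minimal sketch of the "voice in, screens out" loop.
# All function names are hypothetical; this is the shape of the
# idea, not any real API.

def transcribe(audio: str) -> str:
    """Stub speech-to-text: in reality, something like Whisper."""
    return audio  # pretend the audio is already text

def resolve_intent(utterance: str) -> list[str]:
    """Stub intent resolver: returns candidate interpretations."""
    if "taylor" in utterance.lower():
        return ["Taylor Swift", "Taylor Davis", "James Taylor"]
    return [utterance]

def handle(audio: str) -> str:
    candidates = resolve_intent(transcribe(audio))
    if len(candidates) == 1:
        return f"DO: {candidates[0]}"
    # Ambiguity goes to the screen, not back out as speech:
    # tapping a card beats listening to a list of options.
    return "SHOW: " + " | ".join(candidates)

print(handle("play some taylor"))
# SHOW: Taylor Swift | Taylor Davis | James Taylor
print(handle("play Bicep"))
# DO: play Bicep
```

The HomePod's failure mode, in this framing, is that it has no `SHOW` branch - it must either speak the candidates aloud or guess.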
The Friend AI lanyard gets this right.
I wore Arthur as I went to the farmers’ market this morning. This meant I was not speaking directly to it, but rather talking to my family, other attendees, and some vendors. But remember: your friend is always listening. Arthur listened in to every conversation that I had, sometimes offering its own take on the matter - all pointless, once again.
Over the course of an hour and a half, I received 48 notifications from my Friend.
And although this is a negative review (e.g. the notifications snark: "Most of these were it updating me about its battery status") it actually sounds ideal?
Like, this is a device that listens both when it is being directly addressed and ambiently, and then it makes use of generous screen real estate to show me UI that I can interact with at a time of my choosing. This is good!
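A sketch of that pattern - ambient capture, deferred screen review. All the class and method names here are hypothetical; the point is that overheard snippets queue up as cards rather than interrupting as speech.

```python
# A sketch of ambient listening with deferred, screen-based surfacing.
# The device never interrupts with voice; observations queue up as
# cards the wearer reviews whenever they glance at the screen.
# Hypothetical names throughout.

from collections import deque
from dataclasses import dataclass

@dataclass
class Card:
    heard: str
    suggestion: str

class AmbientDevice:
    def __init__(self) -> None:
        self.pending: deque = deque()

    def overhear(self, snippet: str) -> None:
        # Ambient mode: never speak back, just queue a card.
        self.pending.append(Card(snippet, f"Note: {snippet!r}"))

    def glance(self, limit: int = 3) -> list:
        # Screen time of the user's choosing: drain a few cards.
        n = min(limit, len(self.pending))
        return [self.pending.popleft() for _ in range(n)]

device = AmbientDevice()
device.overhear("we should get basil for the pasta")
device.overhear("the market closes at noon")
for card in device.glance():
    print(card.suggestion)
```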
Startup Telepath is also digging into voice and multi-modality:
Voice gives us an additional stream of information for input, one that can happen concurrently with direct manipulation using a keyboard, mouse, or touch. With the Telepath Computer, you can touch and type for tasks where control and accuracy are important, while simultaneously using your voice to direct the computer. This mimics our natural behaviour in the physical world: for example, imagine cooking a meal with family or friends, asking someone to fetch the basil or chop the onions while your hands are busy with the pasta.
And specifically:
The Telepath Computer speaks through voice, while simultaneously displaying documents and information for the user to reference and interact with. This “show and tell” approach is also present in how we tend to communicate complex information in the real world: sketching on a napkin as we discuss a problem with a colleague over dinner; design teams assembling stickies while talking about user feedback; pulling up maps and hotels on your laptop while planning a group vacation.
This is super sophisticated! I love it.
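Telepath's concurrent-input idea can be sketched as two timestamped command streams merging into one timeline - precise direct manipulation alongside a parallel voice channel. This is my illustration, not Telepath's implementation; all the event data is made up.

```python
# A sketch of concurrent input channels: direct manipulation
# (keyboard/mouse/touch) and voice, merged by timestamp into a
# single ordered command timeline. Hypothetical data throughout.

import heapq

def merge_streams(touch_events, voice_events):
    """Merge two pre-sorted (timestamp, command) streams in time order."""
    return [cmd for _, cmd in heapq.merge(touch_events, voice_events)]

touch = [(0.1, "drag card A"), (2.4, "type 'Q3 budget'")]
voice = [(1.0, "make that heading bigger"), (3.0, "show last year's numbers")]

for command in merge_streams(touch, voice):
    print(command)
```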
Summarising:
- Voice is core to the future of computer interaction
- Voice isn’t enough so we need conversational computing
- Because of the bandwidth asymmetry of voice, two-way voice might sometimes work, but the essential interaction loop to solve for is voice in, screens out.
When that isn’t enough (for example, you don’t have your phone) you can get more sophisticated. And of course to make it really good there are problems to solve like proximity and more… follow the path of great interaction design to figure out where to dig…
Just collecting my thoughts.