Rooms, voice, gestures, and why Apple HomePod hasn’t quite clicked for me

20.50, Thursday 26 Jan 2023

I’ve been thinking about gestures and rooms as some of the primitives for situated computing, and how acts like pointing and speaking could be braided together.

I recently got an Apple HomePod mini for my home office, for reasons. It’s the only smart speaker in the house – I’ve always been cautious about privacy, and anyway my use of voice (via my watch) has never stretched beyond setting the timer for cooking and finding out how old celebrities are.

That said, I now call out to the HomePod to play music, and it is 70% ok and 30% frustrating as hell.

For instance: it didn’t understand what track I wanted so I kept saying Hey Siri… followed by various incantations. Then I fell back to playing the album and using my phone to skip to the right track. But the UI to select the device from the Music app is buried on the Now Playing screen and in totally the wrong place in the user flow. Or should I be accessing this via the Home app? And, and…

So I started making notes about how it could be better. Like:

  • Siri should be conversational if it needs to disambiguate or if I need to fix a response (called “repair”)
  • I should be able to point my phone at my HomePod and have the Music app appear immediately, for that specific device
  • Besides, devices should be top-level objects across the OS, and Home should make compatible apps available dynamically in a folder, with the Home app itself just used for settings
  • Volume control is just baffling (visually it looks like the volume of my phone, but actually I’m controlling the volume of the music… or the volume of the device?)

…and so on.

But when I imagine this it feels convoluted and piecemeal. It may be “logical” but it would be hard for users (me!) to build their own mental model; it would still be hard for designers to reason about.

This is the same place I ended up when trying to reinvent 1950s collaborative map rooms – the conceptual framework is all wrong.


(Naturally the correct response to being momentarily grumpy about the UX of playing a song is to write up a demand for 5 years of work as a blog post…)


The room is an environment for embodied interaction

I don’t know quite where this will land but I have just a hunch, just the outlines of a conceptual framework.

We’ve got a couple of competing models already:

  • HomePod descends from voice-first smart speakers (a category invented by Amazon in 2014 with the Echo) – and voice interfaces suffer from poor discoverability and expressiveness, even if you solve for conversation and repair
  • Smart home gadgets (lighting, HVAC, security) treat the phone as a remote control – but physical gadgets are by their nature shared, and phones are by their nature personal (and not carried by everyone).

So those don’t work, and clash besides.

Instead I’d like a conceptual framework that starts with a few principles:

  • The room is the common ground of the interface – the place of mutual knowledge between user and distributed “computer”. It’s analogous to a desktop on a screen, if you like
  • Physical things are icons (I unpacked icons here) – by which I mean that this is where the user mentally locates their state, even if that state literally happens to be in the cloud or whatever. Alerts, conceptually, come from the relevant device (and maybe that’s done with badges or maybe it’s done with spatial audio)
  • Groups not users. A room can contain any number of people; people have different roles
  • People are embodied. Gesture matters, orientation matters.

Then the way we break this down is to focus on phases of interaction, not mode (voice, keyboard, etc), and ensure that what we’re doing is humane (familiar, intuitive, call it what you will).
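To make those principles slightly more concrete, here’s a minimal sketch of the kind of model I have in mind – the room as the top-level object, things carrying their (conceptual) state, people with roles, positions, and orientation. Every name and field here is made up for illustration; it’s Python only because that’s handy.

```python
# A hypothetical data model for "room as common ground".
# Nothing here is a real API; it's just the shape of the idea.
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    role: str                       # e.g. "resident", "guest"
    position: tuple[float, float]   # where they stand in the room (metres)
    facing: float                   # orientation, in radians

@dataclass
class Thing:
    name: str
    kind: str                       # "lamp", "speaker", "stove"...
    position: tuple[float, float]
    state: dict = field(default_factory=dict)  # state is *located* here, conceptually

@dataclass
class Room:
    people: list[Person]
    things: list[Thing]

office = Room(
    people=[Person("matt", "resident", (0.0, 0.0), 0.0)],
    things=[Thing("homepod", "speaker", (2.0, 1.0), {"playing": None, "volume": 0.4})],
)
```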

For example: the micro-interaction of focus.

There is always a moment where a user selects an object to talk to; to grant focus for subsequent commands. Right now I do that by using a wake word: Hey Siri.

But now that I’m thinking in terms of acts, I realise that, sure, I could use a wake word, but focus could also be granted by a gestural wake: pointing or glancing or unambiguously stepping closer.

I talked about this before: How I would put voice control in everything (2020).

Why can’t I point at a lamp and say “on” and have the light come on? Or point at my stove and say “5 minutes”? Or just look at it and talk, if my hands are full.
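As a sketch of how that point-then-speak step might resolve (entirely hypothetical, assuming the room knows where its things are and can estimate a pointing direction): whichever thing lies closest to the pointing ray, within some tolerance, takes focus, and the short utterance is then interpreted for that thing.

```python
# Hypothetical: route a short utterance to whatever I'm pointing at.
import math

THINGS = {                  # name -> position in the room (metres)
    "lamp":  (2.0, 1.0),
    "stove": (-1.5, 3.0),
}

def angular_diff(a: float, b: float) -> float:
    """Smallest angle between two directions, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def bearing(origin, target):
    """Direction from where I stand to the target."""
    return math.atan2(target[1] - origin[1], target[0] - origin[0])

def focused_thing(origin, pointing, tolerance=math.radians(15)):
    """The thing closest to my pointing direction, if it's within tolerance."""
    best = min(THINGS, key=lambda n: angular_diff(pointing, bearing(origin, THINGS[n])))
    if angular_diff(pointing, bearing(origin, THINGS[best])) <= tolerance:
        return best
    return None

def handle(origin, pointing, utterance):
    thing = focused_thing(origin, pointing)
    if thing is None:
        return "no focus – fall back to a wake word"
    return f"{thing} <- {utterance}"   # interpret the utterance *for this thing*

print(handle((0.0, 0.0), math.atan2(1.0, 2.0), "on"))          # pointing at the lamp
print(handle((0.0, 0.0), math.atan2(3.0, -1.5), "5 minutes"))  # pointing at the stove
```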

Then there’s the micro-interaction of issuing a command.

Sure you might point and speak. But then we might also say that anyone in the room can hold up their phone and see that action occurring on the object’s soft interface, an app screen, so they can clarify either by talking or tapping… a kind of “lean closer” moment.

Aside from interaction design, there are broader questions:

  • Voice – how can it not be lame? LLMs provide a route here, because it’s now possible to match statements of equivalent intent (which are nearby in latent space) rather than relying on the fixed nouns and verbs declared by the developer. Is there something like a device API or scripting surface that means the voice interface can be auto-generated? (There’s a rough sketch of this after the list)
  • Clarifications, repair, and shortcuts – we’re not dealing with “commands” here but micro-conversations in which a device can ask for more details to fill in the gaps. But, equivalently, how can a user speed up an interaction if they know what they’re asking for?
  • Affordances and discoverability (how do I know what this device can do) – a big one! But is this answered by making every interaction multimodal? Is holding up your phone equivalent to hitting the COMMAND key on an iPad and seeing an overlay of all the shortcuts?
  • Universality, privacy, and roles – how can anyone come into the room and turn on the lights? My dream would be to do all of this entirely on-device…
  • Commerciality – is it possible to instrument a room such that interaction funnels are measurable for iteration, without breaking privacy? How do you grow the ecosystem when there are always going to be dumb devices, and what interop is possible?
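Here’s the rough sketch promised for the voice point above: the device declares its actions with plain-language descriptions, and an utterance is matched to the nearest description in embedding space rather than parsed against a fixed grammar. The embed() function is a stand-in for whichever embedding model you like; none of these names are a real API.

```python
# Hypothetical: auto-generate a voice interface from a device's declared actions
# by matching utterances in embedding ("latent") space.
import math

def embed(text: str) -> list[float]:
    """Stand-in for a sentence-embedding model (LLM API, local model, etc)."""
    raise NotImplementedError("plug in an embedding model here")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Imagine the speaker exporting this, instead of a fixed grammar of nouns and verbs:
SPEAKER_ACTIONS = {
    "play(track)":   "start playing a particular song, album or artist",
    "skip()":        "skip to the next track",
    "set_volume(x)": "make the music louder or quieter",
}

def resolve(utterance: str, actions: dict[str, str]) -> str:
    """Pick the declared action whose description is nearest in latent space."""
    u = embed(utterance)
    return max(actions, key=lambda a: cosine(u, embed(actions[a])))

# resolve("put on that Eno Hyde record again", SPEAKER_ACTIONS)  # -> "play(track)"
```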

A broader question: how does this play with telepresence, connected spaces, or overlaying virtual and physical space? I feel like the answer to how to access devices remotely is downstream of this bigger framing.

Last, I think the scope of what is in a room has to increase. A projector, a TV, a spare screen, and other devices are as much part of this computing environment as speakers and gadgets.


I know it’s simple. But I find this conceptual framework easier to work with, and more generative for ideas, than considering devices in an isolated fashion? I guess what I’m after is something as straightforward to grasp, as achievable, and as profound as the desktop metaphor itself, only for situated computing.


In an initial sense I would like to have interactions that are simply:

  • (Glances over.) What’s that Eno Hyde album I’ve been playing a bunch lately? – Someday World – Yes, play that.
  • Pointing while holding my phone and choosing an album with the app, then someone else in the room disagrees and calls out “skip song”.

But I would also like to have that multiplayer map room, with projectors and shared displays and personal displays and proximity audio for hybrid presence, and to be able to clearly set out the technology Lego bricks to achieve this.

Or imagine doing something like writing on a piece of paper and holding it up to a webcam as a natural step in a conversation, and knowing how that would be integrated in the interactions.

I’m talking about rooms and homes here, but Just Walk Out by Amazon is also a situated interface… it’s a computer with cameras and sensors (and the ability to take payment from credit cards), situated in a shared environment. How can that fit into the same conceptual framework?


Anyway. Room as distributed computer that we stand in. Objects have state. Interaction-first not mode-first.

Then you figure out how it actually works by building and trying. I’ve started experimenting (just on my own) with hysteresis curves for focus in pointing-based interactions. It’s intriguing to play with gestures. But that’s another story.
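(For the curious, this is roughly what I mean by a hysteresis curve for focus: it should take a fairly deliberate point to gain focus, and a bigger wobble to lose it, so that focus doesn’t flicker while your hand wavers. The thresholds below are arbitrary guesses, and the code is just an illustration.)

```python
# Hypothetical: hysteresis for pointing focus – harder to acquire than to keep.
import math

class PointingFocus:
    def __init__(self, acquire_deg=10.0, release_deg=25.0):
        self.acquire = math.radians(acquire_deg)  # must point this accurately to GAIN focus
        self.release = math.radians(release_deg)  # can drift this far before LOSING it
        self.focused = False

    def update(self, pointing_error: float) -> bool:
        """pointing_error = angle between the pointing ray and the target, in radians."""
        if self.focused:
            self.focused = pointing_error <= self.release
        else:
            self.focused = pointing_error <= self.acquire
        return self.focused

focus = PointingFocus()
for deg in [30, 12, 8, 15, 20, 27, 22]:          # a wobbly hand, sampled over time
    print(deg, focus.update(math.radians(deg)))  # gains focus at 8°, holds through 20°, drops at 27°
```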


Update 27 Jan. Steven Smith stopped by to let me know about Handoff in iOS, which does indeed pop open the remote control pane for the HomePod when I hold my phone a couple of inches away! So let’s take that as a mini thought experiment, because it’s a neat capability: from a micro-interaction perspective, I would want this capability to meet an “increase engagement” moment in a conversation, and to be available and afforded at that moment. While the pane itself is good, the current gesture is an “initiating” micro-interaction. So this pane should be peeping on my phone whenever I’m in a conversation with that HomePod using Siri, in the same design language as the remote controls for the current room (currently buried in Control Center).

BUT this also feels a bit like getting lost in the weeds – step one is the framework. What’s the UI for a room?
