Telling the computer what to think about

18.05, Tuesday 21 Sep 2021

There’s a subtle interaction in Apple’s new “Live Text” feature which I think is clever (though I would like to know how deliberate it is). We could call it focus and perhaps it’s a tiny clue for how we’ll interact with future intelligent machines.

Text in photos is detected automatically in the new iOS 15 for iPhone, released this week. (I believe Android has had this functionality for some time, but I’m not sure exactly how it works as I’ve been iPhone-only for a couple of years.)

Live Text (that’s the Apple support doc) lets you copy and paste text from saved photos or even the live camera and it’s pretty magical: phone numbers from billboards, paragraphs from pics of books, even handwriting on post-its.

Here’s the subtlety, using the built-in Photos app:

  • An indicator icon appears when your phone sees text. Tap the icon to highlight the words
  • Sometimes the icon doesn’t appear! No text has been automatically recognised. This seems to happen especially with handwriting
  • Long press anyway on the text in the photograph - the phone will, if it can, select the text so you can copy/spell check/translate/etc.

I don’t know what to call this “long press and try again even though it didn’t work the first time” mechanic.

Perhaps: try harder?

It’s like the algorithm didn’t have enough confidence first time round to bring it to your attention, but with some reassurance it can go ahead.

Tap to focus?


What is focus?

The brain has scarce realtime processing capacity. Attention is the feeling of it being directed.

I think of the attentional system as a pyramid. At the top is focus: it’s singular, you’re conscious, tuned in.

At the bottom is everything you’re not aware of. Your brain throws away background sounds, peripheral vision, the texture of clothes on your skin and the feel of the tongue in your mouth.

In the middle is what your brain is keeping tabs on. It’s unconscious, but if you hear your name spoken or a light flashes in the corner of your eye, the source of the perception will jump up to focus.

There is a catalogue of heuristics that your brain uses to move things up and down the pyramid. An object growing from a point in your visual field will bring it to attention automatically, for example, but the same object fading in will not. But you can consciously choose to focus too.

So “attention” is this collaboration between the conscious mind and unconscious processes.


The analogy here is to centaurs, Garry Kasparov’s model for AI symbiosis: half intuitive human, guiding up top; half brute force machine providing the horsepower.

With Live Text there is a collaboration between the conscious human user, who can step in and focus the show, and the algorithm, which keeps tabs on absolutely everything but may need a confidence boost from time to time if it doesn’t automatically bring something to attention. The algorithm, always running, plays the role of the unconscious.

You need both working together.

Live Text without the ability to step in would feel frustrating and disempowering. You would have moments, as a user, where you could plainly read some text, but the computer wasn’t detecting it for whatever reason, and you would lack the ability to say – dammit right there!

But Live Text without automatic detection would be cumbersome, having to point and tap on every single photo to tell it to do something.

There’s a similar interaction in Apple’s new Cinematic Mode, which semi-automatically pulls focus as you shoot video. From this Engadget review: “You can tap on parts of your viewfinder to change focal points as you shoot or let the iPhone decide for you by analyzing who and what’s in the scene.”

Centaur film director!


So I want to generalise: every always-running algorithm should have the ability to be steered. Even, sometimes, when it doesn’t think it ought to run: to try harder, to focus where it’s told, to execute the next lines of code anyway, and to just do its best.

And given our future with machines will be all about these always-running algorithms and processes, maybe this should be a standard user interface widget?

Like: we have tap, swipe, pull to refresh, toggles, lists, and all the rest… can we have a visual representation that says “this is where the algorithm is looking” and, what’s more, “this is where I want the algorithm to look next”?

Focus is the right underlying metaphor, given the connection to the attentional system and resource allocation, but it feels too abstract.


AN IDEA for that standard UI widget: the playhead.

This is a bit of a reach, so a tangent…

Azlen Elza’s project, Geological Phonograph:

“Just like a needle of a phonograph moves along the grooves of a record, what would it sound like to move a needle along the grooves and topology of the Earth? Each mountain, valley, field, or ocean might have its own unique sound, a special timbre or series of harmonics never before heard but waiting to be discovered in the very land beneath our feet.”

The project is unfinished, but there is a tantalising photograph of a tiny screen showing a topographical map, twisting contour lines, with neat circles drawn on it - playheads - and a breadboard with four rotary encoders by way of sonic control.

Music is generated and plays as the circles drift over the topography.

There’s something gorgeous here about the collision in scales: ridges on the earth that would take hours to traverse treated as the ridges on a vinyl record, microscopic.

BUT, to me, what’s most thought-provoking is the playhead.

A point of focus that says to the machine: hey, out of the entire surface of the planet, play this exact bit right here.


Go back to what computers are all about, deep down – the Turing machine is a strip of tape filled with data and symbols, and the tape is moved back and forth and executed by, well, a playhead.
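
(A toy version of that, just to make the playhead literal: the tape is a list, the playhead is an index, and the rule table below is something I’ve made up purely for illustration.)

    # A toy Turing machine: the "playhead" is just an index into the tape.
    # The rule table is invented for illustration only; it flips bits until
    # it reaches blank tape, then halts.
    tape = list("1101") + ["_"] * 4
    head = 0
    state = "scan"

    # (state, symbol read) -> (symbol to write, head movement, next state)
    rules = {
        ("scan", "1"): ("0", +1, "scan"),
        ("scan", "0"): ("1", +1, "scan"),
        ("scan", "_"): ("_", 0, "halt"),
    }

    while state != "halt":
        write, move, state = rules[(state, tape[head])]
        tape[head] = write
        head += move

    print("".join(tape))  # -> 0010____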

All that code running on the phone is a thousand processes and a thousand playheads.

Sometimes the detection algorithm gatekeeps the interaction code, and if the text or face or object isn’t detected, then the playhead never reaches the code for the pop-up menu and the user never sees it.

So I’m imagining a visible playhead, somehow, which the user can direct to anything on the screen and say, hey, run the code you would run if you detected a phone number right here (or a face, or a block of text, or whatever).

Maybe it’s a long press? And then every available detection algorithm runs in turn.

Maybe you pull down a shade over the screen, and everything that has been detected but at low confidence takes on a little shimmer somehow, like a side mission in a video game?

Maybe it’s a tiny bird that lives behind the notch, and you drag down a trail of seed.

How do we tell the computer where to look? (And then there’s sound.)
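
To pin down that first maybe, the long press: a minimal sketch of “try harder”, where every function is a hypothetical stand-in (none of this is a real API) and the only point is the shape of the loop. On a long press, every available detector gets re-run over the region around the press point, even though the automatic pass surfaced nothing there.

    # Hypothetical stand-ins, not real APIs: each detector would return a
    # result (recognised text, a face box, a phone number) or None.
    def detect_text(crop):
        return None

    def detect_faces(crop):
        return None

    def detect_phone_numbers(crop):
        return None

    def crop_around(image, point, radius):
        return image  # would cut out a region centred on the press point

    def try_harder(image, press_point, radius=200):
        """Long press: run every available detection algorithm in turn."""
        crop = crop_around(image, press_point, radius)
        for detect in (detect_text, detect_faces, detect_phone_numbers):
            result = detect(crop)
            if result is not None:
                return result  # first hit wins: show the pop-up menu
        return None            # still nothing: fail quietly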


SEPARATELY:

Elza’s Geological Phonograph makes me imagine an equivalent Sky Phonograph that is a camera pointing up at the dome of the sky, playing what it sees as it sees it, 24 hours a day, 365 days a year.

It would do a 2D Fourier transform to pick out frequencies, radially, I think, and play according to that. So a clear or gently overcast sky would have a low hum, but with broken clouds you would hear an orchestra of higher frequency tones, whipping this way and that as the wind picked up.
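
(Hand-waving, but here’s roughly how that radial read-off could work, as a sketch: take the 2D FFT of a camera frame, bin the spectrum by distance from the centre, and let each bin drive a tone. The frame and bin count below are placeholders, not part of any real build.)

    import numpy as np

    def radial_spectrum(frame, n_bins=64):
        """Collapse the 2D Fourier transform of a sky frame into a radial
        spectrum: one amplitude per band of spatial frequency."""
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(frame)))
        h, w = frame.shape
        y, x = np.indices((h, w))
        r = np.hypot(y - h / 2, x - w / 2)  # distance from the centre (DC)
        edges = np.linspace(0, r.max(), n_bins + 1)
        which = np.digitize(r.ravel(), edges) - 1
        power = np.bincount(which, weights=spectrum.ravel(), minlength=n_bins + 1)
        counts = np.bincount(which, minlength=n_bins + 1)
        return power[:n_bins] / np.maximum(counts[:n_bins], 1)

    # A smooth overcast frame concentrates energy in the low bins (a low hum);
    # broken cloud pushes energy further out (higher-pitched tones). The random
    # array here is only a stand-in for a frame from the sky camera.
    amplitudes = radial_spectrum(np.random.rand(256, 256))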

At night, the stars would sound electronic – a mix of frequencies, sure, but clear bright tones layered on one another, evolving slowly, analogue sounds coming in as clouds appeared, satellites like sliding whistles.

The music of the firmament.
