Here comes the Muybridge camera moment but for text. Photoshop too (Interconnected)

10.49, Friday 31 May 2024

Can you measure the velocity of concepts over a piece of text, e.g. 0.5 concepts/word?

Yes. Or rather, well, something like that, possibly one day soon, it’s interesting.

I want to unpack that thought.

Hey, an editorial note:

This post is for me, not for you haha

What I like to do (and what I also do for clients) is to string together weak signals and see where it takes me. I get to new places when I think out loud.

The process is… meandering. And technical. And lengthy.

So feel free to skip to the tl;dr at the bottom if you want to know where I end up.

Background: Embeddings

There’s an AI-adjacent technique called “embeddings.” A word, or a phrase, or a paragraph is mathematically converted into coordinates. Just like a location on a map is described by lat and long.

Only the “map” in this case is a map of concepts. So if two phrases mean roughly the same thing, their coordinates are close together. If they mean different things, they’re further away.
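To make "close together" concrete, here's a minimal sketch of one common way to compare two sets of coordinates: cosine similarity. The three vectors are made up by hand for illustration and have only 3 numbers each. Real embeddings come out of a model and are much longer.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3D "embeddings", purely illustrative. Not output from any model.
embeddings = {
    "the biggest planet": [0.9, 0.1, 0.0],
    "jupiter, the gas giant": [0.8, 0.2, 0.1],
    "how to bake sourdough": [0.0, 0.1, 0.9],
}

query = embeddings["the biggest planet"]
for phrase, vector in embeddings.items():
    # Higher score = closer together on the map of concepts.
    print(f"{cosine_similarity(query, vector):.2f}  {phrase}")
```

The two planet phrases score close to 1, and the sourdough phrase scores close to 0.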

Simon Willison has a great deep-dive into embeddings (2023).

But let me give you an example so you can get a feel for this…

I built an embeddings-powered search engine for my unofficial BBC In Our Time archive site, Braggoscope. There are about 1,000 episodes on all kinds of cultural and historical topics, so it’s a good case study.

Go to Braggoscope and hit search:

Search for jupiter – the episode about the planet Jupiter is at the top of the results
Search for the biggest planet – again, the episode about the planet Jupiter is at the top. There is no synonyms database here, which is how a traditional search engine would work. The phrase the biggest planet has been translated into its “coordinates” and the search engine has looked for “nearby” coordinates representing episodes.
Search for main roman god – this is also Jupiter, but a different one: this Jupiter is the king of the gods in the Roman pantheon. The top result is an episode about Rome and European civilisation, not the episode about the planet Jupiter, showing that embeddings can distinguish concepts even when similarly named.
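Those three searches can be sketched as a nearest-neighbour lookup. This is a toy under loud assumptions, not the Braggoscope code: the episode labels, the query vectors and the 3 axes (roughly astronomy, ancient Rome, food) are all invented, and a real system would get its vectors from an embedding model and store them in a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up vectors. Axes loosely read as (astronomy, ancient Rome, food).
EPISODES = {
    "Jupiter (the planet)": [0.95, 0.20, 0.00],
    "Rome and European civilisation": [0.05, 0.95, 0.05],
    "A hypothetical episode about food": [0.00, 0.10, 0.95],
}

# Made-up query vectors standing in for what an embedding model would return.
QUERIES = {
    "the biggest planet": [0.9, 0.1, 0.0],
    "main roman god": [0.3, 0.9, 0.0],
}

def search(query_vector, index, top_k=2):
    """Return the top_k titles whose vectors are nearest to the query."""
    ranked = sorted(
        index,
        key=lambda title: cosine_similarity(query_vector, index[title]),
        reverse=True,
    )
    return ranked[:top_k]

for text, vector in QUERIES.items():
    print(text, "->", search(vector, EPISODES))
```

There's no synonym table anywhere in that sketch. The ranking falls out of distances alone.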

I wrote a technical deep dive on how to create this search engine back in January on the PartyKit blog: Using Vectorize to build an unreasonably good search engine in 160 lines of code (2024). (That post has diagrams and source code.)

But what I want to emphasise is how little code there is.

Embeddings are coordinates in concept-space (technically called “latent space”). You get things like search for free.

But embeddings also change our relationship with text, and what we can do with text, and I just want to use this post to collect a few hints and speculations as to what that means…

An instrument to see the invisible semantic structure of text

Back to concept velocity.

First, look at this visual plot of an essay by Douglas Engelbart by the user oca.computer (@ocuatrecasas) on X/Twitter (June 2023).

Here’s a screenshot if you’re not on X.

There’s a rainbow-coloured line swooping around a 3D graph.

What is that line? We’re looking at an essay. Specifically, the first section of this seminal essay from computing history, Augmenting Human Intellect (1962) by Douglas Engelbart.

So embeddings aren’t 2-dimensional coordinates, like the lat-long coordinates of a map. They have about 1,000 dimensions. Obviously we have no way to visualise that. But through techniques of dimensional reduction, we can squash those 1,000 dimensions down to something we can see.

An analogy: your hand is 3 dimensional. You can project a shadow onto a wall. That’s dimensional reduction: the shadow is 2D. There’s some information lost, sure. For instance, you won’t be able to distinguish your fingers if your hand is side-on to the light. But it’s good enough.
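The shadow analogy can be written down directly. This is the crudest possible dimensional reduction (just throw away one coordinate), and the finger positions are invented, but it shows the information loss.

```python
def shadow(point):
    """Project a 3D point onto the wall (the x-y plane) by dropping depth z."""
    x, y, z = point
    return (x, y)

# Invented fingertip positions: two fingers lined up one behind the other
# relative to the light, plus a thumb off to the side.
index_finger = (1.0, 4.0, 0.0)
middle_finger = (1.0, 4.0, 2.0)
thumb = (3.0, 2.0, 1.0)

# The two fingers differ only in depth, so their shadows coincide.
print(shadow(index_finger) == shadow(middle_finger))  # True
# The thumb is somewhere else on the wall, so it is still distinguishable.
print(shadow(thumb) == shadow(index_finger))  # False
```

Real techniques pick the projection more carefully so that less is lost, but the trade-off is the same.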

The process is:

Starting with word 1, take N words of the essay (say, 20 words, I don’t know how many exactly)
Create an embedding
Roll the window forwards: starting with word 2, take N words, create the embedding
Repeat until you’ve done the entire essay.
Reduce the embeddings down to 3 dimensions
Plot them on a chart, and connect them with a line. Make it a rainbow because rainbows are nice.
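Here's a rough sketch of those steps, minus the plotting. Two big assumptions: toy_embed is a hypothetical stand-in (a hashed bag of words) because I'm not calling a real embedding model here, and the squash down to 3D uses a fixed random projection, which is a swapped-in technique and not necessarily what the original plot used. The window size of 5 is also arbitrary.

```python
import random
import zlib

def sliding_windows(words, n):
    """Yield every run of n consecutive words, advancing one word at a time."""
    for start in range(len(words) - n + 1):
        yield words[start:start + n]

def toy_embed(window, dims=64):
    """Hypothetical stand-in for an embedding model: a hashed bag of words."""
    vector = [0.0] * dims
    for word in window:
        # crc32 is stable between runs, unlike Python's built-in hash().
        vector[zlib.crc32(word.lower().encode("utf-8")) % dims] += 1.0
    return vector

def make_projection(dims_in, dims_out=3, seed=0):
    """Build a fixed random matrix that maps dims_in numbers down to dims_out."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims_in)] for _ in range(dims_out)]

def project(vector, projection):
    """Apply the projection matrix to one vector."""
    return tuple(sum(w * x for w, x in zip(row, vector)) for row in projection)

text = ("if two phrases mean roughly the same thing their coordinates are "
        "close together if they mean different things they are further away")
words = text.split()

windows = list(sliding_windows(words, n=5))
projection = make_projection(dims_in=64)
path = [project(toy_embed(window), projection) for window in windows]

# path is now a list of 3D points, one per window, ready to plot as a line.
print(len(words), "words ->", len(path), "points")
```

Because neighbouring windows share most of their words, neighbouring points land near each other, which is why the result reads as a continuous line.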

This visualisation has been living in my head since I first saw it a year ago.

Because it’s not just that we have a visualisation of a single essay…

It points at a future where we can:

put essays side by side on the same chart, and see where their topics intersect
measure how fast a given piece of text moves through the space of all concepts, compared to other text
and observe how it twists, turns, gyres and loops back on itself.
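As a first stab at "how fast" and "loops back on itself", here are two numbers you could compute from a path of embeddings like the one in that plot. Both definitions are my own guesses at what to measure, and the two paths are invented 2D points. The unit is distance per word in embedding space, which is not quite "concepts per word" but is a start.

```python
import math

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def path_length(path):
    """Total distance travelled along the path, step by step."""
    return sum(distance(a, b) for a, b in zip(path, path[1:]))

def concept_velocity(path, words_per_step=1):
    """Average distance travelled through embedding space per word."""
    if len(path) < 2:
        return 0.0
    return path_length(path) / ((len(path) - 1) * words_per_step)

def straightness(path):
    """Net displacement divided by distance travelled (1.0 = never doubles back)."""
    travelled = path_length(path)
    if travelled == 0:
        return 0.0
    return distance(path[0], path[-1]) / travelled

# Invented paths: one text creeps back and forth, the other bounds along.
brick_laying = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.1, 0.0), (0.2, 0.0)]
bounding = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]

for name, path in [("brick-laying", brick_laying), ("bounding", bounding)]:
    print(f"{name}: velocity {concept_velocity(path):.2f}, "
          f"straightness {straightness(path):.2f}")
```

On these toy paths the bounding text moves ten times faster and never doubles back, while the brick-laying text spends half its travel retracing its steps.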

Which provokes questions:

can individual authors be fingerprinted by how they construct text?
in different moods, do I prefer texts that bound along through concepts, or texts that carefully lay bricks back and forth, building up?
could we see, actually see, rhetorical tricks and gaps in logic?

Looking at this plot by @oca.computer, I feel like I’m peering into the world’s first microscope and spying bacteria, or through a blurry, early telescope, and spotting invisible dots that turn out to be the previously unknown moons of Jupiter…

There is something there! New information to be interpreted!

An aside on dimensional reduction:

You can reduce approx. 1,000 dimensions to 3D, for that plot above, or 2D for Nomic’s map of people in Wikipedia.

A friend on Discord asked – can you reduce to 1 dimension? i.e. a list?

So I tried it, and yes you can.

Here’s a linked list of episodes of BBC In Our Time: each episode is closely related to the ones before and after. It’s great for browsing.

For example, here’s a sequence of episode titles that transitions smoothly from geology to history:

Vulcanology
1816, the Year Without a Summer
Climate Change
Meteorology
Voyages of James Cook
Astronomy and Empire
Longitude
…and so on.

This uses PCA (principal component analysis) to find the most significant vectors, then t-SNE for the dimensionality reduction (it takes into account information in the higher dimensions to perform clustering).
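To give a feel for reducing to a list, here's a simplified sketch that does only the PCA half: find the single direction along which the embeddings are most spread out, then sort by position along it. It skips t-SNE entirely, so it is not the pipeline described above. The 2D vectors (roughly geology vs history) are made up, and the power iteration is a plain-Python stand-in for a proper library.

```python
import math

def first_principal_component(vectors, iterations=50):
    """Find the direction of greatest spread using power iteration."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    centred = [[v[i] - mean[i] for i in range(dims)] for v in vectors]
    # Arbitrary starting guess. Assumes it is not exactly perpendicular to the
    # answer, and that the vectors are not all identical.
    direction = [1.0 / (i + 1) for i in range(dims)]
    for _ in range(iterations):
        # Multiply by the covariance matrix without ever building it.
        scores = [sum(c * d for c, d in zip(row, direction)) for row in centred]
        direction = [sum(s * row[i] for s, row in zip(scores, centred))
                     for i in range(dims)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
    return mean, direction

def order_along_line(items):
    """Sort labels by where their vectors fall along the first principal component."""
    labels = list(items)
    mean, direction = first_principal_component([items[label] for label in labels])

    def position(label):
        return sum((x - m) * d for x, m, d in zip(items[label], mean, direction))

    return sorted(labels, key=position)

# Made-up 2D vectors, deliberately listed out of order.
EPISODES = {
    "Meteorology": [0.5, 0.5],
    "Vulcanology": [0.9, 0.1],
    "Voyages of James Cook": [0.3, 0.7],
    "Longitude": [0.1, 0.9],
    "Climate Change": [0.7, 0.3],
}

for title in order_along_line(EPISODES):
    print(title)
```

The list may come out in either direction (geology first or history first), because the sign of a principal component is arbitrary.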

It’s a neat trick, and thank you Alex Komoroske for suggesting it!

Real-time hermeneutics

Here’s an adjacent idea that is actually quite different (and not to do with embeddings)…

How quickly does time move in fiction?

Answer: faster than it used to.

The average length of time represented in 250 words of fiction had been getting steadily shorter since the early eighteenth century.

– Ted Underwood, Using GPT-4 to measure the passage of time in fiction (2023)

As previously discussed.

Check out the article for an amazing chart that shows that:

Gulliver’s Travels (1726) averaged at just under a week per 250 words of narrative
The Old Man and the Sea (1952) barrels along at 10 mins per 250 words.

I’ve mirrored the chart here in case it goes away.

BUT.

The key point is acceleration.

Underwood ran the analysis twice: once with grad students, and the second time using AI.

It took the three of us several months to generate this data, but my LLM experiment was run in a couple of days.

The timeframe here is 2017 to 2023.

Here’s my takeaway:

This will be real-time, soon enough.

We’re kinda getting accustomed to the idea of real-time translation (you speak in French, they hear English) although it is still mind-blowing that this will be shipping Real Soon Now with OpenAI’s GPT-4o.

But real-time text hermeneutics, unearthing the hidden meaning of text and between texts? That’s wild.

For instance, crossing this point with the previous one…

What would it mean to listen to a politician speak on TV, and in real-time see a rhetorical manoeuvre that masks a persuasive bait and switch?

What if the difference between statements that are simply speculative and statements that mislead is as obvious as, I don’t know, the difference between a photo and a hand-drawn sketch?

Another example of AI hermeneutics:

Back in May 2023 I gave a board talk about a strategic response to gen-AI.

In that talk I put forward this speculative idea:

extract risks from annual reports of all public firms, cluster, and analyse for new emerging risks

The idea being that company reports have to be published, and they all include a risk register, and I bet we could see the climate crisis emerging slowly and then massively over the last couple decades… so could we pre-emptively spot today’s emerging risks?

Well.

Recently somebody appeared in my inbox with a project very close to this idea.

Sean Graves at the Autonomy Institute has developed a tool called GERM.

We used GERM to build a dataset of risks mentioned by the 266,989 UK companies who filed their accounts throughout March 2024.

– Sean Graves (Autonomy Data Unit), GERM (Geopolitical & Environmental Risk Monitor) (2024)

They extract risks, create embeddings, cluster them, and then analyse the resultant map.
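I don't know which clustering method GERM uses, so treat this as a generic sketch of the "create embeddings, cluster them" step and nothing more. The risk phrases and their 2D vectors are invented, and the greedy threshold clustering is just the simplest thing that shows the idea.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster(items, threshold=0.9):
    """Greedy one-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of (label, vector); entry 0 is the seed
    for label, vector in items:
        for members in clusters:
            if cosine_similarity(vector, members[0][1]) >= threshold:
                members.append((label, vector))
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([(label, vector)])
    return [[label for label, _ in members] for members in clusters]

# Invented risk statements with invented vectors (axes: climate-ish, finance-ish).
RISKS = [
    ("flooding at our warehouse", [0.9, 0.1]),
    ("supplier insolvency", [0.1, 0.9]),
    ("extreme heat disrupting logistics", [0.8, 0.3]),
    ("customer defaults on payment", [0.2, 0.8]),
]

for group in cluster(RISKS):
    print(group)
```

Run that over filings from successive years and count the size of each cluster, and you'd have the beginnings of a chart of risks emerging over time.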

There’s a demo! Go read that article for a link.

Ok so that’s great – but… isn’t that just data mining? We’ve had data mining for ages.

The difference, for me, is that two thresholds have been crossed: speed and automation.

It won’t be long before I can say to an AI agent: hey, pull all the risks from company reports, cluster them, plot them over time, and tell me what’s emerging.

And then it won’t be long after that before this will happen continuously, in real-time, in the background, for everything.

All text will be auto-glossed - textual glossolalia - it will speak about itself in a constant virtual halo.

Again I don’t know what that means, to have associations and contextualisations always present with a text, a structuralist’s dream, but… it’s different.

Photoshop for words

So much for reading text and reading between texts. Now for manipulating text.

I don’t fully understand how this works. I mean, I couldn’t replicate it. But I can show you the effects.

I get that embeddings are math. And that by averaging a collection of embeddings, you can pick out a “feature”. For example, as Amelia Wattenberger developed, you can identify a quality of being “abstract” or “concrete” – and then show, for any given sentence, whether it is closer to being concrete or abstract.
Actually, beneath it all, in the machinery of large language models, embeddings aren’t just coordinates: they’re a collection of features.
Given a collection of features, you can reverse the embedding, and create text again.
During this process, you can amplify one or more features, and change that quality in the text, while leaving everything else intact.
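The first and last of those steps can at least be sketched as arithmetic. Loud caveats: this uses a simple difference of averages to get a feature direction, which is my assumption about the "averaging" approach and is not the sparse autoencoder method in the video. The vectors are invented 2D toys. And the hard part, turning the edited embedding back into text, isn't shown at all.

```python
def mean(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def feature_direction(positive, negative):
    """Difference of means: points from the negative examples toward the positive ones."""
    return [p - n for p, n in zip(mean(positive), mean(negative))]

def score(vector, direction):
    """How far a vector leans along the feature direction (dot product)."""
    return sum(x * d for x, d in zip(vector, direction))

def amplify(vector, direction, strength):
    """Push an embedding further along the feature direction."""
    return [x + strength * d for x, d in zip(vector, direction)]

# Invented 2D embeddings of example sentences.
abstract_examples = [[0.9, 0.2], [0.8, 0.1]]
concrete_examples = [[0.1, 0.9], [0.2, 0.8]]

abstractness = feature_direction(abstract_examples, concrete_examples)

sentence = [0.4, 0.6]  # invented embedding of some sentence
edited = amplify(sentence, abstractness, strength=1.0)

# Negative score = leans concrete, positive score = leans abstract.
print(f"before: {score(sentence, abstractness):+.2f}")
print(f"after:  {score(edited, abstractness):+.2f}")
```

The toy sentence starts out leaning concrete and ends up leaning abstract. A decoder such as vec2text would then have to regenerate words from the edited coordinates.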

Ok this is hard to imagine…

…but fortunately this is where Linus a/k/a thesephist has been digging for ages, and he made a video about it.

You’ll need to sign up to X/Twitter, and it’s a 10 minute video of a prototype: Embedding features learned with sparse autoencoders can make semantic edits to text (@thesephist, 10m47s).

You should totally watch that video. But you don’t need to right now. I’ll give a small example of his tool in use, just enough to make one point.

Using Linus’ semantic editor prototype, I paste in the first paragraph of Hitch-Hiker’s Guide to the Galaxy by Douglas Adams:

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.

It embeds the text.

But instead of showing me the embedding as coordinates, a list of numbers, it gives me a list of the underlying features and how strongly they appear.

For example, feature #620 Formal language and structure is present.

Get this:

I can now use Linus’ prototype to amplify that feature. Then re-generate the text (the prototype uses a proof-of-concept technique called vec2text).

Here’s what that paragraph looks like with feature #620 turned up:

Far out in the uncharted backwaters of the unexplored Southwestern arm of the Milky Way Galaxy lies a comparatively uninteresting little outpost of a vast and undefined planet. Orbiting this small blue green-flagged epoch are a small humanoid race of people who at a mere ninety-five milliseconds are so incredibly ignorant about digital things that they really do think a digital watch is a neat idea.

See the difference?

Look it’s not great.

But this is super early technology. vec2text will improve.

And you already get the sense of a vibe being subtly different. (Vibe #620 to be precise. Vibe will turn out to be a science, I swear.)

Imagine it improving, as it will, and…

In the future:

Being able to take a chapter of a book and edit it, not by changing words, but by scrubbing it with the semantic equivalent of the burn and dodge tools in Photoshop.

Like, could you colour-grade a book?

Could you dynamically change the register or tone of text depending on audience, or the reading age, or dial up the formality or subjective examples or mentions of wildlife, depending on the psychological fingerprint of the reader or listener?

Anthropic scaled up the feature amplification technique in their recent paper:

Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models.

– Anthropic, Mapping the Mind of a Large Language Model (2024)

They were able to identify the underlying feature for Golden Gate Bridge and - for a few days - had a version of their AI chatbot where that feature was amplified to the max for your whole conversation. It was hilarious to use.

An example:

How can I change the carburetor in a ‘68 Chevelle?

Start by closing the Golden Gate Bridge. This iconic landmark provides a beautiful backdrop for bridge photos.

Here’s a previous post about similar ideas and also an exploration of word2vec, which is like math for nouns: Horsehistory study and the automated discovery of new areas of thought (2022).

tl;dr, let’s read the tea leaves

Ok so what I’m doing is connecting dots and extrapolating:

we’re beginning to visualise the previously invisible deep structure of text
comparatively
in real-time
and then we’re learning how to manipulate text based on these hidden features.

I’m reminded of that famous series of photographs, The Horse in Motion, from 1878.

Eadweard Muybridge shocked a crowd of reporters by capturing motion. He showed the world what could be guessed but never seen – every stage of a horse’s gallop when it sped across a track.

– Smithsonian Magazine, How a 19th-Century Photographer Made the First ‘GIF’ of a Galloping Horse (2018)

Until that moment, neither scientists nor the public knew whether or not all four of a horse’s hooves came off the ground when it runs.

Imagine!

It was a controversy!

Until then oil paintings of galloping horses were incorrect! Even in 1821, horses were wrongly depicted running like dogs.

The camera was a new instrument that showed what was already present, but inaccessible to the human eye.

So now we know how horses gallop, and how birds fly, and how people move and lift and turn (all photos taken by Muybridge).

But the camera isn’t just a scientific instrument like the, I don’t know, Large Hadron Collider.

By the Saturday Evening Post, here are 5 Unintended Consequences of Photography (2022):

Photography Decided Elections

Photography Created Compassion

Photography Liberated Art

Photography Shaped How Americans Look

Photography Gave Us an Appreciation of Time

So the camera doesn’t just observe and record, it changes us.

And then there’s Photoshop…

Now we have deepfakes and unrealistic depictions of reality, and the ability to make beauty and the hyperreal. I’ll leave it to the artists to unpack that, and the effects of being able to adjust the image, and have this capability in the hands of so many, and all the rest.

Just to say:

What does Microsoft Word look like with a Photoshop-like palette on the side?

Text is becoming something new, that’s what I mean.

We’re inventing the camera and Photoshop simultaneously, and all their cultural repercussions, and to begin with this means new apps with new user interfaces, and where it goes after that I have no idea.

Update. This post hit Hacker News on 3 June (247 points, 65 comments). Here’s the thread, there are some great comments.

If you enjoyed this post, please consider sharing it by email or on social media. Here’s the link. Thanks, —Matt.
