Countries - and state-scale corporates - should each be working on their own Strategic Fact Reserve. If they’re not already.
Let me trace the logic…
AI as cognitive infrastructure
Assume for a moment that AI is upstream of productive work. It’ll start off with people just going faster with AI, and then it’ll become essential. I remember hearing that legal contracts got longer when word processors were introduced (I’d love to find the reference for this); you couldn’t handle a contract without a computer today. Same same.
So if I were to rank AI (not today’s AI but once it is fully developed and integrated) I’d say it’s probably not as critical an infrastructure or capacity as energy, food, or an education system.
But it’s probably on par with GPS. Which underpins everything from logistics to automating train announcements to retail.
Is it important for a country to own its own AI?
I just mentioned GPS. The EU has its own independent satellite positioning system called Galileo. Which makes sense. It would be an unfortunate choke point if GPS chips suddenly cost more for non-US companies, say. Or if European planes couldn’t safely land at, perhaps, Greenland, due to some plausibly deniable irregular system degradation.
Diplomacy, speak softly and carry a big stick, right?
With AI? It’s far-fetched but maybe degrading it would knock 20 points off the national IQ?
But from a soft power perspective…
We’ll be using AIs to automate business logic with value judgements (like: should this person get a mortgage? Or parole?) and also to write corporate strategy and government policy.
No, this isn’t necessarily desirable. We won’t build this into the software deliberately. But a generation of people are growing up with AI as a cognitive prosthesis and they’ll use it whether we like it or not.
However. Large Language Models Reflect the Ideology of their Creators (arXiv, 2024):
By identifying and analyzing moral assessments reflected in the generated descriptions, we find consistent normative differences between how the same LLM responds in Chinese compared to English. Similarly, we identify normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. Furthermore, popularly hypothesized disparities in political goals among Western models are reflected in significant normative differences related to inclusion, social inequality, and political scandals.
Our results show that the ideological stance of an LLM often reflects the worldview of its creators.
This is a predictable result? AIs are trained! Chinese large language models will give China-appropriate answers; American models American! Of course!
What if you’re a Northern European social democracy and your policy papers (written by graduates who are pasting their notes into ChatGPT to quickly write fancy prose) are, deep down, sceptical that, yes, citizens really will adhere to the social contract?
All of the above will not matter until suddenly it really matters.
Which means it’s important to retain independent capacity to stand up new AIs.
Can you be sure that trusted training data will be available?
What you need to build a new AI: expertise; supercomputers; training data.
The capacity for the first two can be built or bought in.
But training data… I think we’re all assuming that the Internet Archive will remain available as raw feedstock, that Wikipedia will remain a trusted source of facts to steer it, that there won’t be a shift in copyright law that makes it impossible to mulch books into matrices, and that governments will allow all of this data to cross borders once AI becomes part of national security.
There are so many failure modes.
And that’s not even getting into the AI-generated scientific research papers…
Or what if, in the future, the training data does exist – but it’s hoarded and costs too much, or you can only afford a degraded fraction of what you need.
What I mean to say is: if in 2030 you need to train a new AI, there’s no guarantee that the data will be available.
Everything I’ve said is super low likelihood, but the difficulty with training data is that you can’t spend your way out of the problem in the future. The time to prepare is now.
It’s a corporate problem and a national security problem
I’ve been speaking from the perspective of national interests, but this is equally a lever for one trillion-dollar-market-cap corporate against another.
OpenStreetMap for facts
Coming back to GPS, somebody who realised the importance of mapping data very, very early was Steve Coast and in 2004 he founded OpenStreetMap (Wikipedia). OSM is the free, contributor-based mapping layer that - I understand - kept both Microsoft and Apple in the mapping game, and prevented mapping from becoming a Google monopoly.
ASIDE #1. Shout out to fellow participants of the locative media movement and anyone who remembers Ben Russell’s stunning headmap manifesto (PDF) from 1999. AI desperately needs this analysis of possibilities and power.
ASIDE #2. I often come back to mapping as an analogy for large language models. There are probably half a dozen global maps in existence. I don’t know how much they cost, but let’s guess a billion to create and a billion a year to maintain, order of magnitude. A top class AI model is probably the same, all in. So we can expect similar dynamics.
OpenStreetMap was the bulwark we needed then.
Today what we need is probably something different. Not something open but - perhaps - something closed.
We need the librarians
The future needs trusted, uncontaminated, complete training data.
From the point of view of national interests, each country (or each trading bloc) will need its own training data, as a reserve, and a hedge against the interests of others.
Probably the best way to start is to take a snapshot of the internet and keep it somewhere really safe. We can sift through it later; the world’s data will never be more available or less contaminated than it is today. Like when GitHub stored all public code in an Arctic vault (02/02/2020): a very-long-term archival facility 250 meters deep in the permafrost of an Arctic mountain.
Or the Svalbard Global Seed Vault.
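To make that concrete, here’s a minimal sketch of the smallest possible version of a snapshot: content-addressed storage plus a provenance manifest. The fact-reserve directory and the manifest format are my inventions for illustration; a real programme would use WARC files, a proper crawler, and far richer metadata.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib.request import urlopen

ARCHIVE = Path("fact-reserve")              # hypothetical archive root
(ARCHIVE / "blobs").mkdir(parents=True, exist_ok=True)

def snapshot(url: str) -> dict:
    """Fetch one URL and store it content-addressed, with provenance."""
    raw = urlopen(url, timeout=30).read()
    digest = hashlib.sha256(raw).hexdigest()
    blob = ARCHIVE / "blobs" / digest
    if not blob.exists():                   # write once; the reserve is append-only
        blob.write_bytes(raw)
        blob.chmod(0o444)                   # read-only from the moment of capture
    return {
        "url": url,
        "sha256": digest,
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

# Append each record to a manifest so the snapshot can be audited later.
with (ARCHIVE / "manifest.jsonl").open("a") as manifest:
    for record in map(snapshot, ["https://example.com/"]):
        manifest.write(json.dumps(record) + "\n")
```

The hash buys you fixity: decades from now you can prove the bytes haven’t drifted, which matters when “uncontaminated” is the whole point.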
But actually I think this is a job for librarians and archivists.
What we need is a long-term national programme to slowly, carefully accept digital data into a read-only archive. We need the expertise of librarians, archivists and museums in the careful and deliberate process of acquisition and accessioning (PDF).
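For flavour, here’s roughly what an accession record might carry, borrowing the concerns of archival practice: provenance, rights, fixity. The field names are illustrative guesses, not any real standard.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)  # frozen: an accession record is never edited, only superseded
class AccessionRecord:
    accession_id: str                   # e.g. "2030-0001", assigned at intake
    source: str                         # provenance: donor, crawl, publisher...
    description: str                    # what it is, in the archivist's own words
    sha256: str                         # fixity: detect corruption or tampering later
    rights: str                         # the legal basis for holding and using it
    restrictions: tuple[str, ...] = ()  # e.g. embargoes, access conditions

def check_fixity(record: AccessionRecord, archive_root: Path) -> bool:
    """Re-hash the stored bytes and compare against the accession record."""
    blob = archive_root / "blobs" / record.sha256
    return hashlib.sha256(blob.read_bytes()).hexdigest() == record.sha256
```

A frozen record plus a periodic fixity check is the digital version of what archivists already do: document where a thing came from, then keep proving it’s still the same thing.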
(Look, if this is an excuse for governments to funnel money to the cultural sector then so much the better.)
It should start today.
SEE ALSO: I hope libraries managed to get a snapshot of GPT-3 (2022).