Lares smart home assistant: A toy AI agent demonstrating emergent behavior

Matt Webb
interconnected.org (homepage)
Acts Not Facts (studio)
matt@interconnected.org

Abstract

Lares is a simulation of a smart home assistant, powered by a simple AI agent.

It exhibits problem solving abilities, despite its simplicity.

New AI agents are being developed every day. My goal with this paper is to show that basic agents with surprisingly sophisticated emergent traits are straightforward to build with comparatively little code. (This paper is not aimed at engineers who are already working with agents.)

v1 DRAFT - 4 June 2024

tl;dr

An overview and details of the technical implementation make up the bulk of the text.

If you want the highlights:

Overview

Lares is an AI agent operating in a simulation of a smart home.

This section summarises what we'll see in the demos.

Scenario

Lares is based around a toy simulation of a smart home. The home is only a simulation; the agent itself is not simulated, it is a real AI agent. The Lares agent interacts with the smart home simulation via an API and takes instructions via a web UI.

A graphic of three boxes: 'Web UI', which connects to 'Agent', which connects to 'Smart Home Simulation'. The simulation box contains an API box and a State box.
Conceptual architecture

Demo UI

When the demo runs, the interaction is via this web UI.

A browser window showing a website. The left column, titled World, shows blue and yellow boxes named Hallway, Bedroom, and so on. The right column, titled Agent, has a text input field.
User interface

On the left is the live state of the simulated smart home: its rooms, its devices, and their current states.

On the right is the agent interface itself. The user can instruct the agent here and, when running, a transcript of agent actions is shown.

History

I built the first version of Lares in April 2023 at the London AI Hackathon and collaborated with Campbell Orme on user experience and comms. We won two of the five hackathon awards, from Google DeepMind and Amazon AWS.

It served as a proof of concept but didn't operate reliably.

This new version is based on OpenAI's function calling and their GPT-4o model, rather than GPT-3 and the text-based ReAct pattern, and I'm sharing the code for the first time. (See Technical Implementation, below.)

Demos

Play the videos below to see the agent in action.

Simple instruction following

We instruct the agent to turn on the light in the kitchen.

It figures out the correct function to call to toggle the light.

"turn on the light in the kitchen"

Transcript

The agent...

Problem solving

We instruct the agent to turn on the light for my dog.

The agent doesn't know where the dog is, so it has to look via the robot, moving it into various rooms. However, it realises that the robot can't see in the dark, so it turns on the lights. This behaviour is not pre-programmed.

"turn on the light for my dog"

Transcript

The agent...

On some runs I have observed the agent attempt to use lookWithRobot without turning on the lights, at which point it receives an error message which reads: "It's too dark to see anything." This is enough to prompt the agent to toggle the lights before looking again.

Agent-initiated user interaction

We instruct the agent to set the light in the office depending on if it's daytime.

It doesn't know whether it's daytime, so it asks the user a clarifying question. Then it decides to turn off the light because there's daylight outside. (I wish everyone in my home did the same.)

"set the light in the office depending on if it's daytime"

Transcript

The agent...

Technical Implementation

What makes an agent an agent is:

  1. Tool use. The AI replies using specially structured text that is interpreted and executed by the agent system ("function calls" or "tools") instead of prose aimed at the user.
  2. Autonomy. The AI is in charge of deciding what to do next. It loops, calling functions, until it decides to halt.

The emergent behavior, such as problem solving, comes from the large language model (the AI) making use of the functions.

The following sections run through the implementation in detail. If you've developed agents before, jump ahead to two tactics that increase reliability: Separating Intent and Action and Managing World Knowledge.

Code

The code can be viewed or downloaded from this GitHub repo: github.com/genmon/lares 3

app/
The user interface is a client-side React app.
party/assistant.ts
The agent runs as a PartyKit 4 server for coordination. (PartyKit is a framework for writing tiny stateful servers, ideal for agent orchestration.) It makes use of the OpenAI Chat Completion API to produce function calls. It communicates with the UI via WebSockets. It interfaces with the smart home simulation via an HTTP API.
party/server.ts
The smart home simulation runs as a PartyKit server that maintains a hidden world state, and offers an HTTP API to query and act on the sim.
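
For orientation, here is a minimal sketch of how a PartyKit server holding the hidden world state and serving an HTTP API might be shaped. It follows PartyKit's standard server interface, but the state shape, the route, and the handler logic are illustrative assumptions rather than the actual code in party/server.ts.

```ts
// Hypothetical sketch of the smart home sim as a PartyKit server.
// Not the repo's code: the state shape and the /toggle-light route are assumed.
import type * as Party from "partykit/server";

type WorldState = {
  lights: Record<string, "on" | "off">;
  robotRoom: string;
  actors: Record<string, string>; // hidden from the agent, e.g. { dog: "bedroom" }
};

export default class SmartHomeSim implements Party.Server {
  state: WorldState = {
    lights: { "kitchen-light": "off", "lounge-light": "off" },
    robotRoom: "hallway",
    actors: { dog: "bedroom" },
  };

  constructor(readonly room: Party.Room) {}

  // The HTTP API used by the agent: act on the sim, or query agent-visible state.
  async onRequest(req: Party.Request) {
    const url = new URL(req.url);
    if (req.method === "POST" && url.pathname.endsWith("/toggle-light")) {
      const { lightId } = (await req.json()) as { lightId: string };
      const next = this.state.lights[lightId] === "on" ? "off" : "on";
      this.state.lights[lightId] = next;
      return Response.json({ success: `Light with id <${lightId}> is now: ${next}` });
    }
    // Default: return the agent-visible state only (no moveable actors).
    const { lights, robotRoom } = this.state;
    return Response.json({ lights, robotRoom });
  }
}
```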

Agent Lifecycle

The user gives the agent an instruction.

A single turn of the loop consists of these steps:

  1. Setup. A transcript is prepared including:
    • The user instruction
    • The agent's up-to-date world knowledge (it knows which lights are on and off, but not the location of non-networked moveable actors, such as the dog)
    • Historic function calls and return values, over the current loop
  2. Intent. The agent is asked to call decideBestFunction to choose which function to call.
  3. Action. The agent is asked for parameters to call its chosen function.

If the chosen function was halt, the agent stops and replies to the user.

If the chosen function was anything else, the function is dispatched, the return value added to the transcript, and the loop repeats.
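
To make this concrete, here is a minimal sketch of the loop. The helper names are illustrative stand-ins; the real implementation is in party/assistant.ts and differs in detail.

```ts
// A sketch of the agent lifecycle, not the repo's code.
type FunctionCall = { name: string; arguments: Record<string, unknown> };
type Turn = { call: FunctionCall; result: string };

// Stand-ins for the real helpers: building the prompt, asking the model for
// intent and then action, and dispatching calls against the sim's HTTP API.
declare function getWorldKnowledge(): Promise<object>;
declare function buildTranscript(instruction: string, world: object, history: Turn[]): object[];
declare function chooseIntent(transcript: object[]): Promise<{ functionName: string; reasoning: string }>;
declare function completeAction(transcript: object[], functionName: string, reasoning: string): Promise<FunctionCall>;
declare function dispatch(call: FunctionCall): Promise<string>;

async function runAgent(instruction: string, maxTurns = 10): Promise<string> {
  const history: Turn[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    // 1. Setup: instruction + up-to-date world knowledge + this run's history.
    const transcript = buildTranscript(instruction, await getWorldKnowledge(), history);
    // 2. Intent: the model calls decideBestFunction to pick the next function.
    const { functionName, reasoning } = await chooseIntent(transcript);
    // 3. Action: the model supplies parameters for the chosen function only.
    const call = await completeAction(transcript, functionName, reasoning);
    if (functionName === "halt") {
      // Assumed: the halt call carries the agent's reply to the user.
      return String(call.arguments.message ?? "Done.");
    }
    // Dispatch against the smart home API, record the result, and loop.
    history.push({ call, result: await dispatch(call) });
  }
  return "Stopped: turn limit reached.";
}
```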

OpenAI API and Prior Art

The concept of a chat transcript is standard in OpenAI's Chat Completions API 5.

Function calling is also standard in OpenAI's API, including the convention that the runtime handles the function call and appends the function's return value to the transcript. See OpenAI's Function Calling guide 6. Function definitions, calls, and returns are encoded using JSON.
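
For example, a function such as toggleLight might be declared for the Chat Completions API as follows. The field names follow OpenAI's documented tools format; the description text is illustrative rather than copied from the repo.

```ts
// Illustrative tool declaration in OpenAI's Chat Completions "tools" format.
const toggleLightTool = {
  type: "function",
  function: {
    name: "toggleLight",
    description: "Toggle a light on or off. Returns the light's new state.",
    parameters: {
      type: "object",
      properties: {
        lightId: {
          type: "string",
          description: "The id of the light, e.g. lounge-light",
        },
      },
      required: ["lightId"],
    },
  },
} as const;
```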

This demo does not use OpenAI's Assistants API 7, as it is simpler to use Chat Completions directly to match our desired agent lifecycle.

Function calling is a formalisation of the ReAct pattern 8, published by Google Research and the Princeton NLP Group in November 2022.

Available Functions

Function calls are how the agent interacts with its runtime system, and acts in the world. The result of the function call is returned to the AI.

The agent loop for Lares understands six functions.

(If you're looking at the code, these functions are declared in party/functions.ts.)
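
Going by the demos and transcripts in this paper, the set looks roughly like the sketch below. The exact names and signatures are in party/functions.ts; the name of the clarifying-question function is my assumption.

```ts
// Approximate function set, reconstructed from this paper's demos.
// Names marked "assumed" are guesses, not taken from party/functions.ts.
const agentFunctions = [
  "decideBestFunction", // intent step: choose which function to call next
  "toggleLight",        // turn a light on or off
  "moveRobot",          // move the robot into a room
  "lookWithRobot",      // report what the robot can see (fails in the dark)
  "askUser",            // assumed name: ask the user a clarifying question
  "halt",               // stop the loop and reply to the user
] as const;
```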

Constructing the Transcript

On each turn of the loop, a transcript is constructed. The transcript is the prompt for the AI.

(If you're looking at the code, this is done in the agent code in party/assistant.ts.)

role content Notes
system You are a helpful AI assistant controlling a smart home. When the user refers to 'you' they either mean the entire home or the moveable robot that you control. You always know the layout of the home, the devices in it (which includes the robot), and their current states. You do not know the position of moveable items such as people, animals, and objects that might be carried, and you do not even know their names initially. You can build your knowledge by using various devices. You call functions to control the devices in the home.
system The current state of the home follows. The state of devices is up to date with your most recent functions. Consult the transcript for any other discoveries you have made:

{ … }
See below.
system The user's instruction follows. Your goal is to fulfil it to the best of your ability. It may take a sequence of many instructions to achieve your goal, and you may have to deliberately build knowledge so you know enough to reach the goal. At each step, call the best function to move you closer to your goal. When you're done, call the halt function.
user Instruction: ... e.g. Instruction: turn on the kitchen light
The top of the prompt as given to the AI model.

The history of interactions is then appended.

For example, for the instruction turn on the light in the lounge, the first loop may result in the following addition to the transcript.

role content or tool_calls Notes
assistant I have decided to call the function toggleLight because: To turn on the light in the lounge, I can directly use the toggleLight function since the room and light states are known.
assistant [ { "id": "call_RxEcwvfYALQv0nTsrhOtDhNT", "type": "function", "function": { "name": "toggleLight", "arguments": "{\"lightId\":\"lounge-light\"}" } } ]
tool { "success": "Light with id <lounge-light> in room with id <lounge> is now: on" }
The function call history is appended to the standard prompt.
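
In code, constructing the transcript amounts to assembling a fresh messages array on every turn: the fixed system prompts, the current world knowledge, the instruction, then this run's function-call history. A sketch, with the message text abridged and types simplified:

```ts
// Simplified message type; the real code uses the OpenAI SDK's types.
type Message =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content?: string; tool_calls?: unknown[] }
  | { role: "tool"; tool_call_id: string; content: string };

function buildTranscript(instruction: string, worldKnowledge: object, history: Message[]): Message[] {
  return [
    { role: "system", content: "You are a helpful AI assistant controlling a smart home. ..." },
    {
      role: "system",
      content:
        "The current state of the home follows. ...\n\n" + JSON.stringify(worldKnowledge, null, 2),
    },
    {
      role: "system",
      content: "The user's instruction follows. ... When you're done, call the halt function.",
    },
    { role: "user", content: `Instruction: ${instruction}` },
    // Historic function calls and tool results from earlier turns of this run:
    ...history,
  ];
}
```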

Separating Intent and Action Increases Reliability

The agent chooses which function to use by first being asked to call the function decideBestFunction.

The following messages are temporarily added to the transcript.

role content Notes
system You now have the opportunity to decide how best to respond. Choose which function will be best.
user Which of these functions will best move you closer to your goal?

- [function name] — [function description]
The five action functions are listed here

Recall that decideBestFunction has two parameters:

The reasoning parameter is an opportunity for chain-of-thought 9 (a series of intermediate reasoning steps), which improves reliability.
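
As a sketch, the declaration of decideBestFunction might look like the block below. The reasoning parameter is as described above; the name of the second parameter and the enum of action functions are my assumptions (the real declaration is in party/functions.ts).

```ts
// Sketch of how decideBestFunction might be declared; parameter names
// other than "reasoning" are assumed, not taken from the repo.
const decideBestFunctionTool = {
  type: "function",
  function: {
    name: "decideBestFunction",
    description: "Decide which function will best move you closer to your goal.",
    parameters: {
      type: "object",
      properties: {
        reasoning: {
          type: "string",
          description: "Step-by-step reasoning for why this function is the best next move.",
        },
        bestFunction: {
          type: "string",
          description: "The name of the chosen function.",
          enum: ["toggleLight", "moveRobot", "lookWithRobot", "askUser", "halt"],
        },
      },
      required: ["reasoning", "bestFunction"],
    },
  },
} as const;
```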

There is no return value for the intent function call. The call is not added to the transcript. This is to avoid distracting the agent in subsequent steps.

Instead a summary is added to the transcript, and a prompt to call the chosen action function.

role content Notes
assistant I have decided to call the function [name] because: [reasoning]
system Call the available function to move closer to your goal. You may have to add arguments. Only use a function call. Do not reply with a text message.

Only a single action function is provided (e.g. moveRobot). The agent is not allowed to decide between tools at this point, only to complete the parameters.

Decoupling intent and action appears to improve reliability.
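
In terms of the Chat Completions API, the decoupling can be expressed as two calls per turn, each using tool_choice to force a single function. The sketch below is under those assumptions and is not the repo's code; the model name and the bestFunction parameter name follow the earlier sketch.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

type Tool = OpenAI.Chat.Completions.ChatCompletionTool;
type Msg = OpenAI.Chat.Completions.ChatCompletionMessageParam;

// One turn: an intent call (decideBestFunction only), then an action call
// (only the chosen function offered, so the model just fills in parameters).
async function intentThenAction(messages: Msg[], allTools: Tool[], toolByName: (name: string) => Tool) {
  const intent = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools: allTools,
    tool_choice: { type: "function", function: { name: "decideBestFunction" } },
  });
  const { reasoning, bestFunction } = JSON.parse(
    intent.choices[0].message.tool_calls![0].function.arguments
  ) as { reasoning: string; bestFunction: string };

  const action = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...messages,
      { role: "assistant", content: `I have decided to call the function ${bestFunction} because: ${reasoning}` },
      { role: "system", content: "Call the available function to move closer to your goal. Only use a function call. Do not reply with a text message." },
    ],
    tools: [toolByName(bestFunction)],
    tool_choice: "required", // the model must call the offered function
  });

  return { bestFunction, reasoning, actionCall: action.choices[0].message.tool_calls![0] };
}
```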

World Knowledge is Managed Outside the Transcript, Improving Reliability

The transcript in each turn of the loop is not a historic transcript. Although it resembles a chat transcript (which is append-only), it is dynamically constructed.

In particular, the second message is the current state of the home.

This is:

Specifically: the hidden world state includes moveable items (like people and animals), whereas the world knowledge does not.

If you're running the code, compare:

Supplying the latest state increases reliability, although it does sometimes confuse the agent regarding the order of historic events (e.g. it may say the light in the office was already on when it had turned on the light itself).
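
One way to picture the split: the simulation holds the full hidden state, and the world knowledge supplied to the agent is a projection of it with moveable actors removed. A sketch with illustrative field names:

```ts
// Hidden state kept by the simulation vs. the knowledge given to the agent.
// Field names are illustrative, not the repo's.
type HiddenWorldState = {
  rooms: string[];
  lights: Record<string, { room: string; on: boolean }>;
  robotRoom: string;
  actors: Record<string, { room: string }>; // e.g. the dog; never shown to the agent
};

type AgentWorldKnowledge = Omit<HiddenWorldState, "actors">;

// Produces the JSON injected into the transcript's second system message each turn.
function projectForAgent(state: HiddenWorldState): AgentWorldKnowledge {
  const { actors, ...visible } = state;
  return visible;
}
```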

Potential Applications

The agent approach is widely applicable.

Even this basic approach would be suitable for:

Observations and Speculations

Considering the potential applications, what I've learned with this Lares demo, and the general applicability of agentive systems, I have a number of observations.

Systems will need an agent-facing API.

Whereas some APIs are fine-grained to allow for a front-end UI, and some APIs are terse and machine-readable for integration with other systems, an agent-facing API has specific requirements.

This is sufficiently different from regular APIs that agent-facing APIs will need to be designed and maintained specifically.

Agent runtimes should provide agent abstractions

Chat is not a good abstraction for agents as, although it allows for function calling, it is optimised for an append-only transcript.

An agent-focused system would have abstractions for:

Also:

Non-JSON encoding. Encoded in JSON, function calls only decode cleanly when complete. This means it is not straightforward to use function calls to stream messages to the user.

Ideally this would move to high-performance structured generation 11 (the Python library outlines 12 achieves this by managing next-token probabilities according to a specified grammar).

Embedded agents lead to new demands on technology and UI

The simplicity of the agent approach means that devices such as stoves, lamps, and timers could have sophisticated interfaces without being connected to the internet.

It could even include voice control 13.

It provokes some interesting routes for research:

Pursuing agents in multiple scenarios will reveal more technical and UX challenges.

Function marketplaces and standards

Because of the ease of implementing agents, we may anticipate a future where small agents are widespread, rather than single, monolithic agents.

With a proliferation of agents, the challenge will be in coordination and adapting to new environments, and in implementing and discovering functions.

For example, how does my personal AI pendant (with a built-in agent) know how to turn on the TV in an Airbnb that I haven't previously visited? If I ask my HVAC to prepare for tomorrow's weather, what function should it use to fetch the forecast?

We will need search engines or marketplaces for agent functions, and standards for calling conventions and agent-readable function documentation.

Consider the search engine or marketplace that returns the most appropriate TV remote control function for the Airbnb I'm staying in. This will have to consider signals such as the device model, the interfaces I've used before, what's been used historically by people in this location, and so on. How can these signals be shared and trained against in a privacy-preserving fashion?

I have discussed Marketplaces for AI agents 14 in detail elsewhere. One search engine embodiment is described in a patent: Shared Experiences 15.

What is the origin of the observed emergent problem solving abilities?

Given the simplicity of the agent lifecycle code and functions, the emergent abilities of the agent must arise from the large language model itself. This provokes several research questions:

For example, on the strategy question, the problem solving strategy right now is a basic step by step "hill climbing" approach. Could the agent be encouraged to make plans to look further ahead, or test the feasibility of different approaches before engaging?

Conclusions

Even simple agents can exhibit emergent behavior such as basic problem solving. There is very little code in Lares outside the large language model itself. There are many sophisticated AI agents available today, and the simplicity of the code implies that many more will be developed.

So we can anticipate a future proliferation of agents. This has implications and opportunities: for user experience, standards/interop, underlying technology platforms, and business.

Acknowledgements

Thank you PartyKit 4 for supporting this work.

References

1. Lares: AI-powered smart home at the London Generative AI Hackathon, Matt Webb, Acts Not Facts (April 2023)
2. Unpacking Lares: our 2 minute pitch for an AI-powered slightly-smart home, Matt Webb, Interconnected (April 2023)
3. Lares: A simple AI agent controlling a simulation of a smart home, Matt Webb, GitHub
4. PartyKit
5. Chat Completions API documentation, OpenAI
6. Function Calling guide, OpenAI
7. Assistants API documentation, OpenAI
8. ReAct: Synergizing Reasoning and Acting in Language Models, Google Research (November 2022)
9. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Google Research (January 2022)
10. Home Assistant
11. How fast can grammar-structured generation be?, .txt blog (May 2024)
12. Outlines: Robust prompting & (structured) text generation
13. How I would put voice control in everything, Matt Webb, Interconnected (May 2020)
14. Who will build new search engines for new personal AI agents?, Matt Webb, Interconnected (March 2024)
15. Shared Experiences (WO2018164781A1), Google Patents (January 2018)