Lares is a simulation of a smart home assistant, powered by a simple AI agent.
It exhibits problem solving abilities, despite its simplicity.
New AI agents are being developed every day. I aim to show that basic agents with surprisingly sophisticated emergent traits are straightforward to build with comparatively little code. (This paper is not aimed at engineers who are already working with agents.)
v1 DRAFT - 4 June 2024
An overview and details of the technical implementation make up the bulk of the text.
If you want the highlights:
Lares is an AI agent operating in a simulation of a smart home.
This section summarises what we'll see in the demos.
Lares is based around a toy simulation of a smart home. The home is simulated, but the agent is not: it is a real AI agent. The agent interacts with the smart home simulation via an API, and takes instructions via a web UI.
When the demo runs, the interaction is via this web UI.
On the left is the live state of the simulated smart home:
On the right is the agent interface itself. The user can instruct the agent here and, when running, a transcript of agent actions is shown.
I built the first version of Lares in April 2023 at the London AI Hackathon and collaborated with Campbell Orme on user experience and comms. We won two of the five hackathon awards, from Google DeepMind and Amazon AWS.
It served as a proof of concept but didn't operate reliably.
This new version is based on OpenAI's function calling functionality and their GPT-4o model, rather than GPT-3 and the text-based ReAct pattern, and I'm sharing the code for the first time. (See Technical Implementation, below.)
Play the videos below to see the agent in action.
We instruct the agent to turn on the light in the kitchen.
It figures out the correct function to call to toggle the light.
The agent...
- calls toggleLight with the light ID kitchen-light, and receives a success response
- calls halt with the message: The light in the kitchen is now on
We instruct the agent to turn on the light for my dog.
The agent doesn't know where the dog is, so it has to look with the robot, moving it from room to room. However, it realises that the robot can't see in the dark, so it turns on the lights as it goes. This behaviour is not pre-programmed.
The agent...
- calls lookWithRobot and receives the contents of the Hallway
- calls moveRobot to the Kitchen
- calls lookWithRobot and receives the contents of the Kitchen
- calls moveRobot to the Hallway
- calls moveRobot to the Lounge
- calls toggleLight with the light ID lounge-light
- calls lookWithRobot and receives the contents of the Lounge
- calls moveRobot to the Office
- calls toggleLight with the light ID office-light
- calls lookWithRobot and receives the contents of the Office (which includes Sally, a person)
- calls moveRobot to the Hallway, and receives an error as the Hallway is not adjacent to the Office
- calls moveRobot to the Lounge
- calls moveRobot to the Hallway
- calls moveRobot to the Bedroom
- calls toggleLight with the light ID bedroom-light
- calls lookWithRobot and receives the contents of the Bedroom (which includes Bruno, a dog)
- calls halt with the message: The bedroom light is now on for your dog, Bruno.
On some runs I have observed the agent attempt to use lookWithRobot without turning on the lights, at which point it receives an error message which reads: It's too dark to see anything. This is enough to prompt the agent to toggle the lights before looking again.
We instruct the agent to set the light in the office depending on whether it's daytime.
It doesn't know whether it's daytime, so it asks the user a clarifying question. Then it decides to turn off the light because there's daylight outside. (I wish everyone in my home did the same.)
The agent...
- calls askUser with the question Is it currently daytime? (The user responds: yes.)
- calls toggleLight with the light ID office-light, turning it off
- calls halt with the message to the user: The light in the office has been turned off because it is daytime.
What makes an agent an agent is a large language model running in a loop, calling functions to act in the world. The emergent behavior, such as problem solving, comes from the large language model (the AI) making use of the functions.
The following sections run through the implementation in detail. If you've developed agents before, jump ahead to two tactics that increase reliability: Separating Intent and Action and Managing World Knowledge.
The code can be viewed or downloaded from this GitHub repo: github.com/genmon/lares 3
The main files to look at:
- app/
- party/assistant.ts
- party/server.ts
The user gives the agent an instruction.
A single turn of the loop consists of these steps:
- The agent calls decideBestFunction to choose which function to call.
- If the chosen function was halt, the agent stops and replies to the user.
- If the chosen function was anything else, the function is dispatched, the return value is added to the transcript, and the loop repeats.
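As a sketch only (the helper names below are illustrative, not the actual exports of party/assistant.ts), the loop can be expressed in a few lines of TypeScript:

```typescript
// A minimal sketch of the agent loop. The helper names are illustrative,
// standing in for the real transcript construction, model call, and
// function dispatch described in the following sections.

type FunctionCall = { name: string; args: Record<string, unknown> };

declare function buildTranscript(instruction: string, history: string[]): unknown;
declare function decideBestFunction(transcript: unknown): Promise<FunctionCall>;
declare function dispatchFunction(call: FunctionCall): Promise<string>;

async function runAgent(instruction: string): Promise<string> {
  const history: string[] = [];

  while (true) {
    // The transcript is rebuilt from scratch on every turn (see Managing World Knowledge).
    const transcript = buildTranscript(instruction, history);

    // Step 1: ask the model which function to call next.
    const call = await decideBestFunction(transcript);

    // Step 2: if it chose halt, stop and reply to the user.
    if (call.name === "halt") {
      return String(call.args.messageToUser);
    }

    // Step 3: otherwise dispatch the function, record its return value, and loop.
    const result = await dispatchFunction(call);
    history.push(`${call.name} returned: ${result}`);
  }
}
```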
The concept of a chat transcript is standard in OpenAI's Chat Completions API 5.
Function calling is also standard in OpenAI's API, including the convention that the runtime handles the function call and appends the function's return value to the transcript. See OpenAI's Function Calling guide 6. Function definitions, calls, and returns are encoded using JSON.
This demo does not use OpenAI's Assistants API 7; it is simpler to use the Chat Completions API to match our desired agent lifecycle.
Function calling is a formalisation of the ReAct pattern 8 by Google Research and the Princeton NLP Group, published in November 2022.
Function calls are how the agent interacts with its runtime system, and acts in the world. The result of the function call is returned to the AI.
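For readers new to function calling, here is a hedged sketch of that round trip using the OpenAI Node SDK. The model choice and the execute handler are assumptions for illustration, not code from the Lares repo:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Illustrative only: ask the model for a function call, execute it in the
// runtime, then append the JSON-encoded return value to the transcript.
async function oneFunctionCallRoundTrip(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  tools: OpenAI.Chat.ChatCompletionTool[],
  execute: (name: string, args: Record<string, unknown>) => Promise<unknown>
) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });

  const message = response.choices[0].message;

  // The assistant's function call joins the transcript...
  messages.push({
    role: "assistant",
    content: message.content,
    tool_calls: message.tool_calls,
  });

  // ...and the runtime appends each function's return value as a `tool` message.
  for (const toolCall of message.tool_calls ?? []) {
    const args = JSON.parse(toolCall.function.arguments);
    const result = await execute(toolCall.function.name, args);

    messages.push({
      role: "tool",
      tool_call_id: toolCall.id,
      content: JSON.stringify(result),
    });
  }

  return messages;
}
```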
The agent loop for Lares understands six functions.
(If you're looking at the code, these functions are declared in party/functions.ts.)
- toggleLight
  - lightId — ID of the light to toggle
- moveRobot
  - destinationRoomId — ID of the adjacent room to move the robot to
- lookWithRobot
- askUser
  - question — The question to ask the user
- halt
  - messageToUser — Report back to the user with a message. Never use this to ask for more information.
- decideBestFunction
  - reasoning — no description provided in the prompt
  - bestFunction — an enum of the above function names
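To illustrate the shape of these declarations (the real ones are in party/functions.ts and may differ in wording), two of them might be written as OpenAI tool schemas like this:

```typescript
import OpenAI from "openai";

// Illustrative tool schemas for two of the six functions; the real
// declarations live in party/functions.ts.
export const exampleTools: OpenAI.Chat.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "toggleLight",
      description: "Toggle a light in the home on or off.",
      parameters: {
        type: "object",
        properties: {
          lightId: { type: "string", description: "ID of the light to toggle" },
        },
        required: ["lightId"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "decideBestFunction",
      description: "Decide which function will best move you closer to your goal.",
      parameters: {
        type: "object",
        properties: {
          // No description is provided for reasoning in the prompt.
          reasoning: { type: "string" },
          bestFunction: {
            type: "string",
            enum: ["toggleLight", "moveRobot", "lookWithRobot", "askUser", "halt"],
          },
        },
        required: ["reasoning", "bestFunction"],
      },
    },
  },
];
```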
On each turn of the loop, a transcript is constructed. The transcript is the prompt for the AI.
(If you're looking at the code, this is done in the agent code in party/assistant.ts.)
| role | content | Notes |
| --- | --- | --- |
| system | You are a helpful AI assistant controlling a smart home. When the user refers to 'you' they either mean the entire home or the moveable robot that you control. You always know the layout of the home, the devices in it (which includes the robot), and their current states. You do not know the position of moveable items such as people, animals, and objects that might be carried, and you do not even know their names initially. You can build your knowledge by using various devices. You call functions to control the devices in the home. | |
| system | The current state of the home follows. The state of devices is up to date with your most recent functions. Consult the transcript for any other discoveries you have made: { … } | See below. |
| system | The user's instruction follows. Your goal is to fulfil it to the best of your ability. It may take a sequence of many instructions to achieve your goal, and you may have to deliberately build knowledge so you know enough to reach the goal. At each step, call the best function to move you closer to your goal. When you're done, call the halt function. | |
| user | Instruction: ... | e.g. Instruction: turn on the kitchen light |
The history of interactions is then appended.
For example, for the instruction turn on the light in the lounge, the first loop may result in the following addition to the transcript.
| role | content or tool_calls | Notes |
| --- | --- | --- |
| assistant | I have decided to call the function toggleLight because: To turn on the light in the lounge, I can directly use the toggleLight function since the room and light states are known. | |
| assistant | [ { "id": "call_RxEcwvfYALQv0nTsrhOtDhNT", "type": "function", "function": { "name": "toggleLight", "arguments": "{\"lightId\":\"lounge-light\"}" } } ] | |
| tool | { "success": "Light with id <lounge-light> in room with id <lounge> is now: on" } | |
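Putting the pieces together, the per-turn transcript construction might look roughly like the sketch below. The function name and signature are assumptions, and the full prompt strings are elided; the real code is in party/assistant.ts:

```typescript
import OpenAI from "openai";

// A sketch of assembling the transcript each turn: fixed system framing,
// current world knowledge, the user's instruction, then the accumulated
// history of function calls and returns.
function buildTranscript(
  instruction: string,
  worldKnowledge: unknown,
  history: OpenAI.Chat.ChatCompletionMessageParam[]
): OpenAI.Chat.ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: "You are a helpful AI assistant controlling a smart home. ...",
    },
    {
      role: "system",
      content:
        "The current state of the home follows. The state of devices is up to date with your most recent functions. Consult the transcript for any other discoveries you have made: " +
        JSON.stringify(worldKnowledge),
    },
    {
      role: "system",
      content: "The user's instruction follows. Your goal is to fulfil it to the best of your ability. ...",
    },
    { role: "user", content: `Instruction: ${instruction}` },
    // Assistant function calls and tool returns from earlier turns.
    ...history,
  ];
}
```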
The agent is asked which function to use by first being prompted to call decideBestFunction.
The following messages are temporarily added to the transcript.
| role | content | Notes |
| --- | --- | --- |
| system | You now have the opportunity to decide how best to respond. Choose which function will be best. | |
| user | Which of these functions will best move you closer to your goal? - [function name] — [function description] | The five action functions are listed here |
Recall that decideBestFunction has two parameters: reasoning and bestFunction.
The reasoning parameter is an opportunity for chain-of-thought 9 (a series of intermediate reasoning steps), which improves reliability.
There is no return value for the intent function call. The call is not added to the transcript. This is to avoid distracting the agent in subsequent steps.
Instead a summary is added to the transcript, and a prompt to call the chosen action function.
| role | content | Notes |
| --- | --- | --- |
| assistant | I have decided to call the function [name] because: [reasoning] | |
| system | Call the available function to move closer to your goal. You may have to add arguments. Only use a function call. Do not reply with a text message. | |
Only a single action function is provided (e.g. moveRobot). The agent is not allowed to decide between tools at this point, only to complete the parameters.
Decoupling intent and action appears to improve reliability.
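A sketch of this two-step pattern, assuming the OpenAI Node SDK and illustrative names (the real implementation is in party/assistant.ts), could look like this:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Sketch: first force a call to decideBestFunction, then offer only the
// chosen function so the model just completes its parameters.
async function decideThenAct(
  transcript: OpenAI.Chat.ChatCompletionMessageParam[],
  actionTools: OpenAI.Chat.ChatCompletionTool[],
  decideTool: OpenAI.Chat.ChatCompletionTool
) {
  // Step 1: intent. Only decideBestFunction is callable here; these extra
  // messages are temporary and are not kept in the transcript.
  const decision = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...transcript,
      { role: "system", content: "You now have the opportunity to decide how best to respond. Choose which function will be best." },
      { role: "user", content: "Which of these functions will best move you closer to your goal? ..." },
    ],
    tools: [decideTool],
    tool_choice: { type: "function", function: { name: "decideBestFunction" } },
  });

  const { reasoning, bestFunction } = JSON.parse(
    decision.choices[0].message.tool_calls![0].function.arguments
  );

  // Step 2: action. Only the chosen function is offered; a summary of the
  // decision and a prompt to act are added instead of the intent call itself.
  const chosenTool = actionTools.find((t) => t.function.name === bestFunction)!;
  const action = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...transcript,
      { role: "assistant", content: `I have decided to call the function ${bestFunction} because: ${reasoning}` },
      { role: "system", content: "Call the available function to move closer to your goal. You may have to add arguments. Only use a function call. Do not reply with a text message." },
    ],
    tools: [chosenTool],
    tool_choice: { type: "function", function: { name: bestFunction } },
  });

  return action.choices[0].message.tool_calls![0];
}
```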
The transcript in each turn of the loop is not a historic transcript. Although it resembles a chat transcript (which is append-only), it is dynamically constructed.
In particular, the second message is the current state of the home.
This is the agent's world knowledge rather than the full hidden world state. Specifically: the hidden world state includes moveable items (like people and animals), whereas the world knowledge does not.
If you're running the code, compare:
- http://127.0.0.1:1999/parties/main/default/debug
- http://127.0.0.1:1999/parties/main/default/world
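As an illustration of the split, assuming hypothetical type shapes rather than the actual ones in the repo:

```typescript
// Illustrative shapes only: the hidden world state tracks moveable items
// (people, animals, carried objects); the agent's world knowledge does not.
type HiddenWorldState = {
  rooms: { id: string; adjacentRooms: string[] }[];
  lights: { id: string; roomId: string; on: boolean }[];
  robot: { roomId: string };
  moveableItems: { name: string; kind: "person" | "animal" | "object"; roomId: string }[];
};

type WorldKnowledge = Omit<HiddenWorldState, "moveableItems">;

// The second system message in the transcript is built from this projection,
// never from the full hidden state.
function toWorldKnowledge(state: HiddenWorldState): WorldKnowledge {
  return {
    rooms: state.rooms,
    lights: state.lights,
    robot: state.robot,
  };
}
```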
Supplying the latest state increases reliability, although it does sometimes confuse the agent regarding the order of historic events (e.g. it may say the light in the office was already on when it had turned on the light itself).
The agent approach is widely applicable.
Even this basic approach would be suitable for a real smart home, with an askUser-style opportunity for the agent to disambiguate instructions. A good place to start would be a new interface for an open source assistant such as the Home Assistant 10 app and framework.

Considering the potential applications, what I've learned with this Lares demo, and the general applicability of agentive systems, I have a number of observations.
Whereas some APIs are fine-grained to allow for a front-end UI, and some are terse and machine-readable for integration with other systems, an agent-facing API has specific requirements.
These are sufficiently different from regular APIs that agent-facing APIs will need to be designed and maintained specifically.
Chat is not a good abstraction for agents as, although it allows for function calling, it is optimised for an append-only transcript.
An agent-focused system would have abstractions for:
Also: non-JSON encoding. Encoded in JSON, function calls only decode cleanly when complete. This means it is not straightforward to use function calls to stream messages to the user.
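To illustrate the point with a snippet that is not from the repo:

```typescript
// A function call arrives token by token; its JSON arguments are incomplete
// until the very end, so they cannot be decoded (or shown to the user) early.
const partialArguments = '{"messageToUser":"The light in the kitch';

try {
  JSON.parse(partialArguments);
} catch (err) {
  console.log("Not decodable until complete:", (err as Error).message);
}
```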
Ideally this would move to high-performance structured generation 11 (the Python library outlines 12 achieves this by managing the next-token probability according to a specified grammar).
The simplicity of the agent approach means that devices such as stoves, lamps, and timers could have sophisticated interfaces without being connected to the internet.
It could even include voice control 13.
It provokes some interesting routes for research:
Pursuing agents in multiple scenarios will reveal more technical and UX challenges.
Because of the ease of implementing agents, we may anticipate a future where small agents are widespread, rather than single, monolithic agents.
With a proliferation of agents, the challenge will be in coordination and adapting to new environments, and in implementing and discovering functions.
For example, how does my personal AI pendant (with a built-in agent) know how to turn on the TV in an Airbnb that I haven't previously visited? If I ask my HVAC to prepare for tomorrow's weather, what function should it use to fetch the forecast?
We will need search engines or marketplaces for agent functions, and standards for calling conventions and agent-readable function documentation.
Consider the search engine or marketplace that returns the most appropriate TV remote control function for the Airbnb I'm staying in. It will have to consider signals such as the device model, the interfaces I've used before, what's been used historically by people in this location, and so on. How can these signals be shared and trained against in a privacy-preserving fashion?
I have discussed Marketplaces for AI agents 14 in detail elsewhere. One search engine embodiment is described in a patent: Shared Experiences 15.
Given the simplicity of the agent lifecycle code and functions, the emergent abilities of the agent must arise from the large language model itself. This provokes several research questions:
For example, on the strategy question, the problem solving strategy right now is a basic, step-by-step "hill climbing" approach. Could the agent be encouraged to make plans that look further ahead, or to test the feasibility of different approaches before engaging?
Even simple agents can exhibit emergent behavior such as basic problem solving. There is very little code in Lares outside the large language model itself. There are many sophisticated AI agents available today, and the simplicity of the code implies that many more will be developed.
So we can anticipate a future proliferation of agents. This has implications and opportunities: for user experience, standards/interop, underlying technology platforms, and business.
Thank you PartyKit 4 for supporting this work.