Lares is a simulation of a smart home assistant, powered by a simple AI agent.
It exhibits problem solving abilities, despite its simplicity.
New AI agents are being developed every day. I aim to show that basic agents with surprisingly sophisticated emergent traits are straightforward to build with comparatively little code. (This paper is not aimed at engineers who are already working with agents.)
v1 DRAFT - 4 June 2024
An overview and details of the technical implementation make up the bulk of the text.
If you want the highlights:
Lares is an AI agent operating in a simulation of a smart home.
This section summarises what we'll see in the demos.
Lares is based around a toy simulation of a smart home. The home is simulated, but the agent is not: it is a real AI agent. The agent interacts with the smart home simulation via an API, and takes instructions via a web UI.
When the demo runs, the interaction is via this web UI.
On the left is the live state of the simulated smart home:
On the right is the agent interface itself. The user can instruct the agent here and, when running, a transcript of agent actions is shown.
I built the first version of Lares in April 2023 at the London AI Hackathon and collaborated with Campbell Orme on user experience and comms. We won two of the five hackathon awards, from Google DeepMind and Amazon AWS.
It served as a proof of concept but didn't operate reliably.
This new version is based on OpenAI's function calling functionality and their GPT-4o model, rather than GPT-3 and the text-based ReAct pattern, and I'm sharing the code for the first time. (See Technical Implementation, below.)
Play the videos below to see the agent in action.
We instruct the agent to turn on the light in the kitchen.
It figures out the correct function to call to toggle the light.
The agent...
- calls toggleLight with the light ID kitchen-light, and receives a success response
- calls halt with the message: The light in the kitchen is now on
We instruct the agent to turn on the light for my dog.
The agent doesn't know where the dog is, so it has to look with the robot, moving it from room to room. However, it realises that the robot can't see in the dark, so it turns on the lights as it goes. This behaviour is not pre-programmed.
The agent...
- calls lookWithRobot and receives the contents of the Hallway
- calls moveRobot to the Kitchen
- calls lookWithRobot and receives the contents of the Kitchen
- calls moveRobot to the Hallway
- calls moveRobot to the Lounge
- calls toggleLight with the light ID lounge-light
- calls lookWithRobot and receives the contents of the Lounge
- calls moveRobot to the Office
- calls toggleLight with the light ID office-light
- calls lookWithRobot and receives the contents of the Office (which includes Sally, a person)
- calls moveRobot to the Hallway, and receives an error as the Hallway is not adjacent to the Office
- calls moveRobot to the Lounge
- calls moveRobot to the Hallway
- calls moveRobot to the Bedroom
- calls toggleLight with the light ID bedroom-light
- calls lookWithRobot and receives the contents of the Bedroom (which includes Bruno, a dog)
- calls halt with the message: The bedroom light is now on for your dog, Bruno.
On some runs I have observed the agent attempt to use lookWithRobot without turning on the lights, at which point it receives an error message which reads: It's too dark to see anything. This is enough to prompt the agent to toggle the lights before looking again.
We instruct the agent to set the light in the office depending on whether it's daytime.
It doesn't know whether it's daytime, so it asks the user a clarifying question. Then it decides to turn off the light because there's daylight outside. (I wish everyone in my home did the same.)
The agent...
- calls askUser with the question Is it currently daytime? (The user responds: yes.)
- calls toggleLight with the light ID office-light, turning it off
- calls halt with the message to the user: The light in the office has been turned off because it is daytime.
What makes an agent an agent is a large language model running in a loop, calling functions to act in the world. The emergent behavior, such as problem solving, comes from the large language model (the AI) making use of the functions.
The following sections run through the implementation in detail. If you've developed agents before, jump ahead to two tactics that increase reliability: Separating Intent and Action and Managing World Knowledge.
The code can be viewed or downloaded from this GitHub repo: github.com/genmon/lares 3
The main files to look at:
- app/
- party/assistant.ts
- party/server.ts
The user gives the agent an instruction.
A single turn of the loop consists of these steps:
- The agent calls decideBestFunction to choose which function to call.
- If the chosen function was halt, the agent stops and replies to the user.
- If the chosen function was anything else, the function is dispatched, the return value is added to the transcript, and the loop repeats.
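As a sketch only (the helper names below are illustrative, not the actual exports of party/assistant.ts), the loop can be expressed in a few lines of TypeScript:

```typescript
// A minimal sketch of the agent loop. The helper names are illustrative,
// standing in for the real transcript construction, model call, and
// function dispatch described in the following sections.

type FunctionCall = { name: string; args: Record<string, unknown> };

declare function buildTranscript(instruction: string, history: string[]): unknown;
declare function decideBestFunction(transcript: unknown): Promise<FunctionCall>;
declare function dispatchFunction(call: FunctionCall): Promise<string>;

async function runAgent(instruction: string): Promise<string> {
  const history: string[] = [];

  while (true) {
    // The transcript is rebuilt from scratch on every turn (see Managing World Knowledge).
    const transcript = buildTranscript(instruction, history);

    // Step 1: ask the model which function to call next.
    const call = await decideBestFunction(transcript);

    // Step 2: if it chose halt, stop and reply to the user.
    if (call.name === "halt") {
      return String(call.args.messageToUser);
    }

    // Step 3: otherwise dispatch the function, record its return value, and loop.
    const result = await dispatchFunction(call);
    history.push(`${call.name} returned: ${result}`);
  }
}
```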
The concept of a chat transcript is standard in OpenAI's Chat Completions API 5.
Function calling is also standard in OpenAI's API, including the convention that the runtime handles the function call and appends the function's return value to the transcript. See OpenAI's Function Calling guide 6. Function definitions, calls, and returns are encoded using JSON.
This demo does not use OpenAI's Assistants API 7; it is simpler to use the Chat Completions API to match our desired agent lifecycle.
Function calling is a formalisation of the ReAct pattern 8 by Google Research and the Princeton NLP Group, published in November 2022.
Function calls are how the agent interacts with its runtime system, and acts in the world. The result of the function call is returned to the AI.
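For readers new to function calling, here is a hedged sketch of that round trip using the OpenAI Node SDK. The model choice and the execute handler are assumptions for illustration, not code from the Lares repo:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Illustrative only: ask the model for a function call, execute it in the
// runtime, then append the JSON-encoded return value to the transcript.
async function oneFunctionCallRoundTrip(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  tools: OpenAI.Chat.ChatCompletionTool[],
  execute: (name: string, args: Record<string, unknown>) => Promise<unknown>
) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });

  const message = response.choices[0].message;

  // The assistant's function call joins the transcript...
  messages.push({
    role: "assistant",
    content: message.content,
    tool_calls: message.tool_calls,
  });

  // ...and the runtime appends each function's return value as a `tool` message.
  for (const toolCall of message.tool_calls ?? []) {
    const args = JSON.parse(toolCall.function.arguments);
    const result = await execute(toolCall.function.name, args);

    messages.push({
      role: "tool",
      tool_call_id: toolCall.id,
      content: JSON.stringify(result),
    });
  }

  return messages;
}
```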
The agent loop for Lares understands six functions.
(If you're looking at the code, these functions are declared in party/functions.ts.)
- toggleLight
  - lightId — ID of the light to toggle
- moveRobot
  - destinationRoomId — ID of the adjacent room to move the robot to
- lookWithRobot
- askUser
  - question — The question to ask the user
- halt
  - messageToUser — Report back to the user with a message. Never use this to ask for more information.
- decideBestFunction
  - reasoning — no description provided in the prompt
  - bestFunction — an enum of the above function names
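To illustrate the shape of these declarations (the real ones are in party/functions.ts and may differ in wording), two of them might be written as OpenAI tool schemas like this:

```typescript
import OpenAI from "openai";

// Illustrative tool schemas for two of the six functions; the real
// declarations live in party/functions.ts.
export const exampleTools: OpenAI.Chat.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "toggleLight",
      description: "Toggle a light in the home on or off.",
      parameters: {
        type: "object",
        properties: {
          lightId: { type: "string", description: "ID of the light to toggle" },
        },
        required: ["lightId"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "decideBestFunction",
      description: "Decide which function will best move you closer to your goal.",
      parameters: {
        type: "object",
        properties: {
          // No description is provided for reasoning in the prompt.
          reasoning: { type: "string" },
          bestFunction: {
            type: "string",
            enum: ["toggleLight", "moveRobot", "lookWithRobot", "askUser", "halt"],
          },
        },
        required: ["reasoning", "bestFunction"],
      },
    },
  },
];
```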
On each turn of the loop, a transcript is constructed. The transcript is the prompt for the AI.
(If you're looking at the code, this is done in the agent code in party/assistant.ts.)
| role | content | Notes |
| --- | --- | --- |
| system | You are a helpful AI assistant controlling a smart home. When the user refers to 'you' they either mean the entire home or the moveable robot that you control. You always know the layout of the home, the devices in it (which includes the robot), and their current states. You do not know the position of moveable items such as people, animals, and objects that might be carried, and you do not even know their names initially. You can build your knowledge by using various devices. You call functions to control the devices in the home. | |
| system | The current state of the home follows. The state of devices is up to date with your most recent functions. Consult the transcript for any other discoveries you have made: { … } | See below. |
| system | The user's instruction follows. Your goal is to fulfil it to the best of your ability. It may take a sequence of many instructions to achieve your goal, and you may have to deliberately build knowledge so you know enough to reach the goal. At each step, call the best function to move you closer to your goal. When you're done, call the halt function. | |
| user | Instruction: ... | e.g. Instruction: turn on the kitchen light |
The history of interactions is then appended.
For example, for the instruction turn on the light in the lounge, the first loop may result in the following addition to the transcript.
| role | content or tool_calls | Notes |
| --- | --- | --- |
| assistant | I have decided to call the function toggleLight because: To turn on the light in the lounge, I can directly use the toggleLight function since the room and light states are known. | |
| assistant | [ { "id": "call_RxEcwvfYALQv0nTsrhOtDhNT", "type": "function", "function": { "name": "toggleLight", "arguments": "{\"lightId\":\"lounge-light\"}" } } ] | |
| tool | { "success": "Light with id <lounge-light> in room with id <lounge> is now: on" } | |
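Putting the pieces together, the per-turn transcript construction might look roughly like the sketch below. The function name and signature are assumptions, and the full prompt strings are elided; the real code is in party/assistant.ts:

```typescript
import OpenAI from "openai";

// A sketch of assembling the transcript each turn: fixed system framing,
// current world knowledge, the user's instruction, then the accumulated
// history of function calls and returns.
function buildTranscript(
  instruction: string,
  worldKnowledge: unknown,
  history: OpenAI.Chat.ChatCompletionMessageParam[]
): OpenAI.Chat.ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: "You are a helpful AI assistant controlling a smart home. ...",
    },
    {
      role: "system",
      content:
        "The current state of the home follows. The state of devices is up to date with your most recent functions. Consult the transcript for any other discoveries you have made: " +
        JSON.stringify(worldKnowledge),
    },
    {
      role: "system",
      content: "The user's instruction follows. Your goal is to fulfil it to the best of your ability. ...",
    },
    { role: "user", content: `Instruction: ${instruction}` },
    // Assistant function calls and tool returns from earlier turns.
    ...history,
  ];
}
```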
The agent is asked which function to use by first being prompted to call decideBestFunction.
The following messages are temporarily added to the transcript.
| role | content | Notes |
| --- | --- | --- |
| system | You now have the opportunity to decide how best to respond. Choose which function will be best. | |
| user | Which of these functions will best move you closer to your goal? - [function name] — [function description] | The five action functions are listed here |
Recall that decideBestFunction has two parameters: reasoning and bestFunction.
The reasoning parameter is an opportunity for chain-of-thought 9 (a series of intermediate reasoning steps), which improves reliability.
There is no return value for the intent function call. The call is not added to the transcript. This is to avoid distracting the agent in subsequent steps.
Instead a summary is added to the transcript, and a prompt to call the chosen action function.
| role | content | Notes |
| --- | --- | --- |
| assistant | I have decided to call the function [name] because: [reasoning] | |
| system | Call the available function to move closer to your goal. You may have to add arguments. Only use a function call. Do not reply with a text message. | |
Only a single action function is provided (e.g. moveRobot). The agent is not allowed to decide between tools at this point, only to complete the parameters.
Decoupling intent and action appears to improve reliability.
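A sketch of this two-step pattern, assuming the OpenAI Node SDK and illustrative names (the real implementation is in party/assistant.ts), could look like this:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Sketch: first force a call to decideBestFunction, then offer only the
// chosen function so the model just completes its parameters.
async function decideThenAct(
  transcript: OpenAI.Chat.ChatCompletionMessageParam[],
  actionTools: OpenAI.Chat.ChatCompletionTool[],
  decideTool: OpenAI.Chat.ChatCompletionTool
) {
  // Step 1: intent. Only decideBestFunction is callable here; these extra
  // messages are temporary and are not kept in the transcript.
  const decision = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...transcript,
      { role: "system", content: "You now have the opportunity to decide how best to respond. Choose which function will be best." },
      { role: "user", content: "Which of these functions will best move you closer to your goal? ..." },
    ],
    tools: [decideTool],
    tool_choice: { type: "function", function: { name: "decideBestFunction" } },
  });

  const { reasoning, bestFunction } = JSON.parse(
    decision.choices[0].message.tool_calls![0].function.arguments
  );

  // Step 2: action. Only the chosen function is offered; a summary of the
  // decision and a prompt to act are added instead of the intent call itself.
  const chosenTool = actionTools.find((t) => t.function.name === bestFunction)!;
  const action = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...transcript,
      { role: "assistant", content: `I have decided to call the function ${bestFunction} because: ${reasoning}` },
      { role: "system", content: "Call the available function to move closer to your goal. You may have to add arguments. Only use a function call. Do not reply with a text message." },
    ],
    tools: [chosenTool],
    tool_choice: { type: "function", function: { name: bestFunction } },
  });

  return action.choices[0].message.tool_calls![0];
}
```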
The transcript in each turn of the loop is not a historic transcript. Although it resembles a chat transcript (which is append-only), it is dynamically constructed.
In particular, the second message is the current state of the home.
This is the agent's world knowledge rather than the full hidden world state. Specifically: the hidden world state includes moveable items (like people and animals), whereas the world knowledge does not.
If you're running the code, compare:
- http://127.0.0.1:1999/parties/main/default/debug
- http://127.0.0.1:1999/parties/main/default/world
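As an illustration of the split, assuming hypothetical type shapes rather than the actual ones in the repo:

```typescript
// Illustrative shapes only: the hidden world state tracks moveable items
// (people, animals, carried objects); the agent's world knowledge does not.
type HiddenWorldState = {
  rooms: { id: string; adjacentRooms: string[] }[];
  lights: { id: string; roomId: string; on: boolean }[];
  robot: { roomId: string };
  moveableItems: { name: string; kind: "person" | "animal" | "object"; roomId: string }[];
};

type WorldKnowledge = Omit<HiddenWorldState, "moveableItems">;

// The second system message in the transcript is built from this projection,
// never from the full hidden state.
function toWorldKnowledge(state: HiddenWorldState): WorldKnowledge {
  return {
    rooms: state.rooms,
    lights: state.lights,
    robot: state.robot,
  };
}
```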
Supplying the latest state increases reliability, although it does sometimes confuse the agent regarding the order of historic events (e.g. it may say the light in the office was already on when it had turned on the light itself).
The agent approach is widely applicable.
Even this basic approach would be suitable for a real smart home, with an askUser-style opportunity for the agent to disambiguate instructions. A good place to start would be a new interface for an open source assistant such as the Home Assistant 10 app and framework.

Considering the potential applications, what I've learned with this Lares demo, and the general applicability of agentive systems, I have a number of observations.
Whereas some APIs are fine-grained to allow for a front-end UI, and some are terse and machine-readable for integration with other systems, an agent-facing API has specific requirements.
These are sufficiently different from regular APIs that agent-facing APIs will need to be designed and maintained specifically.
Chat is not a good abstraction for agents as, although it allows for function calling, it is optimised for an append-only transcript.
An agent-focused system would have abstractions for:
Also: non-JSON encoding. Encoded in JSON, function calls only decode cleanly when complete. This means it is not straightforward to use function calls to stream messages to the user.
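To illustrate the point with a snippet that is not from the repo:

```typescript
// A function call arrives token by token; its JSON arguments are incomplete
// until the very end, so they cannot be decoded (or shown to the user) early.
const partialArguments = '{"messageToUser":"The light in the kitch';

try {
  JSON.parse(partialArguments);
} catch (err) {
  console.log("Not decodable until complete:", (err as Error).message);
}
```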
Ideally this would move to high-performance structured generation 11 (the Python library outlines 12 achieves this by managing the next-token probability according to a specified grammar).
The simplicity of the agent approach means that devices such as stoves, lamps, and timers could have sophisticated interfaces without being connected to the internet.
It could even include voice control 13.
It provokes some interesting routes for research:
Pursuing agents in multiple scenarios will reveal more technical and UX challenges.
Because of the ease of implementing agents, we may anticipate a future where small agents are widespread, rather than single, monolithic agents.
With a proliferation of agents, the challenge will be in coordination and adapting to new environments, and in implementing and discovering functions.
For example, how does my personal AI pendant (with a built-in agent) know how to turn on the TV in an Airbnb that I haven't previously visited? If I ask my HVAC to prepare for tomorrow's weather, what function should it use to fetch the forecast?
We will need search engines or marketplaces for agent functions, and standards for calling conventions and agent-readable function documentation.
Consider the search engine or marketplace that returns the most appropriate TV remote control function for the Airbnb I'm staying in. It will have to consider signals such as the device model, the interfaces I've used before, what's been used historically by people in this location, and so on. How can these signals be shared and trained against in a privacy-preserving fashion?
I have discussed Marketplaces for AI agents 14 in detail elsewhere. One search engine embodiment is described in a patent: Shared Experiences 15.
Given the simplicity of the agent lifecycle code and functions, the emergent abilities of the agent must arise from the large language model itself. This provokes several research questions:
For example, on the strategy question, the problem solving strategy right now is a basic, step-by-step "hill climbing" approach. Could the agent be encouraged to make plans that look further ahead, or to test the feasibility of different approaches before engaging?
Even simple agents can exhibit emergent behavior such as basic problem solving. There is very little code in Lares outside the large language model itself. There are many sophisticated AI agents available today, and the simplicity of the code implies that many more will be developed.
So we can anticipate a future proliferation of agents. This has implications and opportunities: for user experience, standards/interop, underlying technology platforms, and business.
Thank you PartyKit 4 for supporting this work.