Auto-GPT Reading Group Meeting Transcript

If anyone has any questions, feel free to interrupt me at any time. So this paper is on large language models and planning. Historically, LLMs have not been amazing at creating plans or executing them. Specifically, the plans they create often have holes in them, or if you followed the LLM's plan for a task, it often wouldn't actually result in the task being completed. So this paper combines LLMs with a more classical form of AI, known as a planning agent. How this works is the planning agent is given a bunch of descriptions about the environment, the task, its goals, and what it can do. It does that in a domain definition language, specifically called PDDL in the case of this paper. Just for clarity, I wasn't familiar with this before, so I'm going to share my screen with some slides about PDDL. So, Planning Domain Definition Language. The components of a PDDL planning task, which will be fairly important to know for understanding this paper, are: objects, the things in the world that interest us; predicates, which are statements about things in the world; the initial state, the starting state; the goal specification, the ending state we want to end up in; and the actions, or operators, which are how we interact with the world. It'll become clearer in a second how that applies to our specific paper. So, continuing with the abstract. If you give an LLM a task, it translates that task into PDDL, this domain definition language, for the solving agent. An LLM is quite good at translating tasks from natural language to PDDL, and that's one of the discoveries of this paper: LLMs are bad at planning on their own, but they're good at translating natural language into the domain-specific language that's needed by planning agents. That's how they combine those two models in this paper. So this is an example of a planning failure for GPT-4. It's not able to completely think through the steps, and the steps it gives would not actually result in the goal state given the initial state. There are a few theories later in the paper about why LLMs are not well suited for this kind of planning. But this example task of stacking blocks is a classical task for AI planning agents and has been around for decades. So there's a very large and established body of research about how to solve tasks like this, with specific initial states, goal states, and operators, and it's by using that PDDL language; it's a fairly robust and mature area of research. Yep. So it goes LLM to PDDL, then the PDDL is fed to the planning algorithm, the planning algorithm gives back PDDL instructions to the LLM, and the LLM translates those into natural language. They found that by combining an LLM with planning, which is what LLM + P means, they're able to do more than the LLM on its own. In this paper, they refer to that as LLM + P, as opposed to LLM-as-P, where the LLM itself is the planner. So here's an important limitation worth noting: they don't actually ask the LLM to recognize when a prompt it's been given should be translated into PDDL and fed to the planning agent, and when using PDDL is not a good idea.
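To make those components concrete, here is a minimal sketch of a blocks-world problem file in PDDL, held in a Python string for illustration. The block names are made up; the (:objects ...), (:init ...), and (:goal ...) sections are standard PDDL problem structure.

```python
# A minimal blocks-world problem in PDDL, embedded as a Python string.
# Block names (b1..b3) are illustrative; the section layout is standard.
BLOCKS_PROBLEM = """
(define (problem stack-three)
  (:domain blocksworld)
  ;; Objects: the things in the world that interest us.
  (:objects b1 b2 b3)
  ;; Initial state: the predicates that hold at the start.
  (:init (arm-empty)
         (on-table b1) (on-table b2) (on b3 b1)
         (clear b2) (clear b3))
  ;; Goal specification: the state we want to end up in.
  (:goal (and (on b1 b2) (on b2 b3))))
"""

if __name__ == "__main__":
    print(BLOCKS_PROBLEM)
```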
So it's not able to invoke this function of using the planning agent on its own. I think the fact that the LLM can't invoke this on its own is fairly limiting from Auto-GPT's perspective, because GPT-4, or whatever language model we're using, would have to know that it should utilize this. Potentially, if we put this planning agent within a plugin or command and gave it that context, it could learn how to use it, but it's a limitation to keep in mind for this specific paper and their implementation. So, classical planning problems. This is where PDDL comes in. We have a problem P, and P is defined using, well, this looks like a P, but I don't want to call it that, so we'll call it scripty P. Does anyone have an opinion on what we should call this? An L? Okay, perfect, thank you. So L is a finite set of states, all the possible states in the world. For example, for block stacking, L would be all possible configurations of these blocks, stacked in any possible arrangement. That's the entire search space we're looking through, and each specific state is a member of L. Each state we work with is a factored state: it's in L and it's defined by a fixed set of variables. The initial state is also, of course, within L. And then our goals are a subset of L: of all the possible states of our space, there's a subset that are goal states, and that's what we're working towards. Then A is a set of symbolic actions, and that's how we do things in the world. For example, that's how you would move blocks in this block-stacking example: you'd put, say, B5 on B3, where "on" describes the relation and "move" is the operator, and that operator is a member of A, the set of all possible actions. One second, sorry. So this is a very important figure that's worth looking through. The first setup is just a normal LLM doing the planning. It's given the domain, meaning the explanation of the task and the initial state, all those things about how to operate and what task it's being given, and then the specific problem, which includes the desired end goal state. And then the language model itself formulates a plan. It's not great at that. The second setup is LLM-as-planner with in-context learning. It's the same setup, where the domain is given as well as the problem, but in addition, context is given, essentially a one-shot example: a single example of how to complete a task inside the domain that is different from the problem the LLM is now given. And then LLM + P is what we're talking about. It gets the problem, and it also gets context, which is an example. But the LLM isn't asked to solve the problem; it's just asked to formulate the PDDL. So really what the context is providing is the structure of how to write the PDDL. I'm pretty sure later in the paper they talk about how important this context is for getting the LLM to write good PDDL.
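As a quick formalization of what was just described, here is one standard way to write down a classical planning problem, following the notation used above; this is a sketch, and the paper's exact symbols may differ:

```latex
\[
  P = \langle \mathcal{L},\; s_{0},\; \mathcal{G},\; A,\; f \rangle
\]
% \mathcal{L}: the finite set of states (the search space),
% s_{0} \in \mathcal{L}: the initial state,
% \mathcal{G} \subseteq \mathcal{L}: the set of goal states,
% A: the finite set of symbolic actions (operators),
% f : \mathcal{L} \times A \to \mathcal{L}: the deterministic transition function.
% A solution is a sequence of actions in A that drives s_{0} into some state in \mathcal{G}.
```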
So then the PDDL for the problem, which in this case describes the specific problem the LLM was just given, and the domain, which is the description of the problem space and how you take actions within it, are both given to the planner. This goes into a bit of a separate field of AI and computer science, classical planning agents, but the planning agent will always return a correct solution, and oftentimes it will actually return the optimal or close-to-optimal solution; I think all PDDL tasks should have an optimal solution that can be found in finite time. That PDDL plan is then returned from the planner, still in PDDL, and the LLM translates that PDDL plan into a natural-language plan and is then able to execute on it. This link just goes to the slides I showed earlier; if you're interested in more about PDDL, that's a good resource. So, LLMs are bad at planning, especially on a long time horizon, but they're good at describing and especially translating, and that's essentially what this task is: translating from natural language to PDDL. I have a note here that that's arguably a more complex task than actually searching a search space for the correct solution to a planning problem, but LLMs are able to do it very well. They translate the English-language task into PDDL and then pass that to the planning agent. So, in-context learning: here's an example. The context provided is an example problem, the objects, the initial state, and the goal state, and this is just explaining to the LLM the format it should use for what it passes to the planner agent. The LLM's output should follow that same structure, which it does, and then this is what the planner returns, which is the solution to the problem. A limitation of this design of LLM + P is that it requires the problem domain to be defined by a human expert beforehand. The domain is the description of the space and the actions you can take within it, so each one is context-specific. In this case, the domain is moving blocks: the blocks themselves, and the ability to put them on the table, put them on another block, or have no block above them. That's the full domain. To utilize LLM + P, we have to have the domain already predefined by a human for the LLM to create PDDL tasks within. So if an LLM came across an open-ended or real-life task that doesn't have a strictly defined domain, this likely wouldn't work, at least right now, because you need a human to provide the domain description. I think there's a pretty interesting future in LLMs defining domain descriptions on their own; it would be pretty promising if you could replace humans with LLMs in that case. So, the assumptions they make for LLM + P are: first, that the chatbot, or the LLM, knows when to trigger the planning routine, which I think is a pretty big assumption. Second, that the domain PDDL file is provided by a human for every task, another big assumption. The third assumption, that a simple example problem in natural language and its corresponding problem PDDL will be provided beforehand, is a more reasonable one. Next, the related work on classical planning.
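Here is a minimal sketch of that pipeline in Python, just to pin down the data flow. The llm() helper and the prompt wording are placeholders, and the planner invocation is an assumption for illustration: it shells out to a Fast Downward-style planner, while the paper's actual prompts and planner setup may differ.

```python
import subprocess
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API (hypothetical)."""
    raise NotImplementedError

def llm_plus_p(task_nl: str, domain_pddl: str,
               example_nl: str, example_pddl: str) -> str:
    # 1. The LLM translates the natural-language task into problem PDDL,
    #    using a one-shot context example to pin down the format.
    problem_pddl = llm(
        f"Example task: {example_nl}\nExample problem PDDL: {example_pddl}\n"
        f"Now write the problem PDDL for this task: {task_nl}"
    )
    Path("domain.pddl").write_text(domain_pddl)
    Path("problem.pddl").write_text(problem_pddl)

    # 2. A classical planner searches the state space and returns a plan,
    #    still expressed as PDDL actions. (Planner command is illustrative.)
    subprocess.run(
        ["fast-downward.py", "domain.pddl", "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    plan_pddl = Path("sas_plan").read_text()  # planner's plan output file

    # 3. The LLM translates the PDDL plan back into natural language.
    return llm(f"Translate this PDDL plan into plain-English steps:\n{plan_pddl}")
```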
They mention Shakey, the first robot to have been equipped with a planning component. This is really some of the first work in computer science with planning agents. I'm going to switch my screen and share an image of Shakey; I think it's worth a look. This is what Shakey looks like. Shakey was a robot in an AI research lab that navigated a little test area with blocks. Just an interesting sample of historical uses of AI. So, people have tried to use language models on their own as planning agents, and it tends to not go very well. A major drawback is their lack of long-horizon reasoning and planning ability for complex tasks. A limitation is that they often output essentially nonsense when asked to provide a plan; oftentimes their plan won't work if implemented, so it's just incorrect on its own. They cite some other work related to classical planning with LLMs, and I've added both of those papers to the paper discussion list; I think they could be quite useful and on-topic for Auto-GPT, as planning is a major weak spot for Auto-GPT at the moment. They mention Toolformer as another example of LLMs accessing external tools or modules; Toolformer is a paper we read a week or two ago. It's important to note that LLM + P doesn't rely on fine-tuning or retraining of the LLM itself. That is a big upside for Auto-GPT, as we currently don't have infrastructure to fine-tune models or use fine-tuned models. So the fact that they're able to do this with generic LLMs is pretty promising, and the fact that generic LLMs can translate natural-language tasks into PDDL task descriptions is also quite promising for the applicability of this to Auto-GPT. So, they conduct some experiments. Some of the questions they consider are: how well does LLM-as-P work, meaning the language model itself making the plan, and to what extent can LLMs directly be used to plan? They're quite bad at it. LLM + P works much better than LLM-as-P, and context is extremely important to the success of LLM + P, because the context helps the LLM a lot in generating correct and valid PDDL code for the domain it's working within. So that context, which in this case is really just an example, is quite important, and that's a pretty big limitation, I think. So, here are the problem domains. These are classical planning problems that have existed in the field for a long time. Blocksworld: the robot is given a bunch of blocks and tasked with rearranging them in a specific order; that's what we saw up above. Barman is a cocktail-making robot, with a bunch of combinations of ingredients and actions it can use. Floortile is about painting floor tiles while moving around. In Grippers, the robot is given a set of grippers, and objects are placed within different rooms, so it traverses through rooms finding and moving objects. Storage is a combinatorial problem of using hoists to lift and move crates to different locations.
Essentially, it measures the AI's ability to see how inputs affect other inputs, because you have to combine two inputs to get the desired output. Termes is about building complex structures by carrying and placing blocks. And Tyreworld, which is my favorite task: the robot is given the job of replacing flat tyres with intact tyres on the wheels. It requires inflating the intact tyres, tightening the nuts, and then moving all the tools back to the boot of the car when it's done. This ends up not being a great example for LLM + P, but I think it's quite interesting and a bit of a bizarre task-planning world. For the experimental setup, they use text-davinci-003, because apparently at the time of publication they didn't have access to GPT-4, which is unfortunate. I think that would have been really great to see, and I hope OpenAI increases access in the future, especially to researchers; it seems like if anyone should have access to GPT-4, it's people who are publishing papers like this. So, among the findings from the experiments: LLM-as-P, so LLMs planning on their own, is able to produce a plan for every problem, but most of the plans are not feasible. That's mainly because LLMs as planners lack the ability to reason about preconditions as well as the states of different objects. The model just can't store that much information about the little reality you gave it, and it often can't understand the implications of the moves it makes and then remember those implications later on. So if it moves a block, it often forgets later on that it moved that block, for example. The Tyreworld finding essentially just shows that Tyreworld is a pretty weak benchmark task, because if you give the LLM context, it can solve most problems you give it, and that's just because almost every problem is the exact same, so it can copy from the context you gave it. Outside of Tyreworld, LLM-as-P fails pretty badly. In particular, in Blocksworld, like I was talking about, it can't keep track of properties like whether a block is on another block, or whether there's nothing on top of a block and it's quote-unquote "clear". In the Grippers domain, the robot can only pick up balls that are in the same room, and the LLM on its own doesn't understand that it needs to, and is able to, move between rooms. So that's a big limitation of LLMs as planners. This is an interesting chart highlighting how much better the LLM combined with the planning algorithm is compared to the LLM as the planner on its own with no context. This is the LLM as the planner with context, so with an example. Interestingly, the LLM with the planning algorithm but no context is also at zero. I'm pretty sure this is not a failure of the planning algorithm; it's actually a failure of the LLM, because when the LLM isn't provided context for the problem it's given, it often doesn't output correct or valid PDDL code for the problem space. When you give it the context of a similar problem within the same domain, it's able to look at that example and accurately apply that syntax to its own problem. That's why success rates are so much higher for the LLM combined with the planner when the LLM is also given context. Yep, and this is exactly what I was talking about: the failures of LLM + P with no context come entirely from incorrect problem encodings.
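To make the precondition point concrete, here is the classic textbook formulation of a blocks-world action in PDDL, again embedded in a Python string for illustration (this is the standard formulation, not necessarily the paper's exact file). The planner checks these preconditions mechanically at every step, which is exactly the bookkeeping the LLM-as-planner keeps getting wrong:

```python
# The classic blocks-world 'unstack' operator in PDDL, shown as a
# Python string. Note the explicit preconditions the planner enforces:
# the block being lifted must be clear, and the arm must be empty.
UNSTACK_ACTION = """
(:action unstack
  :parameters (?x ?y)
  :precondition (and (on ?x ?y) (clear ?x) (arm-empty))
  :effect (and (holding ?x) (clear ?y)
               (not (on ?x ?y)) (not (clear ?x)) (not (arm-empty))))
"""
```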
So, that's due to the LLM itself not being able to correctly translate natural language to PDDL code without having context. Future work: planning in general with large language models, and specifically using planning algorithms. Future work that would be interesting, like I mentioned, would be allowing the LLM to auto-detect when and how to apply the planning procedure, somehow training on a start token to initiate it, as well as reducing LLM + P's dependence on domain descriptions being written by humans. That's something that could be achieved, like they say in this paper, through fine-tuning. Even more so, I think you might be able to get a separate model to do the domain descriptions, since, once you have them, it's a one-time thing. And that's the end of the paper. Any thoughts? Yeah, I wanted to get a little bit more into PDDL. So it's telling us that we need to describe the context in PDDL as the observable features and functions, right? Yep. So what is PDDL in this case? Is it already a language that exists? Yeah. Am I still showing my screen? Yeah. So this is PDDL code. These are the objects, and this is the initial state. In this initial state of grabbing blocks, your arm is empty, B1 is on the table, and then here are the relations of all the other blocks to each other, and here's the goal state. And then the PDDL code that's produced by the planner is a sequence of actions, each with arguments depending on the operator. Do you have a more specific question? Would it be possible to use an alternative to PDDL for planning? Ah, okay. I'm not totally sure. In general, PDDL is a pretty standardized way to represent problems in planning, so if you wanted to use a planning agent, I think you'd have to use PDDL, because it's essentially the standard way to represent problems. Right, okay. I was going to give an example, maybe for context, if that's okay. Yeah, for sure. So, I was able to build a planner plugin for AutoGPT, which, now that I've read this paper with you, I understand is an LLM-as-P approach. It takes the list of goals that everybody knows about, it creates a list of tasks, and from there it tries to make the plan itself, as a tool, for example. That definitely works for a bit, but from other people's experiences, it tends to run into the issues that you mentioned: it tries to do the same task more than once, or, for example, it tries to check something off although it hasn't been done. So, although it's readable, and you might think it's able to follow along with the plan, it still drifts off, because it's not tied to an execution language, right? Yeah, a benefit of PDDL is that the planner will output a complete solution with every step that needs to be taken to complete the task. So it's just much more complete. So, in the context of AutoGPT, the PDDL would have the commands. Like here you have, for example, on-table, on, and so on, and then the planner here is doing unstack, put-down. Essentially, it's just having those two commands, but in the case of AutoGPT, those planner PDDL files would have our actual commands. Yeah, exactly.
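For a sense of what that complete solution looks like, here is the sort of plan a classical planner emits, solving the small instance sketched earlier; this is an illustrative example, not output reproduced from the paper:

```python
# Illustrative planner output for the small blocks-world instance above:
# a fully ordered sequence of ground actions, one per step.
EXAMPLE_PLAN = [
    "(unstack b3 b1)",   # take b3 off b1 (b3 is clear, arm is empty)
    "(put-down b3)",     # place b3 on the table
    "(pick-up b2)",      # lift b2 from the table
    "(stack b2 b3)",     # put b2 on b3
    "(pick-up b1)",      # lift b1 from the table
    "(stack b1 b2)",     # goal reached: b1 on b2, b2 on b3
]
```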
So, if we had a PDDL description of AutoGPT, and I'm not sure AutoGPT and our commands technically fit within the formal definition of what tasks or domains can be described in PDDL, but as an example to think about, the PDDL in our case would include things like open file, write to file, append to file, execute file. All of those individual commands would then be strung together by the planning agent, and the planning agent would give those instructions back to AutoGPT, who would execute them. Does that answer your question? Yeah, it does really shed a light, right? So the planner agent in this case would be a separate agent from the executioner. Yep, totally separate. It's not a transformer, it's not a large language model; it's really a search algorithm, because it searches the total problem space for the goal state. Anyone else have any thoughts on the paper, or questions? Yeah, thanks for doing this, this was great. I had a question about the domain. Say we're running AutoGPT: we're going to tell AutoGPT what to do anyway, so isn't that considered the domain? It seems like it was framed as a pretty big limitation in the paper. So, the domain wouldn't be what you tell AutoGPT to do. The domain would be all the things AutoGPT can do, and it would also include other things like the AutoGPT workspace, because it can work within there, as well as essentially the internet, because tasks involving AutoGPT can involve searching out to the internet. So the domain would be everything around AutoGPT that it exists in, and the specific problem would be the prompt that you give it. Gotcha. Could the domain be a general one, then, that includes all the different stuff, like analyze code and search Google? Potentially. Like I mentioned earlier, I'm not sure that AutoGPT's commands and its environment fit within the formal definition of what kind of planning problems can be defined using PDDL, because the actions and commands within AutoGPT are not deterministic, and you can't predict their outcomes. PDDL relies on having an initial state, an action, and the thing you apply the action to. For example, in the blocks example, given the initial state, if you say pick up block five and move it onto block three, you should be able to determine the entire resulting state just by knowing what command was executed. But because commands within AutoGPT are not deterministic, we don't know what Google is going to return, and we don't know how that will influence future commands and operations by AutoGPT. Because of that, we don't know how things even a few steps down the path will go, and that makes it really difficult to search a search space for solutions. So I'm not sure we'd be able to represent AutoGPT's environment within PDDL inherently. Gotcha, thanks. Any other questions or thoughts? Anyone have any papers they're interested in or want to read for future sessions? What was the paper you went over today? So, this was LLM + P.
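Purely as a thought experiment, and with the caveat just discussed that AutoGPT's non-deterministic commands don't really fit PDDL's assumptions, an AutoGPT-flavored domain fragment might look like the following. Every name and predicate here is hypothetical:

```python
# Hypothetical sketch only: a PDDL-style domain fragment using
# AutoGPT-like file commands. Real AutoGPT commands are not
# deterministic, so this is a thought experiment, not a working design.
AUTOGPT_DOMAIN = """
(define (domain autogpt-files)
  (:predicates (file-exists ?f) (file-open ?f)
               (file-written ?f) (file-executed ?f))
  (:action open-file
    :parameters (?f)
    :precondition (file-exists ?f)
    :effect (file-open ?f))
  (:action write-to-file
    :parameters (?f)
    :precondition (file-open ?f)
    :effect (file-written ?f))
  (:action execute-file
    :parameters (?f)
    :precondition (and (file-exists ?f) (file-written ?f))
    :effect (file-executed ?f)))
"""
```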
It combines an LLM with a classical planning agent by translating natural-language tasks into a domain-specific language, in this case the Planning Domain Definition Language. It feeds that PDDL description of the task to a classical planning agent, the planning agent returns, in PDDL, the formal steps that need to be taken to reach the goal state, and then the LLM translates those PDDL instructions back into natural language and executes them. Cool, thank you for explaining that. Yeah, I found the paper; I also put the link in the chat if anyone needs it. Oh, perfect, thank you. So, the part I find the most interesting is how we could make the PDDL part, or the planning-agent part, fit a different use case. Right now it's using PDDL, but are there other planning tools, for example something that's aware of time, so that you can prioritize stuff? That would be interesting. Oh yeah, that would be interesting. So, would the LLM or the formal planning agent be the one that has the temporal awareness? The LLM wouldn't be aware of that, but the planner agent would, if that makes sense. Yeah, that does. Interesting. We'd have to do some research into whether any planning algorithms exist that are able to take time into account, but I assume it wouldn't be that difficult to essentially assign metadata to different operations, objects, and the initial state in order to explain time and the effect of time to the planning agent. Metadata would most certainly be the most efficient and truthful way to do it, right? Yeah. You wouldn't get hallucinations off of reading two timestamps, especially if it was just extrapolated from the understanding of: hey, go check how long this worker has been running, here's the reported time back, does that fall within guideline X, yes or no, and then make a decision. Yeah, LLMs are definitely able to understand time and the relationship between different times. I was thinking more about the classical planning agent: how would you explain, within the domain description, that an action, like picking up a block and setting it down, takes 30 seconds, or that some other action takes one minute? Yeah, I guess it would require building a specific framework for that; I'm not aware of a system that provides that. Yeah, I think that's a good point, and it might be worth looking into. I think it could definitely fit within the theoretical framework of using PDDL to describe problems and solutions, but temporal considerations might be difficult for a search agent to consider. Anyone else? If anyone has a question, please feel free. Sorry, you were breaking up for me. I was going to say that maybe there's another route, which I don't know if Disho was working on before, that included using a calendar system: the planner would be integrated with Google Calendar or your iPhone calendar and use the tasks that already exist there to see what's going to be necessary to execute the plan. Yeah, that could definitely work for the LLM, as it's able to interact with external services. I don't see any way you could develop a formal or classical planning agent to interact with external sources of information, just because the entire search space is so strictly defined.
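For what it's worth, later versions of the language do support exactly this: PDDL 2.1 adds durative actions with explicit durations that temporal planners can reason over. A minimal sketch, with illustrative action and predicate names:

```python
# PDDL 2.1 durative-action sketch: the :duration field tells a
# temporal planner that moving a block takes 30 time units.
# Action and predicate names are illustrative.
TIMED_MOVE = """
(:durative-action move-block
  :parameters (?x ?y)
  :duration (= ?duration 30)
  :condition (and (at start (clear ?x)) (at start (clear ?y)))
  :effect (and (at end (on ?x ?y)) (at start (not (clear ?y)))))
"""
```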
Anyone else have anything they want to add or talk about? Okay, I guess that concludes our discussion of the paper. Cool, thanks. Thanks for coming, everyone. I hope my explanation helped. It really did; I think I've grasped it. I appreciate you taking the time to read and explain it. Of course, yeah, happy to. I guess we'll meet again on Friday at 1 p.m. I think we're reading the Simulacra paper. Oh, nice, that's a good one. Yeah, it should be really fun; I've been hearing about it non-stop. Is the LoRA paper not on the schedule anymore? Oh, wait, actually, no, you're right: it is the LoRA paper on Friday, and Simulacra is on Monday. Thank you for correcting me. I'm looking forward to the LoRA one, so I wanted to know where it was in the timeline. Thank you for clarifying. Do we have a repository of links for past papers? Yep, all the drafts are there. Okay, thanks. So, are we just going to hang out in this chat now? Yeah, we can. Okay, Mark, are you there? Can't hear you if you're talking. I think that was an accident. But I got the Auto-GPT plugin running and working with Discord, so that's good.