Technical Challenges of Working with ChatGPT API

Today, I want to tell you about the issues you may encounter when working with the ChatGPT API and how we solved them on the Luna Stories project. The idea of the project was to generate children's stories with a text-based AI model, with interactive choices for how the story continues and with the children themselves as the main characters.

Token Limitation with GPT-3.5 Turbo

One of the initial challenges when working with GPT-3.5 Turbo was exceeding the token limit. It's worth noting that we later switched to the more advanced GPT-4, which has a larger context window. However, when we started developing the application, the newer model was not yet available, so we had no choice. In your case, the choice between GPT-3.5 and GPT-4 may depend on your project's specific requirements and budget. You don't necessarily have to use the "smartest" model, considering that GPT-4 (the latest version at the time of writing) comes at a significantly higher cost.
GPT-4:

| Model       | Input                | Output               |
| 8K context  | $0.03 / 1K tokens    | $0.06 / 1K tokens    |
| 32K context | $0.06 / 1K tokens    | $0.12 / 1K tokens    |

GPT-3.5 Turbo:

| Model       | Input                | Output               |
| 4K context  | $0.0015 / 1K tokens  | $0.002 / 1K tokens   |
| 16K context | $0.003 / 1K tokens   | $0.004 / 1K tokens   |
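
To put these numbers in perspective, a hypothetical request with a 1K-token prompt and a 2K-token completion would cost roughly 1 × $0.03 + 2 × $0.06 = $0.15 on GPT-4 with the 8K context, versus roughly 1 × $0.0015 + 2 × $0.002 = $0.0055 on GPT-3.5 Turbo with the 4K context, a difference of more than 25x.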

Markup, and how you can save a little money on it

Let's start with the basics. To fully utilize the API, it's essential to understand what context is. Context is what allows you to use GPT like a chat: the model itself doesn't remember previous queries, so the developer has to resend the relevant conversation history with every request. Context has its own markup and structure (ChatML) based on roles:
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the World Series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}

While "user" and "assistant" roles are self-explanatory, it's crucial to pay attention to the "system" role. Its purpose is to define the model's general behavior. However, it's essential to understand that the "system" role may have varying effects on the neural network in different GPT versions. In GPT-3.5, the "system" role has such minimal impact that its use is just a waste of tokens. For this reason, we did not use it at this stage of development.

Token Optimization for Story Generation

A saving that small would not have solved our problem on its own; we were exceeding the token limit by far too much. We needed to trim the context, but in a way that did not degrade the quality of the generated stories. Through trial and error, we developed a pattern for context reduction (a rough sketch of it follows the list):
  1. When generating a story preview (in the first stage of development the preview was generated separately; more on that later), there is no context yet. Only the user's request to the neural network is present, so there is nothing to trim;
  2. For the first two story chapters, we decided not to truncate the context, as we could fit within the token limits, and the generated content met our requirements;
  3. For generating each of the remaining story chapters, we used the two previous chapters as context. This significantly optimized token usage and allowed us to generate stories seamlessly from start to finish.
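
In code, the trimming rule can be sketched roughly like this; the helper name and prompt wording are illustrative assumptions, and chat is the helper from the sketch above:

# previous_chapters is an array of already generated chapter texts.
def context_for_next_chapter(previous_chapters)
  # Chapters 1 and 2: keep the full history, it still fits within the token limit.
  # Later chapters: keep only the two most recent chapters as context.
  history = previous_chapters.size <= 2 ? previous_chapters : previous_chapters.last(2)

  history.map { |text| { role: "assistant", content: text } } +
    [{ role: "user", content: "Write the next chapter of the story." }]
end

next_chapter = chat(context_for_next_chapter(previous_chapters))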

Efficient API Usage

Requests to the OpenAI API are typically not very fast. On top of that, since the context has to be resent with every call, more requests mean higher token consumption, which is not free. Minimizing the number of requests therefore makes working with this technology both faster and cheaper.

Our project included a feature for interactively choosing how the story would unfold: after each chapter, the user should be presented with two possible continuations. The first attempt to optimize this process was to make the API return both the story chapter and its continuation options in a single JSON response, which would then be parsed programmatically. This worked, and the API consistently returned responses matching the specified schema. However, due to the high token requirements of such a response, the content of the story itself turned out rather short and did not meet our client's expectations.
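
As a rough illustration (the prompt wording and JSON schema here are ours for the sketch, reusing the chat helper from above), that first attempt looked something like this:

json_prompt = <<~PROMPT
  Write the next chapter of the story and two options for how it could continue.
  Respond with JSON only, in the form:
  {"chapter": "...", "options": ["...", "..."]}
PROMPT

messages = [
  # ...the trimmed story context from the earlier sketch goes here...
  { role: "user", content: json_prompt }
]

parsed  = JSON.parse(chat(messages))
chapter = parsed["chapter"]
options = parsed["options"]
# The schema was respected consistently, but the chapter text itself came back
# rather short, which is why this single-request approach was abandoned.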

To solve this issue, we decided to split the generation of a story chapter and the generation of the continuation options into several separate requests. The basic solution was to generate the text of a chapter, then generate a summary of that chapter, and finally generate each answer choice separately based on that summary. The summary existed solely to avoid sending the entire long chapter text in each request for a continuation option (see the previous problem in this article).
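
Sketched with the same chat helper (prompts are illustrative), the basic split looked roughly like this:

chapter = chat(context) # context built as in the trimming sketch above
summary = chat([{ role: "user",
                  content: "Summarize this chapter in a few sentences:\n\n#{chapter}" }])
option_a = chat([{ role: "user",
                   content: "Based on this summary, suggest one way the story could continue:\n\n#{summary}" }])
option_b = chat([{ role: "user",
                   content: "Based on this summary, suggest a different way the story could continue:\n\n#{summary}" }])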

We needed to optimize this process further. The idea of generating in JSON format still seemed very good to me, and I decided to apply it at least partially. The reasoning was as follows: since the only real problem was the amount of text in the chapter itself, the chapter would be generated with its own separate request, while the two variants of the story continuation would be requested in JSON format based on the separately generated chapter. This optimization reduced the number of requests, and the request for generating options was very fast, as it didn't require heavy "thinking" from the AI.
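
A sketch of the optimized flow, again with illustrative prompts:

# One request for the chapter itself (the heavy part)...
chapter = chat(context) # context built as in the trimming sketch above

# ...and one cheap request that returns both continuation options as JSON,
# based only on the freshly generated chapter.
options_prompt = <<~PROMPT
  Here is the latest chapter of a children's story:

  #{chapter}

  Suggest two short options for how the story could continue.
  Respond with JSON only, in the form: {"options": ["...", "..."]}
PROMPT

options = JSON.parse(chat([{ role: "user", content: options_prompt }]))["options"]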

Still not fast enough

However, despite these optimizations, response times were still not good enough. Chapters were generating faster, but after making a choice, the user still had to wait for the next chapter to be generated. So, we implemented the following flow:
  1. The user clicks the payment button and waits for the generation. During this time, the first chapter is generated.
  2. Once the first chapter is generated, the user gains access to read it. Simultaneously, the generation of the two potential next chapters begins in Sidekiq workers (sketched after this list). During this phase, the choice buttons are locked so the user cannot make a selection until the generation of the next chapters is complete. As expected, this did not hurt the user experience: in practice, the next chapters were almost always ready before the user finished reading the current one, which was exactly our goal.
  3. The user finishes reading the current chapter and chooses between the two options for the story's continuation, which have already been generated by that point. After the user clicks one of the options, the generation of the two following chapters begins in the background, based on the chosen option. This process repeats until the final chapter, which concludes the story.
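
In Rails terms, the background pre-generation can be sketched like this; the worker name, the Story model and its associations are illustrative assumptions, and chat is the helper from the earlier sketch:

class PregenerateChapterWorker
  include Sidekiq::Worker

  # Pre-generates the chapter that follows one continuation option while the
  # user is still reading the current chapter.
  def perform(story_id, option_text)
    story = Story.find(story_id) # hypothetical model holding generated chapters
    # Trimmed context: only the last two chapters plus the chosen continuation.
    messages = story.chapters.last(2).map { |c| { role: "assistant", content: c.text } }
    messages << { role: "user", content: "Continue the story along this option: #{option_text}" }
    story.chapters.create!(text: chat(messages))
  end
end

# One such worker is enqueued per continuation option right after a chapter is
# saved, e.g. PregenerateChapterWorker.perform_async(story.id, option_text), so
# both possible next chapters are ready before the user makes a choice.
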
As you can see, this sequence does not mention preview generation. That is because previews were originally designed to give the user a quick response with a brief description of the story; only after seeing the preview would the user decide whether to purchase the full story. In this way we aimed to save tokens: why generate chapters of a story that the user might never purchase? Later on, the client prioritized the comfort of user interaction over token savings, so chapters are now generated in the background right away, and the preview is taken from the first few lines of the fully generated first chapter.

Summary

Such technologies capture the imagination and can inspire numerous ideas. However, modern AI models have several quirks, some of which are hard to analyze and can only be dealt with by trial and error in practice. In this article, we discussed the purely technical difficulties, but the unpredictability of the model's behavior presented the more significant problems. There were cases where nine generations went well, and then one unsuccessful generation would disrupt everything. Conversely, we'd find errors and wonder how the neural network could generate anything coherent in the first place.

The funniest part is that sharing detailed experiences often doesn't make much sense, because each situation calls for its own individual solution. It's important not to delude ourselves into treating tools like ChatGPT as a panacea. As one of our CTOs put it, ChatGPT is like a not-so-smart junior developer: it seems capable of quite a lot, but it's better to double-check its work a few times.
