
Features of backend development with ChatGPT API

In the spring, a client approached us at JetRockets with the idea of a mobile application where people could generate children's stories using the ChatGPT API.
Sergei from the backend team here! Today I want to share my experience working on the Luna's Stories project. The core logic of the application was built by two developers, Nikolai and me, and a third developer handled the admin panel. On the frontend side we had three developers - Maxim, Ivan, and Alexander - plus a QA engineer, Dmitry. This article gives an overview of the backend development, while Nikolai will dig into the finer details and nuances of our solutions.

Where to begin?

At JetRockets we primarily build apps with Ruby on Rails, so the first step was to find a gem for working with the ChatGPT API. In reality, there weren't many gems to choose from. The technology was fairly new, and even the most popular gem at the time, "ruby-openai" by Alex Rudall, which is what we used, didn't cover every aspect of the API. Yes, we could have used the official OpenAI library in Python, but the company's deepest expertise is in Ruby on Rails, so ease of maintenance outweighed development convenience.

I won't go into detail about how to use the gem; it's quite straightforward. I'd rather focus on OpenAI's documentation for the ChatGPT API, because without understanding it, "ruby-openai" won't get you very far.

How do you work with context?

Those who have used ChatGPT in a browser know that interacting with it resembles a regular chat, where you ask questions, the AI responds, and the entire conversation is saved. When working with the standard API, no one saves your conversation; you have to do it yourself. To do this, you first need to know about ChatML - Chat Markup Language.
Without understanding how ChatML works, you can't properly preserve context.

Fortunately, ChatML isn't complicated to use, but when we started development at the end of March, we learned that it had only been introduced in beta in early March. Currently it has just three roles: "system", "user", and "assistant". The first configures the AI's general behavior, the second carries the user's questions, and the third holds the AI's responses. This lets you both preserve the conversation as context and even write assistant messages yourself, which is a subtler way of steering the pattern of its answers.

An example from OpenAI's documentation:

import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}
    ]
)

Alright, I think ChatML should be clear now. For more detailed information, I recommend reading the documentation. We started by creating a separate model that stores the context as JSON; PostgreSQL's jsonb type was a good fit for this. With each new request we retrieved the saved context, added our question, then appended the received answer to the context and saved it again.
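
Roughly, the flow looked like the sketch below. It's written in the same pre-1.0 Python client style as the docs snippet above purely for illustration (our real code was Ruby with ruby-openai), and the storage helpers are placeholders standing in for the model with the jsonb column:

import openai

# stands in for the table with the jsonb context column
_CONTEXTS = {}

def load_context(story_id):
    # a toy system prompt for illustration only
    return _CONTEXTS.get(story_id, [{"role": "system", "content": "You write children's stories."}])

def save_context(story_id, messages):
    _CONTEXTS[story_id] = messages

def ask(story_id, question):
    # retrieve the saved conversation and add the new question
    messages = load_context(story_id)
    messages.append({"role": "user", "content": question})

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    answer = response["choices"][0]["message"]["content"]

    # append the model's reply and persist the whole context again
    messages.append({"role": "assistant", "content": answer})
    save_context(story_id, messages)
    return answer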

How do you count tokens correctly?

Initially everything went well: we asked questions and recorded the context. Then, unexpectedly, we started getting errors saying we were exceeding the context limit. As you may know, the GPT-3.5 model we were using has a context limit of 4,096 tokens. We were puzzled, because we had estimated in advance using the rough rule that 75 words equal about 100 tokens, and by that math we thought we had plenty of tokens in reserve.

Knowing that this approach wouldn't work, the only correct solution is to use OpenAI's official tokenizer library, "tiktoken". Some of you may be familiar with OpenAI's online token calculator, but keep in mind that it counts tokens for GPT-3, and that matters: GPT-3.5 and GPT-4 use a different encoding (cl100k_base). So the online calculator will give you one number, while tiktoken, where you can pick the exact model, will give you another.
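
Counting with tiktoken takes just a couple of lines; a minimal example with the Python library:

import tiktoken

# pick the encoding for the model you actually call;
# gpt-3.5-turbo and gpt-4 both use cl100k_base, unlike older GPT-3 models
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Once upon a time, in a quiet forest, there lived a little fox."
print(len(encoding.encode(text)))  # the number of tokens this text will consume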

So now we know how to count tokens, but that alone doesn't solve the problem: we were still exceeding the limit. The context had to be trimmed. Ideally, you would use tiktoken to calculate in advance how many tokens you have left and free up exactly enough room for the new ones; we opted for something simpler and just cut off part of the context. As a preview: we eventually found a way to avoid doing this at all.
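
For the record, the "simpler approach" boiled down to something like this sketch: keep the system message and drop the oldest exchanges until the prompt, plus a reserve for the answer, fits. The count_tokens helper here is an approximation that sums tiktoken counts over all messages (the chat format also adds a few tokens of overhead per message):

import tiktoken

def count_tokens(messages, encoding=tiktoken.get_encoding("cl100k_base")):
    # rough count: tokens in each message's content plus ~4 tokens of
    # chat-format overhead per message
    return sum(len(encoding.encode(m["content"])) + 4 for m in messages)

def trim_context(messages, limit=4096, reserve=1024):
    # keep the system message, drop the oldest user/assistant messages
    # until the prompt plus a reserve for the model's answer fits
    system, rest = messages[:1], messages[1:]
    while rest and count_tokens(system + rest) > limit - reserve:
        rest.pop(0)
    return system + rest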

How to improve the quality of answers?

Next, we ran into an issue with how ChatGPT was generating the stories: we simply weren't satisfied with the results. The stories lacked dialogue, the narration felt flat and declarative, and the choices we offered the user were dull. For context: after each chapter we asked the AI to generate two possible directions the story could take, letting the user pick one and thereby create a unique story.

We began studying how to write prompts more effectively, constantly editing them and gradually enhancing the quality. Unfortunately, I don't have any secret methodologies to share for crafting prompts that make the AI give better answers. In practice, besides describing what you want and experimenting constantly, there's not much more you can do.

This comes down to the very nature of the product OpenAI has built. The model's ability to do something other than what the developers intended isn't just a feature; it's also a curse. No matter how simple or elaborate your prompt is, the AI can sometimes sidestep it. Because of this, there's no foolproof way to make it do exactly what you want. All you can do is reduce the percentage of bad answers and plan in advance for the situations where users run into incorrect generation.

However, there are two real levers for improving the answers that OpenAI itself provides.

Firstly, if you have access to the GPT-4 API, use it. Besides giving noticeably better responses, it also has a much larger context window (8,192 tokens, with a 32K variant). For us, that meant we could stop truncating the context, which significantly improved the stories.

Secondly, make use of the "temperature" setting, a value between 0 and 2 that controls how deterministic the responses are. The lower the value, the more predictable the AI's answer; the higher it is, the more unpredictable the output becomes. In one case we ask the AI to return its response as JSON. To keep it from breaking the JSON (which it does very easily, even GPT-4), we set the temperature to just 0.2, while the default is 0.7.
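
As a toy illustration (this is not our production prompt), a low-temperature request that asks for JSON might look like this, again in the pre-1.0 Python client style used above:

import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.2,  # low temperature: more deterministic output, fewer broken JSON responses
    messages=[
        {"role": "system",
         "content": "You write children's stories. Respond ONLY with valid JSON "
                    'of the form {"chapter": "...", "choices": ["...", "..."]}.'},
        {"role": "user", "content": "Continue the story about the little fox."},
    ],
)
print(response["choices"][0]["message"]["content"])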

Why so slow?

Once we started small-scale testing, we noticed that generating stories took quite a while. On average, a user had to wait about two minutes for a single chapter, sometimes longer. Naturally, that was unacceptable. Our testers who have children told us that in that time a child gets distracted and no longer wants to read the story.
So we decided to pre-generate chapters: as soon as a person generates the first chapter, the two possible continuations are generated in advance, so there is no waiting when they choose the next part. Of course, this uses more tokens and raises the cost of each story, but reading without waiting matters more.

We used Sidekiq with Redis for the background generation jobs, and the frontend periodically polled a GraphQL query to check whether the next chapters were ready, so the user could make their choice.

A major challenge in implementing this was that we had to rewrite a significant portion of the code to make background generation work correctly. It was especially important to get context preservation right: the context was now fragmented and had to be assembled in the correct order just before each request. We solved this by tying each context fragment to its chapter and keeping a history of the user's choices.
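
Conceptually, reassembling the context right before a request looked something like the sketch below (simplified Python with made-up types and attribute names; in the real app these were Rails models and jsonb fragments):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Chapter:
    text: str                      # the chapter's fragment of the conversation
    user_choice: Optional[str] = None  # which continuation the user picked, if any

@dataclass
class Story:
    base_prompt: str               # the system prompt stored once per story
    chapters: List[Chapter] = field(default_factory=list)  # ordered by position

def build_context(story: Story):
    messages = [{"role": "system", "content": story.base_prompt}]
    for chapter in story.chapters:
        # each chapter contributes its fragment in order
        messages.append({"role": "assistant", "content": chapter.text})
        if chapter.user_choice:
            # the choice history records the user's picks between chapters
            messages.append({"role": "user", "content": chapter.user_choice})
    return messages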

We also reduced the size of our prompts overall. Smaller requests mean faster AI responses, since the model takes longer to process large contexts. Of course, when shortening prompts we made sure the quality of the generated stories didn't suffer. In fact, during testing Dmitry discovered that shorter, simpler prompts produced better stories, which is not obvious at first.

Should AI APIs be used in app development?

I've tried to give a comprehensive account of our experience with the ChatGPT API, covering the key problems and how we solved them. The real challenge, it seems, is that this technology is still very unstable. Many aspects simply haven't been battle-tested yet, and that's where most of our problems came from.

Furthermore, as you may have noticed, there's a recent trend where users claim that ChatGPT has become "dumber" and worse at providing answers. This isn't surprising, as the tool is actively being developed, and new versions are constantly released; as a result, something can go wrong. For those who use AI in development, this means their product becomes even more unstable. Additionally, OpenAI not only rapidly develops new versions but also quickly stops supporting older ones, which means developers can't choose a version and stick with it for several years.

Therefore, when developing such products, it's important to anticipate all the costs involved, understanding that in some cases, users may need refunds, and in others, developers might have to rewrite not only the prompts but also the API interaction service.

Overall, it seems that this technology has a bright future, and it's crucial to learn how to work with it now to remain competitive in the market.
