The transformer network, a class of neural network proposed by Google researchers in a 2017 paper, serves as the foundation for ChatGPT’s design. The network is designed to process sequential data such as text; when making predictions, it uses self-attention mechanisms to weigh the significance of different parts of the input.
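To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is an illustrative simplification (no multi-head projections, masking, or layer normalization), not ChatGPT’s actual implementation; the shapes and random weights are chosen only for the example.

```python
# Minimal sketch of scaled dot-product self-attention: each output position is a
# weighted mix of all input positions, with weights derived from query-key similarity.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v                               # attention-weighted combination

# Toy usage: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (4, 8)
```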
This article will cover the following topics:
- Introduction to ChatGPT.
- The Technical Principles of ChatGPT.
- Can ChatGPT replace traditional search engines like Google?
Introduction to ChatGPT

As an intelligent dialogue system, ChatGPT has exploded in popularity over the last few days, generating a lot of buzz in the tech community and inspiring many to share ChatGPT-related content and test examples online. The results are impressive. The last time I remember an AI technology causing such a sensation was when GPT-3 was released in the field of NLP, over two and a half years ago. Back then, the heyday of artificial intelligence was in full swing, but today it feels like a distant memory. In the multimodal domain, diffusion models such as DALL·E 2 and Stable Diffusion have been the popular AIGC models of the past half year. Today, the torch of AI has been passed to ChatGPT, which undoubtedly belongs to the AIGC category. So, in the current low period of AI after the bubble burst, AIGC is indeed a lifesaver for AI. Of course, we look forward to the soon-to-be-released GPT-4 and hope that OpenAI can continue to support the industry and bring a little warmth.
Let’s not dwell on examples of ChatGPT’s capabilities, as they are everywhere online. Instead, let’s talk about the technology behind ChatGPT and how it achieves such extraordinary results. Since ChatGPT is so powerful, can it replace existing search engines like Google? If so, why? If not, why not?
In this article, I will try to answer these questions from my own understanding. Please note that some of my opinions may be biased and should be taken with a grain of salt. Let’s first look at what ChatGPT has done to achieve such good results.
The Technical Principles of ChatGPT

In terms of overall technology, ChatGPT is based on the powerful GPT-3.5 large language model (LLM) and introduces reinforcement learning from human feedback (RLHF), i.e., “reinforcement learning + human-annotated data,” to continuously refine the pre-trained language model. The main goal is to teach the LLM to understand the intent behind human commands (such as writing a short essay, answering knowledge questions, or brainstorming different types of questions) and to judge which answers to a given prompt (user question) are high quality according to multiple criteria (such as being informative, rich in content, useful to the user, harmless, and free of discriminatory content).
Under the framework of “human-annotated data + reinforcement learning”, the ChatGPT training process can be divided into the following three stages:
ChatGPT: Step 1
The first stage trains a supervised policy model as a cold start. Although GPT-3.5 is strong, it has difficulty understanding the different intentions behind different types of human commands and judging whether generated content is of high quality. To give GPT-3.5 a preliminary understanding of the intentions behind commands, a batch of prompts (i.e., commands or questions) submitted by test users is randomly selected, and professional annotators write high-quality answers for the specified prompts. These manually annotated <prompt, answer> data are then used to fine-tune the GPT-3.5 model. Through this process, we can consider that GPT-3.5 has initially acquired the ability to understand the intentions contained in human prompts and to provide relatively high-quality answers based on those intentions. However, this is clearly not enough.
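The following is a minimal sketch of what this cold-start supervised fine-tuning could look like in code. The small public model "gpt2" and the single annotated pair are stand-ins for GPT-3.5 and the real annotated data; the point is only to illustrate fine-tuning on <prompt, answer> demonstrations with the standard language-modeling loss, not OpenAI’s actual pipeline.

```python
# Sketch of the cold-start supervised fine-tuning step: annotated <prompt, answer>
# pairs are concatenated and used to fine-tune a causal LM with next-token
# cross-entropy loss. Model name and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in for GPT-3.5
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

annotated_pairs = [  # hypothetical human-written demonstration
    ("Explain photosynthesis briefly.",
     "Photosynthesis is the process by which plants convert light into chemical energy ..."),
]

model.train()
for prompt, answer in annotated_pairs:
    batch = tokenizer(prompt + "\n" + answer, return_tensors="pt")
    # Labels equal to input_ids -> standard language-modeling loss over the sequence.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```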

ChatGPT: Step 2
The main goal of the second stage is to train a reward model (RM) using manually annotated training data. In this stage, a batch of prompts submitted by users is randomly sampled (mostly the same as those in the first stage), and the cold-start model fine-tuned in the first stage generates K different responses for each prompt, producing the data <prompt, answer1>, <prompt, answer2>, ..., <prompt, answerK>. The annotators then rank the K results according to various criteria (such as relevance, informativeness, harmfulness, etc.) and provide a sort order of the K results; this ordering is the manually annotated data for this stage.
Next, the ranked data is used to train the reward model with a pairwise learning-to-rank method. For the K ranked results, we combine them two by two to form $\binom{K}{2}$ pairs of training data. ChatGPT uses a pairwise loss to train the reward model: the RM takes an input <prompt, answer> and outputs a score evaluating the quality of the answer. For a training pair <answer1, answer2>, if answer1 is ranked ahead of answer2 in the manual ordering, the loss function encourages the RM to score <prompt, answer1> higher than <prompt, answer2>.
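As an illustration of this pairwise objective, the sketch below computes an InstructGPT-style ranking loss of the form $-\log\sigma(\text{score}_{w} - \text{score}_{l})$ averaged over all $\binom{K}{2}$ pairs. The scores here are dummy values standing in for RM outputs, since the RM itself is just any model mapping <prompt, answer> to a scalar.

```python
# Pairwise ranking loss for the reward model: for each pair where answer_w is
# ranked above answer_l by the annotator, push score(answer_w) above score(answer_l).
import torch
import torch.nn.functional as F
from itertools import combinations

def pairwise_rm_loss(scores_ranked):
    """scores_ranked: RM scores for K answers to one prompt, ordered best to worst."""
    losses = []
    for i, j in combinations(range(len(scores_ranked)), 2):   # all C(K, 2) pairs
        # -log sigmoid(score_winner - score_loser)
        losses.append(-F.logsigmoid(scores_ranked[i] - scores_ranked[j]))
    return torch.stack(losses).mean()

# Toy usage: K = 4 answers, scores from a hypothetical reward model.
scores = torch.tensor([2.1, 1.3, 0.4, -0.7], requires_grad=True)
loss = pairwise_rm_loss(scores)
loss.backward()
print(loss.item())
```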
In summary, in this stage the cold-start supervised policy model generates K results for each prompt, the results are manually ranked from best to worst, and the ranking is used as training data to train the reward model with the pairwise learning-to-rank method. For the trained RM, the input is <prompt, answer> and the output is a quality score for the answer: the higher the score, the higher the quality of the generated response.

ChatGPT: Step 3
In the third stage, reinforcement learning is used to improve the capability of the pre-trained model. This stage requires no manually annotated data; instead, the RM trained in the previous stage updates the parameters of the pre-trained model based on its scores. Specifically, a batch of new prompts is randomly sampled from those submitted by users (different from the prompts used in the first and second stages), and the PPO model’s parameters are initialized from the cold-start model. For each sampled prompt, the PPO model generates a response, and the RM trained in the previous stage provides a reward score evaluating the quality of that response. This reward is the overall score the RM assigns to the complete response (a sequence of words). With this final reward for the word sequence, each word can be treated as a time step and the reward is propagated backward through the sequence, producing a policy gradient that updates the PPO model’s parameters. This is the standard reinforcement-learning process, whose goal is to train the LLM to produce responses that earn high rewards from the RM, i.e., high-quality responses.
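To make the flow of this stage concrete, here is a deliberately simplified, REINFORCE-style sketch rather than full PPO (which adds clipped probability ratios, a value function, and a KL penalty against the cold-start model): the policy samples a response, the frozen RM returns one scalar reward for the whole sequence, and that reward scales the policy-gradient update. All objects (policy, reward_model, tokenizer) are hypothetical placeholders, not OpenAI’s actual components.

```python
# Simplified policy-gradient step driven by a sequence-level reward from a frozen RM.
import torch

def rl_step(policy, reward_model, tokenizer, prompt, optimizer):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Sample a response from the current policy (initialized from the cold-start model).
    full_ids = policy.generate(prompt_ids, do_sample=True, max_new_tokens=64)
    # One scalar reward for the complete <prompt, response> sequence
    # (reward_model is assumed to return a scalar tensor).
    reward = reward_model(full_ids).detach()
    # Log-probabilities of the generated tokens under the current policy.
    logits = policy(full_ids).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logprob = token_logprobs[:, prompt_ids.shape[1] - 1:].sum()
    # REINFORCE-style objective: raise the probability of high-reward responses.
    loss = -reward * response_logprob
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return reward.item()
```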
If we continue to iterate through the second and third stages, each iteration will obviously make the LLM more and more capable. The second stage improves the RM through manually annotated data, and in the third stage the improved RM scores responses to new prompts more accurately and uses reinforcement learning to push the LLM to learn new high-quality content. This plays a role similar to using pseudo-labels to expand the high-quality training data, so the LLM is further improved. Clearly, the second and third stages reinforce each other, so continuous iteration yields sustained improvement.
Despite this, I don’t think the use of reinforcement learning in the third stage is the main reason the ChatGPT model works so well. Suppose that in the third stage, instead of reinforcement learning, the following method were used: similar to the second stage, for a new prompt the cold-start model generates K responses, each is scored by the RM, and we take the highest-scoring answer to form a new training example <prompt, answer> for fine-tuning the LLM. I think the effect of this mode could be comparable to reinforcement learning; it is less sophisticated, but the results might not be much worse. Whatever technique is adopted in the third stage, it essentially uses the RM learned in the second stage to expand the LLM’s high-quality training data.
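Here is a minimal sketch of this best-of-K alternative, with the cold-start model and the RM represented by placeholder functions supplied by the caller:

```python
# Best-of-K selection: generate K candidates, keep the one the RM scores highest,
# and treat it as a new supervised training example for the LLM.
import random

def best_of_k_sample(generate_fn, reward_fn, prompt, k=8):
    """generate_fn: prompt -> answer (cold-start model); reward_fn: (prompt, answer) -> score (RM)."""
    candidates = [generate_fn(prompt) for _ in range(k)]
    best = max(candidates, key=lambda answer: reward_fn(prompt, answer))
    return prompt, best   # one new <prompt, answer> example for fine-tuning

# Toy usage with stand-in functions (purely illustrative).
toy_generate = lambda prompt: f"answer draft {random.randint(0, 9)}" * random.randint(1, 3)
toy_reward = lambda prompt, answer: len(answer)   # dummy "quality" score
print(best_of_k_sample(toy_generate, toy_reward, "Explain RLHF briefly.", k=4))
```

The resulting pairs would then be used to fine-tune the LLM exactly as in the first stage, effectively using the RM as a filter to expand the high-quality training data.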
The above is the training process of ChatGPT, which essentially follows InstructGPT. ChatGPT is an improved InstructGPT, and the improvements mainly lie in how the annotated data is collected; in other respects, including the model structure and the training process, it basically follows InstructGPT. It is foreseeable that this technique of reinforcement learning from human feedback will quickly spread to other content-generation directions, such as the easily imagined “machine translation model based on reinforcement learning from human feedback,” and many others. However, I personally think that adopting this technique in a specific subfield of NLP content generation may not be very meaningful, because ChatGPT itself can handle a wide variety of tasks and already covers many NLP generation subfields; if a single subfield adopts this technique again, it does not add much value, since its feasibility has already been verified by ChatGPT. If the technique is applied to the generation of other modalities, such as images, audio, and video, it is a direction more worth exploring, and we may soon see work like “An XXX Diffusion Model Based on Reinforcement Learning from Human Feedback,” which should still be very significant.

Can ChatGPT Replace Traditional Search Engines Like Google?

Given that ChatGPT seems able to answer almost any kind of prompt, it is natural to wonder: can ChatGPT, or a future version such as GPT-4, replace traditional search engines like Google? I personally think it cannot at the moment, but with some technical modifications it might, in theory, be able to replace traditional search engines.
There are three main reasons why the current form of ChatGPT cannot replace search engines. First, for many types of knowledge-related questions, ChatGPT will give answers that appear reasonable but are actually incorrect. ChatGPT’s answers look well thought out, and users without expertise in the topic can easily believe them. Yet considering that it does answer many questions well, this is confusing for users: if I don’t know the correct answer to my question, should I trust ChatGPT’s result or not? At that point, you cannot make a judgment. This problem may be fatal.
Secondly, the current mode of ChatGPT, a large GPT model further trained with annotated data, is not friendly to the absorption of new knowledge by LLMs. New knowledge emerges constantly, and it is unrealistic to re-train the GPT model every time a new piece of knowledge appears, whether in terms of training time or cost. Fine-tuning on new knowledge seems feasible and relatively cheap, but introducing new data this way easily causes catastrophic forgetting of the original knowledge, and frequent short-term fine-tuning makes the problem worse. Therefore, how to integrate new knowledge into the LLM in near real time is a very challenging problem.
Thirdly, the training cost and online inference cost of ChatGPT or GPT-4 are too high. Facing a real search engine’s millions of user requests, OpenAI could not bear the cost if the service remained free; but if it charged, the user base would shrink dramatically. Whether to charge is a dilemma, although if the training and serving costs could be greatly reduced, the dilemma would resolve itself. These three reasons are why ChatGPT cannot replace traditional search engines at present.
Can these problems be solved? Actually, if we take the technical route of ChatGPT as the main framework and absorb some of the techniques already used by other dialogue systems, ChatGPT can be modified from a technical perspective: apart from the cost issue, the first two problems above can be solved fairly well. We only need to introduce into ChatGPT the Sparrow system’s ability to display evidence for generated results based on retrieval results, and the retrieval mode that the LaMDA system uses to bring in new knowledge. Then the timely introduction of new knowledge and the credibility verification of generated content are no longer major problems.
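As a rough illustration of what such a retrieval-grounded modification could look like (this is my own sketch, not Sparrow’s or LaMDA’s actual design), the answer is generated from retrieved evidence that can also be shown to the user:

```python
# Sketch of retrieval-grounded answering: retrieve documents for the question,
# present them as numbered evidence in the prompt, and ask the LLM to answer
# only from that evidence and cite it. retrieve_fn and llm_generate_fn are
# placeholders for a real retriever and a real LLM API.
def answer_with_evidence(question, retrieve_fn, llm_generate_fn, top_k=3):
    docs = retrieve_fn(question)[:top_k]                      # fresh external knowledge
    evidence = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (
        "Answer the question using only the evidence below, "
        "and cite the evidence numbers you used.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate_fn(prompt)

# Toy usage with stand-in functions (purely illustrative).
toy_retrieve = lambda q: ["Document A about the topic.", "Document B with a recent fact."]
toy_llm = lambda p: "According to [2], ... (generated answer)"
print(answer_with_evidence("What changed this year?", toy_retrieve, toy_llm))
```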