If you passed Python 101, you're probably a better programmer than OpenAI's Codex prototype • The Register
OpenAI has warned that its Codex neural network, a version of which powers GitHub Copilot's code-completion tool, is likely to generate source code that looks plausible but is incorrect, and some of its shortcomings grow worse as the model scales up.
The artificial intelligence lab revealed the shortcomings and limitations of non-production versions of its Codex model in a preprint paper this week. It should be noted that a separate production variant of the system powers GitHub Copilot; the preliminary models discussed in the paper are smaller and trained only on Python, while Copilot has been trained on more data and supports code completion for a range of programming languages.
Yet GitHub Copilot suffers from issues similar to those of the Codex prototypes. Namely, the generated code is unlikely to be correct and useful to developers on the first attempt, and it tends to give answers that seem reasonable at first glance but may be wrong. Programmers should carefully check the auto-written code for errors.
The Codex language model in the paper was fine-tuned on 159 GB of Python source code pulled from over 50 million public repositories on GitHub; each Python file used contained fewer than a thousand lines of code. To test the model's AI pair-programming skills, the researchers posed 164 hand-written programming problems that probed Codex's ability to generate working functions, understand simple algorithms, and solve mathematical queries.
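Each such problem pairs a function signature and docstring, which the model sees, with hidden unit tests that judge the completion. A minimal sketch of what a task of that shape looks like — this example is ours for illustration, not one of the 164 actual problems:

```python
# A benchmark-style task: the model is shown the signature and docstring
# and must produce the body; a hidden test harness then scores it.
# This problem is illustrative only, not taken from the real benchmark.

def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

def check(candidate):
    # Hidden tests of the kind used to mark a completion correct.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]

check(running_max)
```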
The most powerful version of the system, with 12 billion parameters, was able to solve 28.8 per cent of the problems on its first attempt. For comparison, OpenAI's GPT-3 natural-language system could not solve any of them.
Codex works best, however, when it is given the opportunity to generate more responses. Allowed ten attempts per problem, it produced a correct answer 46.81 per cent of the time, and over 100 attempts that figure climbed to 72.31 per cent.
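Those one-, ten- and hundred-attempt figures are the paper's pass@k metric: the probability that at least one of k generated samples passes the unit tests. The paper estimates it without bias from n samples of which c are correct; a small sketch of that calculation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the Codex paper: the chance
    that at least one of k samples, drawn from n generated samples
    of which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct sample out of 2, a single draw succeeds half the time.
print(pass_at_k(2, 1, 1))  # 0.5
```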
In other words, it is up to human programmers, or perhaps other tools, to do the job of selecting the best Codex suggestion. "This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample, which may not be possible or practical during deployment," the paper said.
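One such heuristic the paper explored is ranking candidates by the mean per-token log-probability the model assigned them, and keeping the one it was most confident about. A minimal sketch of that selection step, with made-up candidate data:

```python
def rank_by_mean_logprob(samples):
    """Pick the candidate whose tokens the model was, on average,
    most confident about. `samples` is a list of
    (code, per_token_logprobs) pairs; numbers here are invented."""
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / len(logprobs)
    return max(samples, key=mean_logprob)[0]

# Two hypothetical completions with illustrative log-probabilities.
candidates = [
    ("return n * 2", [-0.9, -1.4, -0.7]),   # mean -1.0
    ("return n + n", [-0.2, -0.3, -0.1]),   # mean -0.2, ranked first
]
print(rank_by_mean_logprob(candidates))  # return n + n
```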
GitHub Copilot appears to work slightly better: it can generate correct code 43 per cent of the time on the first try, and 57 per cent of the time when allowed ten attempts. If you were worried that these code-completion tools could replace human programmers, don't be: they pose no real threat at this time and are just simple pattern-matching machines, good for generating boilerplate code and the like.
The human touch
The researchers admitted that "a strong student who completes an introductory computer science course should be able to solve a larger fraction of problems than Codex", even though the model has seen more code than a professional developer will ever see in their lifetime.
Codex tends to reproduce the common coding patterns it was trained on: if you write something that looks familiar, it will fill in the blanks with what it thinks should come next, though the generated code is often not quite correct. If you are writing something more specialized for a particular application, or more complex than most scripts, Codex won't be as useful.
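A contrived illustration of that failure mode — this snippet is ours, not actual Codex output — shows how a completion can look reasonable at a glance yet be subtly wrong:

```python
def mean(values):
    # Looks plausible, but `//` is floor division in Python 3, so the
    # result is silently truncated -- the kind of subtle bug the paper
    # warns can lurk in auto-generated code.
    return sum(values) // len(values)

print(mean([1, 2]))            # 1 -- but the true mean is 1.5
print(sum([1, 2]) / len([1, 2]))  # 1.5, the correct value
```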
“We find that Codex may recommend syntactically incorrect or undefined code, and may invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. In addition, Codex finds it difficult to analyze increasingly long and higher-level or system-level specifications,” the paper states.
These problems only get worse as the models get bigger and more powerful, the paper said. Unfortunately, Codex is also only as good as the programmer prompting it: if you give it prompts that contain subtle bugs, it “will tend to produce worse code than it is capable of. This persists when the prompt also includes instructions to write the correct code. This gap increases with the size of the model,” the researchers wrote.
They also cautioned that, like other language models, Codex can be tricked into generating “racist, disparaging, and otherwise harmful output” in the form of code comments, and that gender and race biases were also observed in generated code structures. To prevent harm in the real world, GitHub Copilot ships with filters that automatically block offensive words, so it shouldn't spit out toxic outputs.
It's not all doom and gloom, mind you. OpenAI said it wants to focus on the potential positive impacts the tool could have, such as making programmers more productive, or encouraging them to document their code better for others to read. ®