From the course bibliography · Critical thinking and the AI-native graduate
ChatGPT Put to the Test Against Students - With Concerning Results
Thomas Westerholm · Newsweek · Apr 10, 2026
Graduate students at Harvard outperformed a ChatGPT model by more than two letter grades in a study conducted by researchers at the university.
The researchers expected that OpenAI's chatbot would "perform similarly to doctoral students on lower cognitive levels," hypothesizing that ChatGPT would be able to memorize materials sufficiently while struggling with critical thinking problems.
However, the students significantly outperformed ChatGPT because the model struggled with "remember" and "apply" tasks, although the researchers were able to improve ChatGPT's performance with prompts.
"We wanted to ensure that the tools we were using to measure learning were still meaningful in the era of genAI," study author John Peters told Newsweek in an email.
"[...] We thought genAI would be better at the lower levels and worse at the higher levels, but that's not what we found. GenAI struggled to 'apply' concepts to experimental design questions."
The researchers conducted the study with students from Harvard's Principles of Molecular Biology course, a 200-level class that spans the full semester.
Over the course of the study, the students were expected to maintain a minimum grade of 80 percent, which is a passing grade for doctoral students.
The AI's responses, meanwhile, were produced using GPT-4o, which was released by OpenAI in May 2024.
To make sure that the students didn't use artificial intelligence themselves, the researchers took out-of-class assignments from 2022, before AI was widely available and adopted.
Peters noted that GPT-4o was the latest model at the time of the study. He has since tried the newest available model and said that, "anecdotally, it has dramatically improved at describing and interpreting images."
"We do think that a general weakness of large language models is their ability to perform multi-step, compositional thinking, so I would predict that even newer models would still struggle somewhat with the 'apply' level of Bloom's Taxonomy," he said.
The Findings
Doctoral students outperformed ChatGPT at every level.
The chatbot did well, but significantly worse than students, on "remember" questions. The researchers noted that the questions are not meant to be challenging, but are intended to encourage students to summarize techniques.
Students outperformed ChatGPT 98 percent to 82 percent.
Meanwhile, students outperformed ChatGPT significantly on long-answer design questions. The students also outperformed ChatGPT on fill-in-the-blank questions.
ChatGPT was particularly poor at "understand," "apply" and "analyze" questions, where it earned a 66 percent average, compared to 87 percent by the doctoral students.
ChatGPT would have "failed," and according to the researchers, the poor results were "largely driven by the algorithm's markedly poor performance on the 'apply' level, which refers to identifying, rationalizing and describing experimental controls that students had previously learned through their coursework."
'Good teaching is still good teaching'
Peters said students have mixed reactions to AI: some are excited, some use it as an editor, and some have "strong objections" to its use.
"In some ways, this is fairly representative of society as a whole," he said.
However, Peters believes prohibition of the technology is a "fool's errand," since reliable detectors don't exist.
"I think the key is to design assessments that are robust to AI," he said. "That might mean lowering the stakes of homework assignments so the motivation to use AI is decreased.
"I have been using oral exams in my teaching, and I have really enjoyed the chance to get to know my students better, while also ensuring that their learning is their own, not AI's."
His biggest takeaway, however, is that "'good' teaching is still good teaching."
Newsweek has reached out to OpenAI for comment via email.