
Special Report: Artificial Intelligence

Is It Ethical to Use AI to Grade?

By Sarah Schwartz — February 14, 2025

By the time English teacher Heather Van Otterloo gets about halfway through marking the stack of essays that her middle school students turn in for any one of the dozens of writing assignments she gives each year, she knows it’s about to hit her: grading fatigue.

She might skip leaving a comment that she would have taken the time to write 17 papers ago or be a bit more lenient with her overall evaluation of a piece. In past years, the hours of out-of-school work that she knew she would have to commit to grading limited the number of writing assignments she gave out.

But this year, Van Otterloo, who teaches 6th, 7th, and 8th graders at South Middle School in the Joplin district in Missouri, has found a way to give more frequent writing practice, and offer students more opportunities to revise their work, all while spending about half the time offering feedback. Artificial intelligence is picking up the slack.

Van Otterloo is part of an emerging group of middle and high school teachers using generative AI to help give feedback on—and, in some cases, score—students’ written work. Some educators have harnessed tools built into learning-management systems and writing platforms; others are loading instructions into open-access models like ChatGPT and asking the AI to evaluate student essays against their criteria.
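For teachers going the do-it-yourself route, the setup usually amounts to pairing a written rubric with each essay in a single prompt. The sketch below is a hypothetical illustration of that pattern, assuming the OpenAI Python client; the rubric text, model name, and function are placeholders, not any particular teacher’s or product’s setup.

```python
# Hypothetical sketch: sending a teacher's rubric and a student essay to a
# general-purpose model for feedback. Assumes the OpenAI Python client;
# the rubric, model name, and essay are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

RUBRIC = """Evaluate the essay on a 1-4 scale for each criterion:
- Claim: states a clear, arguable thesis.
- Evidence: supports the claim with relevant details or quotations.
- Organization: introduction, body paragraphs, and conclusion in a logical order.
Return the scores plus two or three sentences of constructive feedback."""

def get_ai_feedback(essay_text: str) -> str:
    """Ask the model to evaluate one essay against the rubric above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a middle school writing coach."},
            {"role": "user", "content": f"{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
    )
    return response.choices[0].message.content

# A teacher would still review this output before sharing it with a student.
```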

Teachers who take this approach say it solves an age-old dilemma: Giving kids lots of feedback helps them improve their writing, but providing it on a regular basis is nearly impossible for secondary teachers who have upward of 150 students.

But researchers have found that AI can be biased against certain racial and ethnic groups when evaluating writing. And best-practice guides for using AI in education say there should always be a “human in the loop” when grading student work—a teacher reviewing the decisions that AI makes and offering the final verdict.

Van Otterloo said that’s the case in her classroom. She always looks over the AI tool’s comments before handing them over to students and said she tailors feedback to “the kid that I know.”

Even with those safeguards, though, some experts warn that asking computers to judge subjective questions like the strength of an argument or the persuasiveness of an emotional appeal could have unintended consequences. It’s possible, they say, that AI’s understanding of good writing could end up shaping teachers’ perceptions, instead of the other way around.

“It’s easy to use these tools,” said Matt Johnson, a principal research director in the research and development division at ETS, an educational testing organization. “But it’s really important for people to understand not only their strengths but their limitations.”

How teachers use AI grading assistants

About a third of teachers use AI tools, according to a fall 2024 EdWeek Research Center survey. But the fraction of those who use them for grading, specifically, is smaller.

Of those teachers who use AI, 13 percent said they used it to grade low-stakes assignments; 3 percent said they used it to grade high-stakes assignments.

Still, it’s possible AI grading could become more common. Digital writing platforms such as NoRedInk and HMH’s Writable offer their own AI-powered grading assistants, and other AI tools developed for teachers include similar features.

“My assumption is that … increasingly, curriculum providers will incorporate more bespoke AI tools and features,” said Karim Meghji, the chief product officer at Code.org, a nonprofit that seeks to expand computer science education.

TeachAI, an initiative staffed and operated by Code.org that aims to help schools determine the role the technology should play in K-12 education, has put forth some suggestions about how to use grading capabilities. The main takeaway: Human educators should always have the final say on evaluations of student work, even if AI is involved in the process.

Teachers have also put in place their own, similar ethical guardrails.

“I don’t like to use it whole cloth as, the kids submit something, AI grades it, and then I post the grade in the grade book. I never do that—that would be careless,” said Chad Hemmelgarn, a high school English teacher in the Bexley city schools in Ohio.

Instead, Hemmelgarn said, he calls on AI earlier in the writing process, in order to provide students faster feedback as they work on their essays. He runs early drafts through AI, then returns them to students, who use the comments to revise before turning in the final copy.

“AI does that surface-level editing so well,” he said, referencing its ability to catch grammar and spelling problems. “That saves me time to give them more intense, personalized feedback.”


Other teachers do use the technology for grading final products but still build in a human check.

Jen Roberts, a 9th and 12th grade English teacher at Point Loma High School in the San Diego Unified district, said she sometimes uses the AI tool Brisk Teaching to evaluate short pieces of student writing that are graded on a simple scale, like a practice attempt at writing analytical paragraphs.

She first reads through the paragraph and assigns a score, often on a scale of 1-4. Then she asks AI to score the paragraph using the same grading criteria—and, importantly, offer feedback.

“As an English teacher with 180 students, and your students all turn [work] in, and you give up three weekends to grade it, it’s not the grading that takes time. It’s the feedback,” Roberts said.

If Roberts has independently decided the paragraph warrants a 4, the highest score, she’ll copy and paste the positive comments AI has made into the student’s work. “The student gets a few nicely put explanations of what they did well,” she said.

But if Roberts gives the paragraph a 2 or 3, she grabs AI’s constructive criticism instead.
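Roberts’ selection step boils down to a simple rule: her own independent score decides which slice of the AI’s comments the student sees. A toy sketch of that rule follows; the names and example text are made up, not drawn from Brisk Teaching or any other tool.

```python
# Toy sketch of the human-in-the-loop workflow described above: the teacher
# scores first, then chooses which part of the AI's feedback to pass along.
# Names and example values are illustrative only.
from dataclasses import dataclass

@dataclass
class AIFeedback:
    praise: str        # what the AI says the student did well
    suggestions: str   # the AI's constructive criticism

def feedback_to_share(teacher_score: int, ai: AIFeedback) -> str:
    """Pick which AI comments to copy into the student's work,
    based on the 1-4 score the teacher assigned independently."""
    if teacher_score == 4:
        return ai.praise
    return ai.suggestions

# Example: the teacher gave the paragraph a 3, so the student
# receives the AI's suggestions rather than only praise.
print(feedback_to_share(
    teacher_score=3,
    ai=AIFeedback(
        praise="Clear topic sentence and a strong closing line.",
        suggestions="Tie the second quotation back to your claim.",
    ),
))
```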

The process can cut her grading time by as much as 70 percent to 80 percent, she said, depending on the assignment. It’s the difference between students getting practice work back in one week instead of three, she added.

Where AI evaluations of student writing fall short

Still, Roberts is quick to say AI isn’t perfect. There remains a need for human review.

The tool she uses doesn’t always catch redundancies, like if a student uses the same text evidence to support two different points, she said. It also isn’t always aware of large organizational mistakes, like starting an essay with a body paragraph instead of an introduction.

More broadly, AI is missing some of the more intangible skills that a good teacher possesses, said Van Otterloo, the Missouri teacher. It doesn’t know individual students’ strengths and weaknesses, or what skills they’re working on mastering.

Sometimes, Van Otterloo’s middle schoolers get off topic or don’t make a transition explicitly. She knows why they might, for instance, bring up the phrase “Chicago-style” in a paragraph about whether ketchup belongs on hot dogs, but the AI thinks it’s a non sequitur. “I understand, so then, I have to teach them [how to make] the connection,” she said.

Research shows that, in general, AI feedback on written work is of middling quality, said Tamara Tate, the associate director of the Digital Learning Lab at the University of California, Irvine. AI likely wouldn’t be as accurate or insightful as veteran teachers, but it might go toe-to-toe with a newer, less experienced educator, she said.

And what AI feedback lacks in quality, it might make up for in quantity.

“We know that teachers aren’t giving students more than a paragraph or two [of comments] as it is,” Tate said. If AI feedback could motivate teachers to assign more writing, and students to do more revision, “I think that’s a win,” she said. “You’re teaching habits that are more important than any particular piece of feedback.”

Other research, though, has shown that asking the technology to evaluate student work for a grade—not just offer constructive criticism—can lead to some unwanted outcomes.

Johnson, the ETS researcher, and his colleague Mo Zhang fed ChatGPT a collection of more than 13,000 student essays that had also been graded by expert raters. They found that, on average, ChatGPT scored white, Black, and Hispanic students’ writing lower than the expert raters did.

The gap was largest for Asian American students, whose writing ChatGPT scored an average of 1.1 points lower, the biggest penalty for any racial or ethnic group.

It’s hard to know why this difference exists, said Johnson. “We don’t really know what the AI is doing to do the scoring,” he said.

But the finding should prompt teachers, or any educators using these systems for evaluation, to scrutinize their output, said Zhang: “Are they fair for different subgroups of students and are they accurate?”
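One concrete way to act on that advice is to compare the AI’s scores with human ratings and look at the average gap for each group of students. The snippet below is a toy illustration of that idea with invented numbers; it is not the methodology the ETS researchers used.

```python
# Toy fairness check: compare AI scores to human ratings by student subgroup.
# The data and group labels here are invented, purely for illustration.
from collections import defaultdict

# (human_score, ai_score, group) for each essay
scored_essays = [
    (4.0, 3.6, "group_a"),
    (3.0, 2.7, "group_a"),
    (4.0, 2.9, "group_b"),
    (3.5, 2.5, "group_b"),
]

gaps = defaultdict(list)
for human_score, ai_score, group in scored_essays:
    gaps[group].append(ai_score - human_score)  # negative = AI scored lower

for group, diffs in gaps.items():
    mean_gap = sum(diffs) / len(diffs)
    print(f"{group}: mean AI-minus-human gap = {mean_gap:+.2f}")
```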

Roberts acknowledges that AI technology could have biases, but teachers do, too, she said. AI could help mitigate that human error.

“Let’s not pretend that teachers don’t have favorite students, or students they feel sorry for, or students who they know can do it but just didn’t do it on this paper,” said Roberts.

“There are times when the AI is wrong. But there are also times where I am wrong, and I have to admit that,” she said.

Can AI grading tools understand the essence of writing?

But for some, the idea that AI could color teachers’ professional judgment or cause them to second-guess their initial evaluation is troubling.

If AI is doing a first pass on student work, said Johnson, “does it also drive language to align more with what AI likes, rather than what humans like?”

The hypothetical cuts to the core of why schools teach writing at all and what they want their students to be able to do, Tate said.

She posed the question: If students aren’t writing for a human audience, why are they writing?

Initially, some of Van Otterloo’s students felt she was taking an unfair shortcut in using AI to grade their papers. “I feel like they had a bit of a, ‘you’re kind of lazy,’ feeling,” she said.

It’s not an uncommon response from students—several teenagers said they felt it was unethical for teachers to use the technology to assess their work, when students themselves are often barred from consulting AI during the writing process.

Van Otterloo doesn’t consider it cheating for her students to use AI to help them research or organize their ideas, though. And she said once students saw the quantity of feedback the tool provided, and the speed at which they received it, they didn’t raise further concerns about her use of AI.

Still, the idea that the essence of writing is in connecting with the reader, human-to-human, shapes how some teachers think about when, and when not, to outsource responding to students’ ideas.

Hemmelgarn, the Ohio high school teacher, said it would be shoddy teaching work to just slap an AI score in the grade book for a student essay without a second look. But beyond that, he said, abdicating the responsibility of reading and considering kids’ writing would betray a deeper level of trust that students give him every time they turn something in.

“Especially if it’s a big writing project, the student has really put themselves out there,” Hemmelgarn said. “Nobody wants to be that vulnerable, and we ask kids to do that every day. I’m going to take that very seriously.”


Coverage of education technology is supported in part by a grant from the Siegel Family Endowment. Education Week retains sole editorial control over the content of this coverage.
