Can AI Kill the Achievement Gap in Education?

Martin Hall

The “achievement gap” is a persistent and pernicious feature of unequal access to education at every level, from early schooling to higher education. Many studies, across differing education systems, have shown that students’ learning outcomes are affected by levels of economic privilege. Learning outcomes are also mediated by specific legacies of identity and cultural status: race and ethnicity in the United States; immigrant communities across the European Union; the long shadow of apartheid in South Africa. The complex ways in which these factors interact have been known and studied since the publication of The Black-White Test Score Gap 25 years ago. But effective solutions have been elusive, haunted by deeply rooted biases in curriculum content and assessment systems.

The challenges of the achievement gap came to mind during Professor David Lefevre’s presentation in a recent webinar hosted by Harvard Business Publishing, “Will AI Replace the Educator”. David Lefevre is Professor of Practice at Imperial College Business School, London, where he founded the Edtech Lab in 2004 to develop the uses of new digital technologies in education. He summarises his overall objective as building “precision education”: deploying technology to provide students with personalised learning journeys that target their specific needs with maximum efficiency. The result would be analytics that identify precisely the course content that best addresses a given learner’s needs, as well as the optimal format for delivering that content – exactly the kind of precision needed to take interventions against the achievement gap to the next level.

One of the reasons that solutions for achievement gaps are so elusive is the spaghetti-like complexity of the interactions between students’ socialisation and circumstances, educators’ backgrounds and assumptions, and the legacies and conventions of education systems: classic case studies of complex systems. One way in is to focus on the core work of assessment and feedback, so well conceptualised by Dylan Wiliam:

The teacher’s job is not to transmit knowledge, nor to facilitate learning. It is to engineer effective learning environments for the students. The key features of effective learning environments are that they create student engagement and allow teachers, learners, and their peers to ensure that the learning is proceeding in the intended direction. The only way we can do this is through assessment. That is why assessment is, indeed, the bridge between teaching and learning.

An appropriately designed form of assessment can provide detailed evidence about individual student achievement at a precise point in their curriculum and “can be used by teachers, learners, or their peers to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions they would have made in the absence of that evidence.”

This is where generative AI enters the picture.

However carefully designed and moderated, most human-mediated assessments that form the core of education systems have an inherently subjective element, in which the experience and socialisation of the assessor come into play alongside institutional traditions and customs. These factors are accentuated at scale, because good human assessment requires time and consideration, both of which are at odds with high workloads and tight deadlines. In contrast, careful prompt engineering that focuses on the key competences being tested has the potential to identify and mitigate the cultural filters that contribute to the achievement gap, and to do so rapidly and at scale. In this role, generative AI would function as both a “guide on the side” and a “dynamic assessor”, two of the ten potential use cases identified by UNESCO in their recent guide to AI in education.
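As a sketch of what such prompt engineering might look like, the snippet below assembles a grading prompt that names the competences explicitly and instructs the model to disregard surface features that can encode cultural bias. Everything here (the rubric items, the wording, the 0–5 scale) is a hypothetical illustration, not a tested instrument:

```python
# Minimal sketch of a competence-focused grading prompt.
# The competences, scale and wording below are illustrative assumptions.

COMPETENCES = [
    "identifies the main argument of the source text",
    "supports claims with relevant evidence",
    "draws a logically valid conclusion",
]

def build_grading_prompt(student_answer: str) -> str:
    """Assemble a prompt that asks for scores only against the named
    competences, explicitly excluding style and dialect from grading."""
    rubric = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(COMPETENCES))
    return (
        "You are grading a student answer. Score each competence below "
        "from 0 to 5 and give a one-sentence justification for each score.\n"
        "Do NOT let spelling, dialect, or writing style affect the scores; "
        "assess only the competences listed.\n\n"
        f"Competences:\n{rubric}\n\n"
        f"Student answer:\n{student_answer}\n"
    )

print(build_grading_prompt("The author argues that access shapes outcomes..."))
```

The point of the explicit exclusion clause is that the cultural filters discussed above often operate through exactly these surface features; naming them in the prompt makes the intended competences, rather than the student’s register, the object of assessment.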

How can we be sure that an AI assessor is at least as good as a fully qualified human? Best practice in conventional assessment is for two assessors to grade anonymised student work independently, with third-party moderation if the two independent grades fall outside a defined range of variance. Given this, the minimum requirement for automated assessment is that, for a representative sample of students, the AI assessments consistently fall within the variance that is allowed between two human assessors.
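This minimum requirement can be made concrete with a simple check. In the sketch below, the grades and the moderation tolerance of five marks are illustrative assumptions; a real study would substitute the institution’s own moderation rules:

```python
# Sketch of the acceptance check described above: for a sample of
# scripts, each AI grade must fall within the band spanned by the two
# independent human grades, widened by the moderation tolerance.

TOLERANCE = 5  # max allowed divergence before moderation (assumed value)

def within_human_variance(human_a, human_b, ai):
    """True if every AI grade lies inside the band spanned by the two
    human grades, widened by the moderation tolerance."""
    for a, b, x in zip(human_a, human_b, ai):
        lo, hi = min(a, b) - TOLERANCE, max(a, b) + TOLERANCE
        if not (lo <= x <= hi):
            return False
    return True

# Illustrative grades for four scripts from two human markers and an AI.
human_a = [62, 71, 48, 85]
human_b = [65, 68, 52, 80]
ai      = [64, 70, 50, 83]
print(within_human_variance(human_a, human_b, ai))  # True for this sample
```

The design choice here is to treat the two human grades plus the moderation tolerance as the benchmark envelope, so the AI is held to exactly the standard already accepted for a second human marker.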

Significant progress towards this standard has already been reported in a number of studies. For example, Google’s PaLM 2 has been applied to grading both multiple-choice and long-form examinations in medicine, and has been rigorously tested through comparison with conventional grading by physicians and through human moderation of AI outcomes. While more work is needed to meet the standards required in medicine, it is clear that this will soon be achieved, allowing this use of AI to be fully integrated into medical education.

Deploying these new and emerging technologies will not, in itself, resolve the complex issues that cause unequal access at all levels of education. But their use will shift the dial in identifying and correcting entrenched attitudes, practices and policies that prevent so many people, across all levels of education, from achieving their potential.

Jencks, C. and M. Phillips, Eds. (1998). The Black-White Test Score Gap. Washington, DC, Brookings Institution Press.

Singhal, K., T. Tu, J. Gottweis, et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv preprint.

UNESCO (2023). ChatGPT and Artificial Intelligence in Education: Quick Start Guide. Paris, UNESCO.

Wiliam, D. (2018). Embedded Formative Assessment (the new art and science of teaching). Bloomington, Solution Tree Press.
