AI, personalised tutoring and the 2 sigma problem

Martin Hall

“We’re at the cusp of using AI for probably the biggest positive transformation that education has ever seen.” Sal Khan

Sal Khan was speaking in Vancouver in April this year and, since then, Khan Academy has been partnering with public schools in New Jersey to provide AI in classrooms. Studies in South Africa estimate that, in the 20% of schools with the largest classes – schools that serve a majority of the population – teachers have classes of around 80 students, making any form of personalised guidance impossible. A study by the Institute of Education, University College London, found that secondary school teachers in England work an average of more than 52 hours a week, of which more than 32 hours – 62% – are spent on non-teaching tasks. Here, as in healthcare, AI will take over a wide range of administrative tasks, freeing teachers to teach.

The Holy Grail of generative AI in schooling is personalised tutoring; Benjamin Bloom’s “2 sigma problem”, formulated in 1984 and long unresolved.

In their compelling mix of computer science and science fiction, Kai-Fu Lee and Chen Qiufan imagine 2041 and a school in which every student has an automated personal tutor:

What’s the chance of humans being able to form relationships with sophisticated AI companions within twenty years? For children, there is no doubt it can happen. Children already have a universal tendency to anthropomorphize toys, pets, and even imaginary friends. This is a phenomenal opportunity to design AI companions that can help children learn in a personalized way, and practice creativity, communications, and compassion—critical skills for the era of AI. AI companions that can speak, hear, and understand like humans could make a dramatic difference in a child’s development.

What is the case for making personalised tutoring available for all, whatever their circumstances?

Bloom, Professor of Education at the University of Chicago, had developed the principles of “mastery learning”: the requirement that a student must demonstrate a comprehensive understanding of a concept before moving on. In the early 1980s, with his team of graduate students, Bloom set up controlled studies that enabled statistical comparisons between three modes of teaching: a conventionally taught class of about 30, in which the teacher presented and the students asked questions; a class of 30 in which the teacher followed the principles of mastery learning; and small-group and individual tutoring, also following the principles of mastery learning.

The differences in learning outcomes for the three teaching approaches were striking. Using conventional classroom teaching as the control, Bloom noted that the average student taught in a class of 30 using the techniques of mastery learning was about one standard deviation above the average, while students who were individually tutored using the principles of mastery learning achieved learning outcomes about two standard deviations – two sigmas – above the average outcomes for a conventionally taught class. This meant that about 90% of the individually tutored students reached a level of attainment achieved by only the top 20% of the students taught in the conventional classroom. But at the same time, rolling out individualised tutoring would be highly demanding of teachers’ time, and prohibitively expensive:
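What “two sigma” means in practice can be made concrete with a short calculation. The sketch below uses Python’s standard library and an idealised normal distribution of attainment; the distributions and cutoffs are illustrative assumptions, not Bloom’s actual data:

```python
from statistics import NormalDist

# Idealised model: conventional-class attainment as a standard normal.
conventional = NormalDist(mu=0.0, sigma=1.0)

# A score two standard deviations above the conventional mean sits at
# roughly the 98th percentile of the conventional class.
percentile = conventional.cdf(2.0)
print(f"Two sigma above the mean: {percentile:.1%}")  # ~97.7%

# If tutoring shifts the whole distribution up by two sigma, the share of
# tutored students clearing the conventional top-20% cutoff is:
cutoff = conventional.inv_cdf(0.80)               # attainment of the top 20%
tutored = NormalDist(mu=2.0, sigma=1.0)
share = 1.0 - tutored.cdf(cutoff)
print(f"Tutored students above that cutoff: {share:.0%}")  # ~88%
```

Under these assumptions, almost nine in ten tutored students exceed a level that only the top fifth of a conventionally taught class reaches, consistent with Bloom’s reported figure of about 90%.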

The tutoring process demonstrates that most of the students do have the potential to reach this high level of learning. I believe that an important task of research and instruction is to seek ways of accomplishing this under more practical and realistic conditions than one-to-one tutoring, which is too costly for most societies to bear on a large scale. This is the ‘2 sigma’ problem.

Bloom’s point about cost and scalability is crucial. The difference between a conventional class of 30 and individual tuition can be mapped as a spectrum along which average attainment levels increase as class size diminishes, and there is a large body of literature describing this. A 2011 study by the Brookings Institution found that the average student:teacher ratio in public schools across all states in the US was 15.3 (the ratio has stayed much the same since then). Reducing this ratio by just one student would have cost about $12 billion a year in additional teachers’ salaries, as well as extensive costs in new classrooms and infrastructure. Calculations of student:teacher ratios include special needs provision, and so class sizes in public schools are on average close to 30, as in Bloom’s study. This is the major differentiator with private schooling in the USA, with significantly smaller class sizes in private schools, as well as lower student:teacher ratios. Because education in US public schools is free while private schooling costs upwards of $10,000 a year, access to smaller class sizes correlates with the ability to pay.

For South Africa, the policy benchmark for class sizes in public secondary schools is 37 for Grades 8 and 9. For the higher grades, leading up to the National Senior Certificate at the end of Grade 12, norms vary by subject: 35 for languages and 37 for mathematics. But because public schools are allowed to charge additional fees, there is wide variation from the official norms and standards, with Quintile 5 schools, serving the wealthiest 20%, offering class sizes that conform with policy expectations, and Quintile 1 to 3 schools having up to 80 students in a class.

Tim Köhler, of the University of Cape Town’s Development Policy Research Unit, has studied the relationship between class size and learning outcomes across this range of public schooling in South Africa. In a paper published in 2022, Köhler analysed a substantial dataset to see whether there is a relationship between academic attainment levels and a school’s socioeconomic status. Taking into account pass rates for the National Senior Certificate at the end of Grade 12, class sizes, and student-to-teacher ratios, he found that:

It is clear that wealthier schools on average have smaller class sizes and higher NSC pass rates. … Quintile 1–3 schools do not differ significantly from one another in these aspects, while there are 20 more learners in the average class in the poorest 60% of schools relative to the wealthiest 20%. This coincides with a more than 30 percentage point difference in inter-quintile NSC pass rates: just 56% of learners in Quintile 1–3 schools pass Grade 12 with an NSC, in contrast to 87% of learners in Quintile 5 schools.

This is shown in the graph below. For the Quintile 4 and 5 schools in Köhler’s dataset, it is clear that pass rates in the examinations at the end of Grade 12 go up as class sizes come down. But for the poorer schools in Quintiles 1–3, the measures of class size and attainment are flat, and there is no evident advantage to being in a slightly better-off Quintile 3 school rather than in a no-fee Quintile 1 school.


The indication that class size and levels of student attainment are not correlated in Quintiles 1 to 3, but are correlated in the wealthier Quintiles 4 and 5, shows us that the causal relationship between class size and attainment is complex. Köhler found that “in schools with a mean class size in the top 20% of the class size distribution, the average teacher teaches a class of just under 80 learners and is less likely to (i) have a postgraduate degree, (ii) have taken mathematics in Grade 12, (iii) be very confident in teaching their subject or phase, and (iv) have received training on supporting learners with learning difficulties”. This suggests that, rather than there being a simple causal relationship between class size and academic outcomes, as is often claimed by high-fee schools, class size is, in stats-speak, an “endogenous variable” which changes according to its relationship with the other variables that are at play.

Tim Köhler’s findings were anticipated by Bloom in his original formulation of the 2 sigma problem. Bloom’s point was not that smaller class size, in itself, resulted in better learning outcomes. It was rather that personalised tutoring provided the space for implementing the precise and time-consuming protocols of formative assessment and feedback – mastery learning – which would not be possible in a classroom of a single teacher and 80 students. Köhler puts it this way:

Importantly, the conclusion of this paper is not that class size does not matter. Rather, it is that changes in class sizes may not be effective in improving learner outcomes unless other factors change. In other words, the severity of these variables seems to merely be indicative of other important school factors that influence learner outcomes in the South African context.

1984 – the year in which Bloom presented the challenge of the 2 sigma problem – was also the year in which Apple launched the very first Mac with a now-famous 60-second commercial by Ridley Scott, in which a hammer-wielding woman frees the masses from George Orwell’s Big Brother, an allusion to the stranglehold of mainstream computing. But although the conceptual foundations for Artificial Intelligence had long been in place, there was widespread scepticism that it could be rolled out at scale or lead to viable commercial solutions. Consequently, and like the earlier “Turing Test”, Bloom’s 2 sigma problem remained a hypothetical challenge. The hammer blow was to come 38 years later, with the launch of ChatGPT in November 2022. Twelve months on, there is now a range of applications, in service or in beta testing, designed to provide automated feedback on testing and other forms of personalised tutoring for different levels of education.

In fields such as healthcare, developing generative AI applications is complex. For example, the Mayo Clinic is leading the way in implementing AI and innovations include Mayo’s “hospital at home”, which will automate the diversion of up to 30% of acute care emergencies away from hospital admissions, as well as an application that will provide patients with an interactive facility to obtain detailed and reliable responses to their symptoms. Both of these require extensive and wide-ranging AI training data as well as complex prompt engineering.

In contrast, the data sets required for training AI to respond to questions in secondary-level education are far more constrained. Curricula such as South Africa’s National Senior Certificate, or Britain’s A-Levels, are tightly defined and fully described. Banks of past examination papers, along with the memoranda that are used by examiners for marking, are readily available.

When it comes to prompt engineering, Bloom’s original formulation of mastery learning serves as a ready-made template for designing sets of instructions for automating personal tutoring. Thomas Guskey has provided a neat diagram showing how mastery learning should be implemented, which can also serve as a storyboard for an AI development project.

Mastery learning works best when the curriculum is broken down into units, each of which covers a closely defined set of content. As shown in the diagram above, early in each unit students take an initial formative assessment based on well-defined learning objectives, and receive feedback on their responses that identifies the areas on which they need to focus and for which they receive “correctives”.

Following this, students take what Bloom called a “parallel assessment”, which covers the same concepts and skills as the first but includes slightly different problems or questions. This second assessment serves both to establish that the “correctives” have served their purpose and to motivate students by showing them that they have moved forward in their learning.

Because students will move at different paces through a unit of learning, based on their levels of prior knowledge and their abilities, Bloom provided for enrichment activities allowing students who show a high level of competence in the initial assessment exercise to dive more deeply into the subject matter.

A personalised AI tutor working within this framework would first provide each student with feedback on the initial assessment – “Formative Assessment A” in the diagram. It would then direct the student towards the specific parts of the curriculum on which they need to concentrate in order to correct errors and bridge gaps in existing knowledge. Finally, the AI tutor would select appropriate items from a question bank for “Formative Assessment B”, establishing how far the student has moved on, and would provide customised feedback as the student moves on to the next unit in the sequence.
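The control flow described above can be sketched in code. The following is a minimal illustration of a mastery-learning unit as an AI tutor might sequence it; the function names, the question-bank structure, and the 80% mastery threshold are assumptions made for the sketch, not something specified by Bloom or Guskey:

```python
from dataclasses import dataclass, field

MASTERY_THRESHOLD = 0.8  # assumed cutoff; Bloom's protocol leaves this to the teacher


@dataclass
class Question:
    topic: str
    prompt: str
    answer: str


@dataclass
class UnitResult:
    score: float
    weak_topics: list = field(default_factory=list)


def grade(questions, responses):
    """Score a set of responses and record the topics answered incorrectly."""
    weak = [q.topic for q, r in zip(questions, responses) if r != q.answer]
    return UnitResult(score=1.0 - len(weak) / len(questions), weak_topics=weak)


def run_unit(bank, ask, teach_corrective, enrich):
    """One mastery-learning unit: assess, correct identified gaps, re-assess.

    `bank` maps each topic to a pair of questions (Formative Assessment A,
    Formative Assessment B); `ask`, `teach_corrective`, and `enrich` stand in
    for the AI tutor's interactions with the student.
    """
    # Formative Assessment A, with feedback via grading.
    assessment_a = [pair[0] for pair in bank.values()]
    result_a = grade(assessment_a, [ask(q) for q in assessment_a])

    # High initial competence: offer enrichment rather than correctives.
    if result_a.score >= MASTERY_THRESHOLD:
        enrich()
        return result_a

    # Targeted "correctives" on the specific gaps identified.
    for topic in result_a.weak_topics:
        teach_corrective(topic)

    # "Parallel" Formative Assessment B: same concepts, different questions.
    assessment_b = [bank[t][1] for t in result_a.weak_topics]
    return grade(assessment_b, [ask(q) for q in assessment_b])
```

In this sketch the generative-AI component would sit behind `ask`, `teach_corrective`, and `enrich` – producing feedback, correctives, and enrichment material – while the mastery-learning loop itself remains simple, deterministic bookkeeping.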

Today, the potential of generative AI promises to provide affordable personalised tutoring at scale, freeing teachers to teach and, with appropriate organisational changes, providing equity of access to learning across unequal school systems.

This would finally solve Benjamin Bloom’s 2 sigma problem, an outcome which he foresaw as “an educational contribution of the greatest magnitude”.


Allen, R., A. Benhenda, J. Jerrim and S. Sims (2019). New evidence on teachers’ working hours in England. An empirical analysis of four datasets. London, UCL Institute for Education.

Bloom, B. S. (1984). “The 2 Sigma Problem: the search for methods of group instruction as effective as one-to-one tutoring.” Educational Researcher 13(6): 4-16.

Guskey, T. (2007). “Closing Achievement Gaps: Revisiting Benjamin S. Bloom’s “Learning for Mastery”.” Journal of Advanced Academics 19(1): 8-31.

Hagemeijer, T. (2023). “Many talk about AI, Mayo Clinic is implementing it.” LinkedIn.

Köhler, T. (2022). “Class size and learner outcomes in South African schools: The role of school socioeconomic status.” Development Southern Africa 39(2): 126-150.

Lee, K.-F. and C. Qiufan (2021). AI 2041. Ten Visions for Our Future, Penguin.

Lohr, S. (2023). A.I. May Someday Work Medical Miracles. For Now, It Helps Do Paperwork. New York Times.

Whitehurst, G. J. and M. Chingos (2011). Class Size: What Research Says and What it Means for State Policy, Brown Center on Education Policy, Brookings Institution.
