Predicting Achievement in the Early Years: How Influential is ...


Teacher Assessment versus Exams
Peter Tymms
CEM, Durham University
www.cemcentre.org

Overview
The Issue
The importance of LAs, Schools and teachers
Fairness and bias
Coverage and sampling
Teacher assessment
Exams and tests
Predictive validity
Conclusions

The Issue
Teacher assessment is unfair because it is unreliable and biased.
Exams are simply snapshots and are unrepresentative of the work that has really been done.

Which matters most?
1. LA
2. School
3. Teacher
4. Pupil

Newcastle Commission: Data Sources
Several national datasets including ASPECTS, PIPS, MidYIS & YELLIS
KS1, KS2, KS3 & GCSE
Looked at value-added using 3-level multilevel models
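The slides do not give the model specification. As a sketch only, a three-level random-intercept value-added model with pupils (i) nested in schools (j) nested in LAs (k) might be written as below. Here x is a prior-attainment measure, and every symbol is assumed notation for illustration, not the author's.

```latex
\begin{align*}
y_{ijk} &= \beta_0 + \beta_1 x_{ijk} + v_k + u_{jk} + e_{ijk},\\
v_k &\sim N(0, \sigma_v^2), \quad
u_{jk} \sim N(0, \sigma_u^2), \quad
e_{ijk} \sim N(0, \sigma_e^2),\\
\text{share of variance at LA level} &= \frac{\sigma_v^2}{\sigma_v^2 + \sigma_u^2 + \sigma_e^2}.
\end{align*}
```

Under this assumed form, comparing the three variance components is one way to ask how much LAs, schools and pupils each vary once prior attainment has been taken into account.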

Example using KS2 English
[Figure: chart on a vertical scale from -3.00 to 2.00 with the categories Pupil raw, Pupil value-added, School raw, School value-added, LEA raw and LEA value-added, shown on six successive slides.]

Willms Diagram

The Teacher Effect
Repeated Boosts: Vocabulary
[Figure: Levels (0 to 5) plotted against Year (ER, Y1, Y2, Y3, Y4, Y5, Y6).]

Which matters most?
1. LA
2. School
3. Teacher
4. Pupil

Conclusion
Pupils vary enormously
Teachers have the greatest impact
Schools are relevant
Authorities hardly vary at all

Hypothesis
The best teachers will be best at judging their students

What is bias?
Bias appears in a test when part of an assessment is harder for a particular group,
or when an assessor systematically downgrades a group or an individual for construct-irrelevant reasons.

Example of item bias
Pigeon
Turtle

Examples of teacher bias
Anecdote
By sex (e.g. baseline & page 17 Harlen)
By ability (judgement anchored by experience)
By ethnicity (assault experiments)
By social class
By behaviour (origin of ability testing: Binet)
By age (EPICure study)
By incident, e.g. spilling a glass of water

The halo (or horns) effect (e.g. P scales)

P Scales in 2004
[Table: correlations between P scale ratings. Column labels: speak., listen., read., write., using, number, shapes, sc. Row labels: speaking, listening, reading, writing, using, number, shape, sci. enq, life proc., mat. prop, phys. proc. Row and column alignment is not recoverable; the values in source order are:
0.98 0.86 0.87 0.80 0.84 0.86 0.78 0.80 0.79
0.78 0.86 0.86 0.80 0.84 0.86 0.78 0.80 0.79 0.78 0.93
0.79 0.85 0.85 0.75 0.76 0.75 0.75 0.81 0.87 0.87 0.78 0.79
0.78 0.78 0.89 0.89 0.82 0.81 0.82 0.82 0.93 0.82 0.82
0.83 0.82 0.83 0.83 0.84 0.84]

Teacher reliability
How should reliability be assessed?
By looking at the internal consistency of judgements?
By looking at the link to external assessments?
By comparing over time?
By comparing one teacher with others?
Facets model within Rasch measurement

Trusting teachers' judgement
Harlen 2005
The findings of the review by no means constitute a ringing endorsement of teachers' assessment; there was evidence of low reliability and bias in teachers' judgements

5-14, Portfolios & single level tests
5-14 assessments
What about portfolios? Inter-rater reliability very low for maths and writing
English teacher levels in SATs: early 1990s, considerable error; later, quite common to find teacher = test results
Single level tests compromised by teacher judgement

Is it OK for teachers to assess their own pupils for High Stakes exams?
How does the power to grade affect relationships?
Would you give McEnroe a B?

Exam/test reliability
Typically around 0.9, but
Distinguish the assessment of
Convergent questions
Divergent questions

Exam/test bias
Pre-tests are often used to address issues of bias
But we put much reliance on judgement.
England's major exams are largely not pre-tested.

Are Exams inappropriate snapshots?

Issue 1: Questions must be representative samples of the course under exam conditions.
Issue 2: Constraint on the nature of the assessment
Multi-method Multi-trait challenge
Issue 3: Impact of stress on performance
Positive & Negative (links to introversion)
[Figure: curves for Introvert and Extrovert; axes labelled Effort and Stimulus.]

We need to match format to content
Some things must be assessed by judgement:
Social interactions
Quality of research
Poetry
Art

Some things are best left to tests:
Mental arithmetic
Spelling
Phonological awareness
Diagnostic assessments (e.g. INCAS)
Even so, perhaps there is a final arbiter

Predictive validity
Developed ability test (MidYIS/IQ/etc.)
Attainment test (Std Grade/Highers)
Later success: degree, salary, etc.
Teacher Grade

We need the evidence but ...
Prediction is often poor
Two major reasons

Prediction of Educational Achievement
[Figure: Later Achievement plotted against Prior Achievement, repeated across slides with the annotations below.]
Correlation = 0.7
Select top 15%: Correlation = 0.39
Cream top 3%: r = 0.19

So, poor prediction because of
Prior selection
Variable outcome measures
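The drop in correlation after selection (0.7 for everyone, 0.39 once only the top 15% on prior achievement are kept) is a restriction-of-range effect, and a small simulation can illustrate it. The sketch below is not the author's analysis. It assumes bivariate normal scores with a population correlation of 0.7, and it reads "cream top 3%" as removing the highest 3% from the selected group, which is only one possible interpretation. All function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=100_000, rho=0.7, seed=1):
    """Draw (prior, later) pairs from a bivariate normal with correlation rho (assumed)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        prior = rng.gauss(0.0, 1.0)
        later = rho * prior + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((prior, later))
    return pairs


def correlation_in_band(pairs, lower_pct, upper_pct=100.0):
    """Correlation among pupils whose prior score lies between two percentiles."""
    ranked = sorted(pairs)  # tuples sort by prior achievement first
    lo = int(len(ranked) * lower_pct / 100.0)
    hi = int(len(ranked) * upper_pct / 100.0)
    band = ranked[lo:hi]
    return pearson([p for p, _ in band], [l for _, l in band])


pairs = simulate()
r_all = correlation_in_band(pairs, 0)            # whole cohort
r_top15 = correlation_in_band(pairs, 85)         # top 15% on prior achievement
r_creamed = correlation_in_band(pairs, 85, 97)   # top 15% minus the top 3% (assumed reading)

print(f"whole cohort: {r_all:.2f}")
print(f"top 15% only: {r_top15:.2f}")
print(f"top 15% with top 3% removed: {r_creamed:.2f}")
```

Under these assumptions the whole-cohort correlation comes out close to 0.7 and the top-15% correlation close to 0.4, with a further fall once the highest scorers are removed. The exact figures on the slides presumably come from the author's own data, so the simulation need not reproduce them.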

Conclusion: Judgements or tests?
Should we do both? (Profiles)
But, how do we ensure that judgements and tests are independent?
How can judgements be kept free from bias?
Virtually impossible in high stakes tests
Essential for formative work

References

Campbell, D. T., & Fiske, D. W. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56, 81-105.

Cooper, B. (1998). Using Bernstein and Bourdieu to understand children's difficulties with "realistic" mathematics testing: an exploratory study. Qualitative Studies in Education, 11(4), 511-532.

Eysenck, H. J. (2006). The Biological Basis of Personality. Transaction Publishers.

Harlen, W. (2005). Trusting teachers' judgement: research evidence of reliability and validity of teachers' assessment used for summative purposes. Research Papers in Education, 20(3), 245-270.

Johnson, S., Hennessy, E., Smith, R., Trikic, R., Wolke, D., & Marlow, N. (2009). The EPICure Study: Academic attainment and special educational needs in extremely preterm children at 11 years. London: Nottingham/London/Warwick.

Koretz, D., Stecher, B. M., Klein, S. P., & McCaffrey, D. (1994). The Vermont Portfolio Assessment Program: findings and implications. Educational Measurement: Issues & Practice, 13, 5-16.

Tymms, P. (1997). Value-added Key Stage 1 to Key Stage 2. London: School Curriculum and Assessment Authority.

Tymms, P., Jones, P., Albone, S., & Henderson, B. (2009). The first seven years at school. Educational Assessment, Evaluation and Accountability, 21, 67-80.

Tymms, P., Merrell, C., Heron, T., Jones, P., Albone, S., & Henderson, B. (2008). The importance of districts. School Effectiveness and School Improvement, 19(3), 261-274.

Tymms, P., Merrell, C., & Jones, P. (2004). Using baseline assessment data to make international comparisons. British Educational Research Journal, 30(5), 673-689.

Willms, J. D. (1987). Differences Between Scottish Educational Authorities in their Examinations Attainment. Oxford Review of Education, 13(2), 211-232.
