Emergent Analogical Reasoning in Large Language Models: A Replication with Open-weights Alternatives
Webb, Holyoak & Lu (2023) compared human reasoners and GPT-3 on several tasks involving analogy resolution, documenting human-level or superhuman performance in most conditions. In this direct replication, we tested a different, open-weights language model (Mixtral-8x7B) on the same materials (Experiments 1 and 2, “Digit Matrices” and “Letter String” problems) or on an augmented dataset to obtain the desired statistical power (Experiment 4, “Story Analogies”). Our replication confirmed the sign and statistical significance of the reported effects in the Digit Matrices and Story Analogies problems, whereas the model did not surpass the human baseline in the Letter String problems