How Claude Opus 4.5 Found A Loophole In An Airline Policy Test That Even The Benchmark’s Creators Hadn’t Anticipated
Claude Opus 4.5 has topped coding and agentic-use benchmarks, but it has also displayed some qualities that benchmark creators hadn’t anticipated. During customer…








