I wonder if a longer bench run (like -bench 360) would show a clearer difference.
Hyperthreading may not make as much difference as I remember; newer CPUs may have lessened any penalties for non-multithreaded apps.
With a lot of games there is a big difference between a short benchmark and a longer one. Many games have long boot sequences that can increase, or in a few cases decrease, benchmark scores. Many also have fairly static title and high-score screens, so a longer benchmark often gives a closer-to-real-world result because more of the run is spent in attract mode.
Back when I was benchmarking MAME all the time, I used -bench 240 for this reason.
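To put rough numbers on why run length matters, here is a small Python sketch. It does not run MAME. It just models a game with a fast boot sequence followed by a slower attract mode and works out what average speed a benchmark of a given length would report. The boot length and the two speed figures are made-up values for illustration only, and the function names are hypothetical.

```python
# Illustrative model only: all numbers below are invented, not measured.
BOOT_SECONDS = 30        # assumed length of the boot sequence (emulated seconds)
BOOT_SPEED = 600.0       # assumed speed during boot, in percent of real time
ATTRACT_SPEED = 250.0    # assumed speed during attract mode, in percent


def overall_speed(phases):
    """Average speed in percent for a list of (emulated_seconds, speed_percent).

    Average speed is emulated time divided by wall-clock time, so each phase
    contributes emulated_seconds / (speed / 100) seconds of wall-clock time.
    """
    emulated = sum(seconds for seconds, _ in phases)
    wall = sum(seconds / (percent / 100.0) for seconds, percent in phases)
    return 100.0 * emulated / wall


def bench_result(bench_seconds):
    """Modelled average speed for a benchmark lasting bench_seconds."""
    boot = min(bench_seconds, BOOT_SECONDS)
    attract = bench_seconds - boot
    phases = [(boot, BOOT_SPEED)]
    if attract > 0:
        phases.append((attract, ATTRACT_SPEED))
    return overall_speed(phases)


for seconds in (60, 240, 360):
    print(f"-bench {seconds}: {bench_result(seconds):.1f}%")
```

With these assumed numbers the short run is inflated by the fast boot phase, and the longer runs settle toward the attract-mode speed. A game with a slow boot sequence would skew the other way.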
If all else fails, Burn the manual.