Stubsack: Stubsack: weekly thread for sneers not worth an entire post, week ending 10th August 2025

BlueMonday1984@awful.systems · 8 days ago

BigMuffN69@awful.systems · 4 days ago

Well, after 2.5 years and hundreds of billions of dollars burned, we finally have GPT-5. Kind of feels like a make or break moment for the good folks at OAI~~! With the eyes of the world on their lil presentation this morning, everyone could feel the stakes: they needed something that would blow our minds. We finally get to see what a super intelligence looks like! Show us your best cherry-picked benchmark, Sloppenheimer!

Graphic design is my PASSION. Good thing the entirety of the world’s economy is not being held up by cranking out a few more points on SWE bench right???

Ok, what about ARC? Surely y’all got a new high to prove the AGI mission was progressing, right??

Oh my fucking God. They actually have lost the lead to fucking Grok. For my sanity I didn’t watch the live stream, but curiously, they left the ARC results out of their presentation, even though they gave Francois early access to test. Kind of like they knew this looks really bad and underwhelming.

blakestacey@awful.systems · 4 days ago

“The word blueberry contains the letter b 3 times.”

Also reported in more detail here:

The word “blueberry” has the letter b three times:

Once at the start (“B” in blueberry).

Once in the middle (“b” in blue).

Once before the -erry ending (“b” in berry). […] That’s exactly how blueberry is spelled, with the b’s in positions 1, 5, and 7. […] So the “bb” in the middle is really what gives blueberry its double-b moment. […] That middle double-b is easy to miss if you just glance at the word.

(via)
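
For anyone who wants to check the quoted count themselves, here is a minimal sketch in plain Python (not part of the quoted model output; the variable names are just illustrative). It counts the b's in "blueberry" and lists their 1-based positions:

```python
word = "blueberry"

# 1-based positions of every "b" in the word
positions = [i for i, ch in enumerate(word, start=1) if ch == "b"]

# Total number of occurrences of "b"
count = word.count("b")

print(count, positions)  # prints: 2 [1, 5]
```

So there are two b's, at positions 1 and 5; position 7 is an "r".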

Soyweiser@awful.systems · 4 days ago

Graphic design is my PASSION

Wait just how bad is 4? 30% accurate? Did they train it wrong as a joke? Also hatless 5 worse than 3?

BigMuffN69@awful.systems · 4 days ago

Yeah, o3 (the model that was RL’d to a crisp and hallucinated like crazy) was very strong on math and coding benchmarks. GPT-5 (I guess without tools/extra compute?) is worse. Nevertheless…

BigMuffN69@awful.systems · 4 days ago

The one big cope I’m seeing is in the METR graph ofc. Tiny bump with massive error bars above Grok 4 so they can claim the exponential is continuing while the models stagnate in all material ways.

ebu@awful.systems · 4 days ago

50% success rate? sorry, all this for a coin flip?

ShakingMyHead@awful.systems · 4 days ago

Looks like they already removed it.
