• 4 Posts
  • 140 Comments
Joined 2 years ago
cake
Cake day: August 29th, 2023

help-circle


  • I would give it credit for being better than the absolutely worthless approach of “scoring well on a bunch of multiple choice question tests”. And it is possibly vaguely relevant for the pipe-dream end goal of outright replacing programmers. But overall, yeah, it is really arbitrary.

    Also, given how programming is perceived as one of the more in-demand “potential” killer-apps for LLMs and how it is also one of the applications it is relatively easy to churn out and verify synthetic training data for (write really precise detailed test cases, then you can automatically verify attempted solutions and synthetic data), even if LLMs are genuinely improving at programming it likely doesn’t indicate general improvement in capabilities.



  • So this blog post was framed positively towards LLM’s and is too generous in accepting many of the claims around them, but even so, the end conclusions are pretty harsh on practical LLM agents: https://utkarshkanwat.com/writing/betting-against-agents/

    Basically, the author has tried extensively, in multiple projects, to make LLM agents work in various useful ways, but in practice:

    The dirty secret of every production agent system is that the AI is doing maybe 30% of the work. The other 70% is tool engineering: designing feedback interfaces, managing context efficiently, handling partial failures, and building recovery mechanisms that the AI can actually understand and use.

    The author strips down and simplifies and sanitizes everything going into the LLMs and then implements both automated checks and human confirmation on everything they put out. At that point it makes you question what value you are even getting out of the LLM. (The real answer, which the author only indirectly acknowledges, is attracting idiotic VC funding and upper management approval).

    Even as critcal as they are, the author doesn’t acknowledge a lot of the bigger problems. The API cost is a major expense and design constraint on the LLM agents they have made, but the author doesn’t acknowledge the prices are likely to rise dramatically once VC subsidization runs out.


  • Is this “narrative” in the room with us right now?

    I actually recall recently someone pro llm trying to push that sort of narrative (that it’s only already mentally ill people being pushed over the edge by chatGPT)…

    Where did I see it… oh yes, lesswrong! https://www.lesswrong.com/posts/f86hgR5ShiEj4beyZ/on-chatgpt-psychosis-and-llm-sycophancy

    This has all the hallmarks of a moral panic. ChatGPT has 122 million daily active users according to Demand Sage, that is something like a third the population of the United States. At that scale it’s pretty much inevitable that you’re going to get some real loonies on the platform. In fact at that scale it’s pretty much inevitable you’re going to get people whose first psychotic break lines up with when they started using ChatGPT. But even just stylistically it’s fairly obvious that journalists love this narrative. There’s nothing Western readers love more than a spooky story about technology gone awry or corrupting people, it reliably rakes in the clicks.

    The call narrative is coming from inside the house forum. Actually, this is even more of a deflection, not even trying to claim they were already on the edge but that the number of delusional people is at the base rate (with no actual stats on rates of psychotic breaks, because on lesswrong vibes are good enough).







  • Here’s a LW site dev whining about the study, he was in it and i think he thinks it was unfair to AI

    There a complete lack of introspection. It seems like the obvious conclusion to draw from a study showing people’s subjective estimates of their productivity with LLMs were the exact opposite of right would inspire him to question his subjectively felt intuitions and experience but instead he doubles down and insists the study must be wrong and surely with the latest model and best use of it it would be a big improvement.







  • I think we mocked this one back when it came out on /r/sneerclub, but I can’t find the thread. In general, I recall Yudkowsky went on a mini-podcast tour a few years back. I think the general trend was that he didn’t interview that well, even by lesswrong’s own standards. He tended to simultaneously assume too much background familiarity with his writing such that anyone not already familiar with it would be lost and fail to add anything actually new for anyone already familiar with his writing. And lots of circular arguments and repetitious discussion with the hosts. I guess that’s the downside of hanging around within your own echo chamber blog for decades instead of engaging with wider academia.


  • For purposes of something easily definable and legally valid that makes sense, but it is still so worthy of mockery and sneering. Also, even if they needed a benchmark like that for their bizarre legal arrangements, there was no reason besides marketing hype to call that threshold “AGI”.

    In general the definitional games around AGI are so transparent and stupid, yet people still fall for them. AGI means performing at least human level across all cognitive tasks. Not across all benchmarks of cognitive tasks, the tasks themselves. Not superhuman in some narrow domains and blatantly stupid in most others. To be fair, the definition might not be that useful, but it’s not really in question.