Generate 5 thoughts, prune 3, branch, repeat. I think that’s what o1 pro and o3 do.
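
A rough sketch of that loop, not a claim about how o1 pro or o3 work internally. It reads "generate 5, prune 3" as: every surviving thought spawns `width` children, then only the best `keep` overall go on to the next round. `tree_of_thoughts`, `generate`, and `score` are hypothetical names; in practice `generate` would call the LLM and `score` would be whatever rates a thought.

```python
from typing import Callable, List


def tree_of_thoughts(
    root: str,
    generate: Callable[[str, int], List[str]],  # hypothetical: ask the model for n follow-up thoughts
    score: Callable[[str], float],              # hypothetical: rate a thought, higher is better
    width: int = 5,
    keep: int = 2,
    depth: int = 3,
) -> str:
    """Generate `width` thoughts per survivor, keep the best `keep`, repeat `depth` times."""
    frontier = [root]
    for _ in range(depth):
        candidates: List[str] = []
        for thought in frontier:
            candidates.extend(generate(thought, width))  # branch
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)         # best first
        frontier = candidates[:keep]                     # prune the rest
    return max(frontier, key=score)
```

With `width=5` and `keep=2` this prunes 3 of 5 on the first round and 8 of 10 on later rounds, so the cost per round stays constant instead of growing with depth.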

  • hendrik@palaver.p3x.de
    12 hours ago

    Doesn’t seem too hard to me. I personally didn’t. And it’s kind of hard to track what happened, with all the articles on DeepSeek.

    I’d just take some prompt/agent framework like LangChain. That has had Chain of Thought prompting built in for quite some time already. And then connect it to R1. That should do it. Maybe the thinking blocks need to be handled differently, idk.

    • artificialfish@programming.devOP
      6 hours ago

      Well, I think you actually need to train a “discriminator” model on rationality tests. Probably an encoder-only model like BERT, just to assign a score to thoughts. Then you do Monte Carlo tree search.
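
      For reference, a minimal sketch of the bookkeeping that makes a tree search "Monte Carlo": per-node visit counts and accumulated rewards, with UCB1 picking which branch to explore next. It assumes the discriminator's score is used as the reward; `Node`, `ucb1`, `select`, and `backpropagate` are illustrative names, and expansion (asking the LLM for child thoughts) is left out.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    thought: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0      # how often this branch was explored
    total: float = 0.0   # sum of rewards seen below this node


def ucb1(node: Node, c: float = 1.4) -> float:
    """Average reward plus an exploration bonus for rarely visited nodes."""
    if node.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = node.total / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def select(node: Node) -> Node:
    """Walk down from the given node, following the highest UCB1 value, until a leaf."""
    while node.children:
        node = max(node.children, key=ucb1)
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward (assumed here: the discriminator's score) to every ancestor."""
    while node is not None:
        node.visits += 1
        node.total += reward
        node = node.parent
```

A full search would loop over select, expand, score, and backpropagate; the statistics gathered over many such passes are what distinguish this from a plain keep-the-top-k search.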

      • hendrik@palaver.p3x.de
        23 minutes ago

        Can’t you feed that back into the same model? I believe most agentic pipelines just use a regular LLM to assess and review the answers from the previous step. At least that’s what I’ve seen in these CoT examples. I believe training a model on rationality tests would be quite hard, as this requires understanding the reasoning and context, and having the domain-specific knowledge available… Wouldn’t that require a very smart LLM? Or just the original one (R1), since that was trained on… well… reasoning? I’d just run the same R1 as “distillation” and tell it to come up with a critique and give a final rating of the previous idea in a machine-readable format (JSON). After that you can feed it back again and have the LLM decide on two promising ideas to keep and follow. That’d implement the tree search. Though I’d argue this isn’t Monte Carlo.
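
        Roughly what the "rate in JSON, keep two" step could look like, as a sketch under assumptions: the reviewer is prompted to answer with an object like `{"critique": "...", "rating": 7}` (that schema is made up here), and any `<think>...</think>` block in the reply is stripped before parsing. `parse_rating` and `keep_best` are hypothetical helpers; the actual call to the model is left out.

```python
import json
import re
from typing import List, Optional, Tuple

# Assumption: the reasoning part of the reply is wrapped in <think>...</think>.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_rating(raw_reply: str) -> Optional[float]:
    """Return the numeric rating from a review reply, or None if it is unusable."""
    cleaned = THINK_RE.sub("", raw_reply).strip()
    try:
        data = json.loads(cleaned)
        return float(data["rating"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed JSON or missing/odd rating: drop this review


def keep_best(reviews: List[Tuple[str, str]], keep: int = 2) -> List[str]:
    """Given (idea, raw_review_reply) pairs, return the `keep` highest-rated ideas."""
    rated = []
    for idea, raw_reply in reviews:
        rating = parse_rating(raw_reply)
        if rating is not None:
            rated.append((rating, idea))
    rated.sort(key=lambda pair: pair[0], reverse=True)
    return [idea for _, idea in rated[:keep]]
```

Ideas whose review can't be parsed are simply dropped here; retrying the review prompt would be the other obvious choice.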