• essteeyou@lemmy.world
    11 hours ago

    This is surely trivial to detect. If the number of pages on the site is greater than some insanely high number, then just drop all data from that site from the training data (rough sketch at the end of this comment).

    It’s not like I can afford to compete with OpenAI on bandwidth, and they’re burning through money with no cares already.
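
    A rough sketch of that kind of filter; the threshold and the site numbers here are made up purely for illustration:

    ```python
    # Hypothetical crawler-side filter: if a single site claims an implausible
    # number of pages, drop everything from it before training.
    MAX_PAGES_PER_SITE = 5_000_000  # made-up cutoff

    site_page_counts = {
        "smallblog.example": 1_200,
        "infinite-maze.example": 980_000_000,  # suspiciously huge: likely generated filler
    }

    kept_sites = [
        site for site, pages in site_page_counts.items()
        if pages <= MAX_PAGES_PER_SITE
    ]
    print(kept_sites)  # ['smallblog.example']
    ```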

    • bane_killgrind@slrpnk.net
      9 hours ago

      Yeah, sure, but when do you stop gathering regularly constructed data when your goal is to grab as much as possible?

      Markov chains are an amazingly simple way to generate data like this, and with a little bit of stacked logic on top it’s going to be indistinguishable from real large data sets.
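
      A minimal sketch of the idea (word-level, order-2 chain; the seed text and names are placeholders, not anyone’s actual generator):

      ```python
      import random
      from collections import defaultdict

      def build_chain(text, order=2):
          """Map each `order`-word window to the words that follow it."""
          words = text.split()
          chain = defaultdict(list)
          for i in range(len(words) - order):
              chain[tuple(words[i:i + order])].append(words[i + order])
          return chain

      def generate(chain, order=2, length=60):
          """Walk the chain to spit out plausible-looking filler text."""
          out = list(random.choice(list(chain.keys())))
          for _ in range(length):
              followers = chain.get(tuple(out[-order:]))
              if not followers:  # dead end: hop to a random state and keep going
                  out.extend(random.choice(list(chain.keys())))
                  continue
              out.append(random.choice(followers))
          return " ".join(out)

      # Placeholder seed; in practice you'd feed it real prose from anywhere.
      seed = (
          "the crawler requests another page and the server responds with more "
          "generated text and the crawler requests another page again"
      )
      print(generate(build_chain(seed)))
      ```

      Template the output into pages and, at bulk-scrape scale, it passes for real content.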

        • yetAnotherUser@lemmy.ca
          6 hours ago

          The boss fires both, “replaces” them with AI, and tries to sell the corposhill’s dataset to companies that make AIs that write generic fantasy novels.

    • Korhaka@sopuli.xyz
      5 hours ago

      You can compress multiple TB of nothing, with the occasional meme mixed in, down to a few MB.
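
      For a sense of the ratio (a single deflate pass tops out near 1000:1, so getting from TB down to MB takes nesting or sparse-file tricks, but the point stands), here’s a quick, purely illustrative check:

      ```python
      import zlib

      # 100 MB of "nothing" standing in for a much larger stream of filler pages.
      filler = b"\x00" * (100 * 1024 * 1024)

      packed = zlib.compress(filler, 9)
      print(f"original:   {len(filler):>12,} bytes")
      print(f"compressed: {len(packed):>12,} bytes")
      print(f"ratio:      {len(filler) / len(packed):,.0f}:1")
      ```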