Around a month ago I posted a poll on this sub asking for feedback on a Coomer.su and kemono.su scraper I’ve been developing. This post is an update on where development is going.

For anybody unaware, I have been working on scraping software that lets you mass download posts from creators on both kemono and Coomer. This is not a built-in feature of either website, which I found somewhat stupid, so I set out to create my own tool.

In my previous post, I talked about the basic features the scraping software would have, and many people pointed out that similar software already exists. After taking a look at the software suggested to me, I felt it did not meet my expectations and quality standards, so I continued with this project.

The major driving factor of this scraping software is the translator I have integrated directly into the codebase, which lets post titles and descriptions be translated as they are scraped, courtesy of Google Translate. This feature has exceeded my expectations. The only downside is Google’s (fair) rate limit, which can kick in if you translate too many words. This typically only happens with post descriptions and takes upwards of 1,000 words to trigger, so I feel it is okay in its current state. For the time being there is a toggle in the code for translating post descriptions, which defaults to off. I may add automatic service switching in the future, but for right now it should work well. The translator allows anybody speaking any language to scrape from the PartySites, which is invaluable if your language isn’t widely used on the sites.
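
To make the toggle and the word limit concrete, here is a minimal sketch in Python (for brevity; the actual project is C#). Everything in it is an assumption made for illustration, not the project's real code: the `PostTranslator` class, the injected `translate` callable standing in for whichever translation service is used, and the 1,000-word budget.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PostTranslator:
    """Always translates titles; translates descriptions only when enabled."""

    translate: Callable[[str], str]        # hypothetical backend wrapper
    translate_descriptions: bool = False   # off by default, as described above
    word_budget: int = 1000                # assumed cutoff, not a documented limit
    words_used: int = 0

    def _translate_within_budget(self, text: str) -> str:
        words = len(text.split())
        if self.words_used + words > self.word_budget:
            # Over budget: keep the original text instead of risking a rate limit.
            return text
        self.words_used += words
        return self.translate(text)

    def process(self, title: str, description: str) -> tuple[str, str]:
        new_title = self._translate_within_budget(title)
        if self.translate_descriptions:
            description = self._translate_within_budget(description)
        return new_title, description
```

Passing the backend in as a callable keeps the budget logic independent of any one service, which is also roughly where automatic service switching could slot in later.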

I’ve also ported the codebase over to a C# .NET 6 class library for developers, allowing them to create their own scraping software if desired. The project currently has an attached GUI that I am working on refining for the general public.

As I’ve stated before, the concept of this project is extremely simple, with the codebase itself compiling to a meager 18 KB excluding libraries, so it surprises me that nobody has built this to an acceptable standard yet.

I plan to release this scraper in the coming weeks, once some bugs are sorted out and Discord support is possibly added.

Please let me know what you’d like to see in this, as feedback is always appreciated.

  • arr@lemmy.dbzer0.com
    1 year ago

    Have you asked the operators of those sites if they are fine with this?

    It would be a shame if the sites were taken down because people scraping them caused excessive traffic costs or something.

    • Fucky Wucky@lemmy.world (OP)
      1 year ago

      Hello,

      I had this same thought, and as I’ve stated in the original post, when this goes public the creators are more than welcome to shoot me a message on GitHub and I’d happily remove it.

      However, this project keeps HTTP requests to a minimum and doesn’t behave very differently from a normal user browsing the website. The only real load cost is on their CDN server, which is probably designed for high-traffic environments.

      Out of respect for the developers, I can also modify the user agent of the HTTP requests so they could filter them based specifically on this application if that’s an approach they’d be okay with.

      • arr@lemmy.dbzer0.com
        1 year ago

        > when this goes public the creators are more than welcome to shoot me a message on GitHub and I’d happily remove it.

        > The only real load cost is on their CDN server which is probably designed for high traffic environments.

        > I can also modify the user agent of the HTTP requests so they could filter them based specifically on this application if that’s an approach they’d be okay with.

        Why not just message them at their contact email address and ask in advance whether your assumption about their CDN server is true, whether you should set a specific user agent, etc.? Then they wouldn’t have to potentially waste time figuring out what’s happening, writing and deploying filtering/rate-limiting logic, or finding the repository and contacting you on GitHub.