I’d like to explore a project for NLP-based search on the fediverse. But I’m a fediverse beginner and am not sure whether it’s possible to index fediverse content.

My general idea is:

  1. Set up my own read-only instance, let’s say of kbin. I’m not sure if the concept of a read-only instance makes sense. It’s read-only because the instance only needs to be able to read the content already on the fediverse and doesn’t need the ability to post content.
  2. At some regular interval, let’s say once a day, check for any changes in the content since the previous run. I’m not sure if there is a single “fediverse” where all the content can be read from. If not, then I can start with tracking the same content as on kbin.social. Is it possible to monitor changes to content on a kbin instance?
  3. I’ll convert the content into vector embeddings using an NLP ML model like CLIP. The embeddings will be stored in a vector store. The vector store will also include the url of the content as metadata.
  4. When a user requests a search, the search term is converted to its vector embedding using the same ML model and the most similar vectors are identified.
  5. The user gets the search results as urls of the most relevant content, and perhaps a preview of the content. The user can then access the full content from where it’s originally posted using its url.
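To make steps 3 to 5 concrete, here is a minimal sketch of the embed, store, and search loop. It is only an illustration under loud assumptions: `embed` is a toy hashed bag-of-words stand-in for a real model (it is not CLIP or any actual embedding model), `VectorStore` is a hypothetical in-memory class rather than a real vector database, and the `example.invalid` urls are placeholders.

```python
import hashlib
import math
from collections import Counter

DIM = 1024  # assumed toy dimensionality; a real model fixes its own size


def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model (assumption, not CLIP).

    Hashes each token into one of DIM buckets and L2-normalises the result,
    so the dot product of two vectors is their cosine similarity.
    """
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorStore:
    """Hypothetical in-memory store: keeps only (vector, url), not the text."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []

    def add(self, text: str, url: str) -> None:
        # Step 3: embed the content and keep its url as metadata.
        self._rows.append((embed(text), url))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        # Step 4: embed the query with the same function, rank by cosine.
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), url)
            for vec, url in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    store = VectorStore()
    store.add("semantic search over fediverse posts", "https://example.invalid/post/1")
    store.add("recipes for sourdough bread", "https://example.invalid/post/2")
    # Step 5: the user only gets back scores and urls.
    for score, url in store.search("fediverse search", k=2):
        print(f"{score:.3f}  {url}")
```

A real setup would swap `embed` for an actual model and `VectorStore` for a proper vector database with approximate nearest-neighbour search; the brute-force scan here is only meant to show the data flow.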

I’m comfortable with setting up steps 3 and 4. But I don’t know the fediverse well enough to say whether steps 1, 2, and 5 would work, or even make sense, the way I’m envisioning them.
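For the "what changed since the last run" part of step 2, one option that does not depend on any particular fediverse API is to compare content fingerprints between runs. The sketch below assumes I already have some way of fetching posts as a url-to-text mapping, which is exactly the part I’m unsure about; `fingerprint` and `diff_snapshot` are hypothetical helper names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of a post's text, used to detect edits between runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshot(
    previous: dict[str, str],
    current_posts: dict[str, str],
) -> tuple[dict[str, str], list[str], list[str], list[str]]:
    """Compare this run's posts against the previous run's fingerprints.

    previous:      url -> fingerprint saved by the last run
    current_posts: url -> text fetched in this run (fetching is assumed)
    Returns the new fingerprint map plus lists of new, changed, and
    removed urls.
    """
    current = {url: fingerprint(text) for url, text in current_posts.items()}
    new = sorted(u for u in current if u not in previous)
    changed = sorted(
        u for u in current if u in previous and current[u] != previous[u]
    )
    removed = sorted(u for u in previous if u not in current)
    return current, new, changed, removed
```

Only the urls in `new` and `changed` would need to be re-embedded, and the ones in `removed` could be dropped from the vector store.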

Can some of the fediverse veterans help me understand if this is a feasible approach or if I’ve got it all wrong?

  • kjr@kbin.social
    1 year ago

    @ofcourse there are instances which defederated from another instance because it implemented free-text search. There is no agreement on that.
    Steps 1 and 2 are problematic, since the Fediverse is heterogeneous: not every instance federates with every other instance, and not all content is shared between different software (i.e. only part of the content on kbin can be accessed from Mastodon).
    Anyway, the approach is sound, maybe not for the whole fediverse, but for groups of instances which agree on a shared search engine.
    I’m not sure about the GPU requirements, but on CPU the updates could be very slow for the actual traffic.

    • ofcourse@kbin.socialOP
      1 year ago

      Thanks for sharing your insights.

      I’m curious why instances offering free-text search are defederated. I would have guessed everyone wants better search. Is it because of privacy concerns, or because instances don’t want to be indexed or have traffic directed elsewhere?

      I was hoping that if I index only for the purpose of embeddings (which would prevent recreating the original content) and only share urls to the content, it would eliminate the privacy and traffic concerns.

      I’m still in the process of understanding how, and if, this would work. It’s only a personal project at this stage, but you are right that CPU/GPU and vector stores are things I’d need to consider.