So, in the era of increasingly good AI-powered tools and general search engines full of SEO spam, last week I started building something a little old-school and against the trends.
For now it’s a have-fun-and-find-out project whose main aim is to provide good search results for general web development queries, with a special focus on independent blog authors.
The thesis is that with no SEO spam websites in the index, most of the annoying noise you get on Google/Bing is already filtered out.
Search results are grouped by type: docs, blogs, and magazines (e.g. blog platforms or bigger websites).
For now it’s far from done in terms of having a full index, but in most cases it already replaces my go-to search engine when I’m looking things up during work.
I’m looking forward to hearing what y’all think, and if you think it makes sense overall I can only encourage you to post some links to blogs or docs that are still missing from the index. I’m more than happy to add them to the crawler.
Responses like “nah, total shit, who would need that” are also accepted, but constructive critique is more appreciated ;)
EDIT: many thanks, everyone, for all your voices and comments. I’m super grateful for all of them and happy that we have a place like Lemmy!
I like how the first queries you guys make are attempts to SQL-inject and XSS it.
EDIT: if you find something let me know, PRs also welcomed ;)
There was a programming search engine called Symbol Hound that allowed searching for special symbols like << and &&. It was my fallback search engine while programming if I couldn’t find something on the first page of Google. Sadly, that site appears to have disappeared. Does this search engine have optional support for special characters?
If it does, it’s totally accidental.
What’s the use case for searching for those kinds of symbols? I’ll check if I can tune it for this.
When you want to know the name of the operator for a language.
Like “what does & mean in c++?”.
“&” isn’t too bad, but some of them can be difficult (like “JavaScript ??”).
And if you don’t know it’s called a reference operator, or a nullish coalescing operator, or whatever… trying to learn what it does can be downright impossible.
For ?? I guess it already has decent results. I’ll periodically check those kinds of cases once the index gets more languages.
?? is intuitive if you (truly) understand ||
Yea, I don’t know why people want to learn things - it’s easier to just know them! /s
Have you tried divining syntax from reading the entrails of a sacrificed goat?
It works on my machine
What? The previous comment said it’s “downright impossible to learn” which is nonsense.
|| means “Evaluate and use the left operand, unless it’s falsy, in which case evaluate and use the right operand instead”
?? means “Evaluate and use the left operand, unless it’s nullish, in which case evaluate and use the right operand instead”
They’re the same thing except for which values fall through to the second operand
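That difference is easy to see in a few lines (a minimal sketch — any browser console or Node will run it):

```javascript
// || falls back when the left side is *falsy*:
// false, 0, "", null, undefined, NaN all trigger the fallback.
console.log(0 || "fallback");   // "fallback" — 0 is falsy
console.log("" || "fallback");  // "fallback" — "" is falsy

// ?? falls back only when the left side is *nullish*:
// just null and undefined.
console.log(0 ?? "fallback");         // 0 — 0 is not nullish
console.log("" ?? "fallback");        // "" — "" is not nullish
console.log(null ?? "fallback");      // "fallback"
console.log(undefined ?? "fallback"); // "fallback"
```

Which is exactly why ?? was added: defaulting with || silently swallows legitimate values like 0 and "".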
Yea, I’m well aware.
But you had to learn that somewhere… nothing is intuitive for a novice.
Exactly. We all learnt it, so “downright impossible to learn” is bullshit
I think the main issue, as well as my main question, is around scope.
You say it targets web developers, but the current index is quite narrow. So will you accept significant expansion of it, as long as it may be relevant to web developers? Where would you draw the line on mixed content or technologies?
The ASP.NET docs are definitely docs for web developers, but maybe not what you had in mind. Would they apply? The docs are hosted on a platform with a lot of other docs from the dotnet space. Some may be relevant to “web developers”, others not. And the line is subjective and dynamic.
My website has some technological development resources and blog posts, but also very different things. Would that fit into scope or not?
How narrow or broad would you make the index?
I guess it’s an index for search, so noise shouldn’t be a problem as long as there are gains through quality content.
It’s still an MVP, a work in progress, hence the index is not “full”.
For me, “web development” is everything that we might need for, well, the web. Servers, Mongo docs — it all goes into the index (I’m adding to it every day, basically, but it also takes some time to index stuff, and I observe how this whole thing works as the index grows).
ASP.NET goes into the index of course. If your website has dev resources and blog posts that would go into it as well. Recently one person suggested tons of Haskell blogs and they are being indexed as we speak.
I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low-quality content. It’s also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.
Where would you draw the line on mixed content or technologies
For now the line is: does this website have anything that web devs would need? Yes? Then it might get in.
If it’s a blog about locomotive CPU programming, then maybe not — although mostly due to infrastructure costs. Indexing costs in the end, but having some unrelated stuff in the index shouldn’t hurt the results.
All of what I wrote is the state for today. I change my mind often, as it’s still in the “having fun” stage.
PS. also thanks for the feedback!
I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low-quality content. It’s also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.
Yeah, a public platform is unlikely to provide consistent content. If curation is not an explicit goal and practice there, I would not include them for the reasons you mentioned.
If indexing happened not per domain but with more granular filters — URL base paths — that may be viable. E.g. indexing specific authors on dev.to.
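The idea boils down to an allowlist of URL prefixes instead of whole domains. A hypothetical sketch (not how kukei-spider actually works; the author handles and the `allowedPrefixes` name are made up):

```javascript
// Hypothetical per-source allowlist: instead of whitelisting all of
// dev.to, only URLs under trusted authors' base paths get crawled.
const allowedPrefixes = [
  "https://dev.to/some_trusted_author/", // made-up author handle
  "https://developer.mozilla.org/",      // whole-domain entries still work
];

function shouldCrawl(url) {
  return allowedPrefixes.some((prefix) => url.startsWith(prefix));
}

console.log(shouldCrawl("https://dev.to/some_trusted_author/a-good-post")); // true
console.log(shouldCrawl("https://dev.to/spammy_author/seo-junk"));          // false
```

The same prefix check would cover GitHub/GitLab subpaths too, since a repo is just a base path on the host.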
Good idea. I once had the thought to do some narrow indexing of websites — e.g. Stack Overflow is a big issue: indexing all of it is crazy, while picking specific tags feels like tons of work. In the end I adjust the whole project as it grows, with the hope that after every tuning it gets better.
As long as I have fun with it I’ll continue :D
Of course — cutting scope is a good call to keep it manageable and fun, and not end up with scope creep and the very thing you wanted to evade in the first place. :)
This is a cool idea! I did notice that on mobile the search results are wider than the viewport, and if I had a feature request it would be to make them way, way more compact — but that might just be me hah.
You should also check out the Lenses feature that Kagi has, I think every search engine needs that feature now hah. I bookmarked your site for the next time I am searching for sure though!
Thanks for the comments. I’ll fix the mobile view and will definitely redesign it all a bit over the weekend. I see a lot of room for improvement.
Also will check how to submit it to Lenses. Highly appreciate it!
EDIT: mobile view is fixed; I also made some small adjustments to the whitespace between result items.
Yeah! Granted I have an iPhone 12 which is small for a modern phone but I figured I should mention it :D
I have been thinking more about this idea and I love it even more. I feel like domain-specific search engines are going to be more and more important in the future as the results of the major search engines get worse and worse.
Awesome work!
I’m on iPhone 12 mini. I love that small design and I strongly believe phones should be small.
Thanks for the good words! Highly appreciate it!
Index categories are blog, docs, magazines. Have you considered indexing source code websites?
- https://source.dot.net/ provides a web UI for exploring dotnet source code
I thought I would remember a second one, but I can’t recall it right now.
Subpaths on GitHub and GitLab would work in a similar fashion but would require more specific filters — unless the projects are hosted on dedicated instances.
Project issue tickets may also be very relevant to developer searches!?
Great ideas. For the source code I’m not sure but I’ll put it to the backlog of cool things I get from Lemmy and work on them one by one. Thanks!
Well, I think this may not be a bad idea at all. However, what would really stop me from using your search engine is if my search queries (or anything else I send) were somehow tied to me and/or sold to someone. Please don’t be like Google, Microsoft, or OpenAI.
Ah, this will never happen. I have zero motivation to do any GDPR stuff in this project. Even for analytics I anonymize visitors’ IPs, so Plausible doesn’t get them.
Also, in this case it would be nonsense. For a general search engine it makes sense that Bing knows I’m after parceljs when I type “parcel”, instead of shipping companies. For such a narrow search engine the user persona is already known.
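For the curious, IP anonymization of the kind mentioned above is usually as simple as zeroing the last octet before anything is stored — a minimal sketch (I’m assuming IPv4 and making up the function name; this isn’t kukei’s actual code):

```javascript
// Zero the last octet of an IPv4 address so individual visitors
// can't be identified, while keeping coarse network-level info.
function anonymizeIPv4(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) throw new Error("not an IPv4 address");
  parts[3] = "0";
  return parts.join(".");
}

console.log(anonymizeIPv4("203.0.113.42")); // "203.0.113.0"
```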
Can you add links to each section at the top, so you don’t have to scroll past the ones you might not be interested in?
Another person told me the same in real life. Adding it to the backlog!
It’s a good start. I’m curious why you didn’t include a section for social media like StackOverflow or Reddit. If I go to Google with a question, it’s usually for an edge case not covered by the documentation. Maybe add them as a section at the bottom to indicate that they might be less relevant?
Also, this might just be a web developer thing, but why include blogs? Almost all coding blogs I’ve seen are SEO cancer that just copy from the documentation or each other. Are there actually useful blogs out there that I’ve just been missing?
- SO and Reddit are on the TODO list. It even had SO (at the bottom, indeed) once, but not via crawling — via the SO Search API. It had very poor quality results and was super slow, so I had to remove it while thinking of a better solution. Crawling all of SO might be a little too much for this project at this stage, though, but if I have enough courage and hours at night I might parse that 20 GB Stack Overflow archive dump and try doing something useful with it.
Same for Reddit, but here I have mixed feelings about it in general and hope it’s going to die soon, replaced by amazing Lemmy communities.
I also used to type a question and append “reddit” in Google to get good quality content, but here with kukei the experiment is whether the blogosphere can replace it properly when the index promotes it.
- Why blogs?
This is my main thing: to promote good quality blogs that I tried to follow via RSS but somehow never did. Having them all indexed (and more — some Mastodon community gave me amazing links to index) makes me actually visit them often.
As for the “SEO cancer”, that’s where curation comes into play. Before crawling, I check blogs unknown to me and decide whether something goes in or not.
That makes sense. I really like that the documentation is right at the top; many times all I want to do is find the right page in the official docs. You might want to look at how results are prioritized though: right now when I search for something simple like “how to center a div”, that result from Mozilla’s docs is included but it’s hidden as the second or third result. I would expect the page that’s explicitly about centering a div to be the top result, followed by the docs page for the element itself and maybe pages for flex or grid or something. That’s a really simple example, so maybe it’s not the target of this project, but I would still hope that simple topics are covered just as well as complex ones.
EDIT: I was a bit mistaken: “how to center a div” does bring up the Mozilla documentation for centering an element, but “center a div” brings up a page about accessibility as the top result.
I really like the simple design that separates the results into docs/blogs/magazines. Obviously, the results reflect the current state, but I appreciate your approach in both the design & sourcing the search results. I think there’s a lot of potential for this to be a regular part of my toolbox, hopefully this takes off!
Thanks for the kind words!
Looks cool, I think I’ll add it to Firefox and use it sometimes.
Thanks! If you have suggestions in the future, I’m always open to hearing them.
How is it specifically dev focused? How will the crawler know that the site or page is dev related?
The crawler takes only the sources that are defined in the crawler repo (it’s open source — check the GitHub org, or kukei-spider).
So in this way it’s “curated”, in the sense that it won’t add anything else to the index.
Oh, what an interesting idea! I like this, on Monday I’ll test out switching to this as my main search engine for work and try to report back how it goes!
Thanks but don’t expect too much yet. Many sources are still missing. If you notice something should be there but it’s not even being crawled feel free to reach me one Mastodon or add it directly via PR here: https://github.com/Kukei-eu/spider/blob/main/index-sources.js