Language Support for Marginalia Search

(marginalia.nu)

82 points | by Bogdanp 3 hours ago

4 comments

  • ofalkaed 2 hours ago
    Surprisingly informative for what is pretty much a press release, learned a good deal about search engines.
    • marginalia_nu 2 hours ago
      (author)

      I'm kinda allergic to writing "I did the thing" posts, so I can't help but tryhard and attempt to make them compelling somehow.

      Writing in this manner is also very helpful in making sense of the work for myself. Takes a better understanding of the subject to thoroughly explain what you've built than to merely build it. Sometimes I've gone back and read through one of these updates to just get a refresher on what my thinking was when I built something.

      • ofalkaed 2 hours ago
        In my experience, that is pretty much what Marginalia Search is. I rarely get what I expect, but I always get something very interesting that helps me understand my expectations better, which is very helpful in accomplishing my goals. Thanks for your work; Marginalia is probably my favorite little corner of the web.
  • mariusor 1 hour ago
    Off topic, but would there be a way to integrate Marginalia with a specific website? Similar to how people use Google search for their forums, or how HN uses Algolia?

    I'm asking because one of my projects is a link aggregator similar to old Reddit (and HN to some extent), and I would like to present users with a search box without having to implement document indexing and search myself. (I assume a priori that the website is already aligned ethically and technologically with what Marginalia stands for :D)

    • marginalia_nu 1 hour ago
      Should be soon-ish. I'm working right now on laying the groundwork for ad-hoc domain filters. That's technically already possible, but it comes with such a large performance impact that it degrades the search results.

      When it works, one of the things I have in mind is making site-search-style functionality available, as well as exposing it via the public API so that it can be white-labeled.
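
To make the site-search idea concrete, here is a minimal sketch of restricting a list of result URLs to a set of domains. It is not Marginalia's implementation or API: `host_of`, `matches_domain`, and `filter_by_domains` are hypothetical names, the URLs are made-up examples, and the sketch filters results after the fact instead of inside the index.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def matches_domain(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def filter_by_domains(urls, allowed_domains):
    """Keep only URLs whose host falls under one of the allowed domains."""
    allowed = [d.lower() for d in allowed_domains]
    return [
        u for u in urls
        if any(matches_domain(host_of(u), d) for d in allowed)
    ]


# Hypothetical result list, for illustration only.
results = [
    "https://www.marginalia.nu/log/",
    "https://example.com/post/1",
    "https://forum.example.com/thread/42",
    "https://notexample.com/",
]
site_results = filter_by_domains(results, ["example.com"])
```

Matching on a leading dot keeps `notexample.com` from passing as `example.com`, while subdomains such as `forum.example.com` still match.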

  • reedf1 2 hours ago
    Took me too long to realize this wasn't a tool to search for marginalia in scanned manuscripts.
  • internet_points 1 hour ago
    What tools/data do you use for POS tagging? I'm guessing it has to be fast, to run without a Google data center :)
    • marginalia_nu 1 hour ago
      I'm using RDRPOSTagger[1], though I've optimized the code a bit so that it's not just algorithmically efficient but also uses the language in a way that is fast. It isn't perfect, but it's good enough to be useful.

      Language detection and sentence splitting are the other two slow bits of processing.

      [1] https://github.com/datquocnguyen/RDRPOSTagger