Goal: better, more focused search for www.cali.org.
In general, the plan is to scrape the site into a vector database, wire the vector DB's embeddings into Llama 2, and provide API endpoints to search/find things.
Hints and pointers.
- Llama2-webui – Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere
- FastAPI – web framework for building APIs with Python 3.7+ based on standard Python type hints
- Danswer – Ask Questions in natural language and get Answers backed by private sources. It makes use of:
  - PostgreSQL – a powerful, open source object-relational database system
  - QDrant – a vector database for the next generation of AI applications
  - Typesense – a modern, privacy-friendly, open source search engine built from the ground up using cutting-edge search algorithms that take advantage of the latest advances in hardware capabilities
The challenge is to wire together these technologies and then figure out how to get it to play nice with Drupal. One possibility is just to build this with an API and then use the API to interact with Drupal. That approach also offers the possibility of allowing the membership to interact with the API too.
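As a rough sketch of the retrieval piece, the scrape–embed–search loop might look like the toy below. To keep it self-contained there's no real Qdrant or Llama 2 here: the embed() function is a hypothetical stand-in for an actual embedding model, and the in-memory list stands in for the vector database.

```python
import math

# Toy stand-in for a real embedding model (e.g. one served alongside Llama 2).
# Here we just count letter frequencies so the example runs with no dependencies.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Scrape" step: in the real plan these chunks would come from www.cali.org
# and be upserted into Qdrant; here they live in a plain list. URLs are invented.
index = [
    {"url": "/lessons/contracts", "text": "CALI lessons on contract law"},
    {"url": "/lessons/torts", "text": "CALI lessons covering torts"},
    {"url": "/conference", "text": "CALIcon conference schedule and sessions"},
]
for doc in index:
    doc["vector"] = embed(doc["text"])

def search(query: str, top_k: int = 2) -> list[str]:
    """Rank stored chunks by cosine similarity to the query embedding."""
    qv = embed(query)
    ranked = sorted(index, key=lambda d: cosine(qv, d["vector"]), reverse=True)
    return [d["url"] for d in ranked[:top_k]]
```

Behind a FastAPI app, search() would just be wrapped in a GET handler, which is also the natural place to hang the member-facing API mentioned above.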
Learn how to train a machine learning model to rank documents retrieved in the Solr enterprise search platform.
Looks like I may need to blow dust off the millions of opinions I’ve got sitting in that laptop up in my office.
A look at the Elastic stack, a versatile collection of open source software tools that make gathering insights from data easier. Learn the capabilities, requirements, and interesting use cases that apply to each.
Source: Overview of the Elastic Stack, open source software tools for data insights | Opensource.com
Apache Solr 5.2.0 and Reference Guide for 5.2 available https://lucene.apache.org/solr/news.html
Solr is an open source search server based on Apache Lucene. Lucene provides Java-based indexing and a search library, and Solr extends it to provide a variety of APIs and search functionality, including faceted search and hit highlighting, and handles Word and PDF document searching. It also provides caching and replication, making it scalable, robust, and very fast.
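Since Solr exposes all of this over HTTP, a faceted, highlighted query is just a GET request with the right parameters. A quick sketch of building one (the core name and field names here are invented for illustration):

```python
from urllib.parse import urlencode

# Hypothetical Solr core ("collection1") and field names, for illustration only.
base = "http://localhost:8983/solr/collection1/select"
params = {
    "q": "contracts",        # full-text query handled by Lucene
    "facet": "true",         # enable faceted search
    "facet.field": "court",  # bucket result counts by a "court" field
    "hl": "true",            # enable hit highlighting
    "hl.fl": "body",         # highlight matches in the "body" field
    "wt": "json",            # response format
}
query_url = base + "?" + urlencode(params)
```

Fetching query_url returns the matching documents plus facet counts and highlighted snippets in one JSON response.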
Happily, Solr also plays nicely with Drupal, the popular CMS platform. If you want fast and effective search on your Drupal site, installing Solr is a straightforward way of getting it quickly. Until this month, the Apachesolr Drupal module didn't support the current Solr 4.x schemas, but as of the very latest version of the Apachesolr module, 7.x-1.2, you can now set up Solr 4.x on your Drupal 7 site. This tutorial assumes that you're running Drupal 7.22, the most up-to-date version, under Apache on a Linux box.
via How to set up Solr 4.2 on Drupal 7 with Apache.
If you're running Drupal and have a lot of nodes to index and you're not using Solr, you're missing out on a lot. Though it takes a bit of configuration to set up, using Solr to index and search your Drupal site is much better than the stock Drupal search.
Search at Etsy has grown significantly over the years. In January of 2009 we started using Solr for search. We used the standard master-slave configuration for our search servers with replication.
All of the changes to the search index are written to the master server. The slaves are read-only copies of master which serve production traffic. The search index is replicated by copying files from the master server to the slave servers. The slave servers poll the master server for updates, and when there are changes to the search index the slave servers will download the changes via HTTP. Our search indexes have grown from 2 GB to over 28 GB over the past 2 years, and copying the index from the master to the slave nodes became a problem.
Brilliant use of BitTorrent to solve a difficult problem.
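The pull model Etsy describes, where each slave compares the master's index files against its own and downloads only what changed, can be sketched like this. It's a simplified illustration of the idea, not Solr's actual ReplicationHandler protocol; the file names and checksums are made up.

```python
# Toy model of pull replication: each index is a mapping of
# file name -> checksum, as a master might report over HTTP.
def files_to_fetch(master: dict[str, str], slave: dict[str, str]) -> list[str]:
    """Return index files the slave is missing or whose checksum differs."""
    return sorted(
        name for name, checksum in master.items()
        if slave.get(name) != checksum
    )

# Invented example indexes: the slave already has _0.cfs but lags behind.
master_index = {"_0.cfs": "abc1", "_1.cfs": "def2", "segments_2": "ff09"}
slave_index = {"_0.cfs": "abc1", "segments_1": "aa00"}
```

With a 28 GB index, the expensive part is shipping those changed files to every slave over HTTP, which is the step Etsy moved to BitTorrent.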
A mention in the BeSpecific blog tipped me off to an interesting project called CourtListener.com. From the about page:
The goal of the site is to create a free and competitive real time alert tool for the U.S. judicial system.
At present, the site has daily information regarding all precedential opinions issued by the 13 federal circuit courts and the Supreme Court of the United States. Each day, we also have the non-precedential opinions from all of the Circuit courts except the D.C. Circuit. This means that by 5:10pm PST, the database will be updated with the opinions of the day, with custom alerts going out shortly thereafter.
The site was created by Michael Lissner as a Master's thesis project at the UC Berkeley School of Information.
A quick perusal of the site and its associated documents tells us that Michael is using a scraping technique to visit court websites looking for recently released opinions. Once found, the opinions are retrieved, converted from PDF to text, indexed, and stored. Atom feeds are then generated to provide current alerts.
The site is powered by Python using the Django web framework and is open source, so you can download the code. The backend database is MySQL and search is handled by Sphinx. The PDF conversion appears to produce plain text. If you register on the site you can create custom alerts based on saved searches.
All in all CourtListener.com provides another good source for current Federal appellate court opinions. Be sure to check the coverage page to see how far back the site goes for each court. Perhaps the future will bring an expansion to more courts and jurisdictions.
But it turns out that you can’t easily do such searches in Google any more. Google has become a jungle: a tropical paradise for spammers and marketers. Almost every search takes you to websites that want you to click on links that make them money, or to sponsored sites that make Google money. There’s no way to do a meaningful chronological search.
Why We Desperately Need a New (and Better) Google.
Article highlights the failings of Google when it comes to finding plain old information. If you're just looking for information, Google may not be your best bet. I mean, how many times is a random ad going to ask me if I want to buy “Drupal API load_node()”?
Wadhwa suggests an alternative in Blekko, a search tool that lets you use “slashtags” to refine your own searches. Indeed, alternative tools for finding information are beginning to appear, and perhaps the threat of competition will move Google to clean up its spam-ridden indexes.