Glen Pitt-Pladdy :: Blog

Solr on Debian with PHP
For a long time I've been using Google Custom Search Engine (CSE) on this blog, but it has a number of annoyances. It is yet another bunch of things (mainly JavaScript) that has to run on page load and slows the page down a lot, the search is not always as sharp as I would like, and I find it often triggers a load of warnings in browser privacy add-ons.

Sites having their own search is also useful in that it can be more tailored to the site, without the need to be paranoid within a closed system as public search engines have to be. There is really no point in trying to game the search on your own blog :-)

This is a quick run through of how I got things working, which will hopefully be useful to others with similar applications.

Solr vs Elastic Search

These are two relatives (both Lucene derived), but in most cases people go with Elastic Search due to scalability and generally following the herd. For me the practical thing is that my scale is not (and likely never will be) large enough to warrant Elastic Search, and Solr is a lot easier to work with at small scale: the php5-solr package in Debian means no messing about trying to keep libraries up to date.

Solr on Debian

I've opted for the Jetty option (solr-jetty) as opposed to Tomcat (solr-tomcat), as it seems to be the preferred way of running Solr. I've installed the solr-jetty package and removed all the unnecessary "Recommended" stuff like GTK/Gnome related libraries which Java wants. These are not going to be used in this case.

After install, edit /etc/default/jetty8 and enable it to start:

NO_START=0

Then restart Jetty (systemd thinks it's already running, so just starting isn't enough) with:

# systemctl restart jetty8.service

At that point you should be able to take a browser (a text one like w3m, elinks, lynx etc. should do), point it at http://localhost:8080/solr/ and get to the Solr admin page.
Configuration / Application side of things

This is easy - just install php5-solr to get the PHP class. Before we actually start coding, we need to look at the schema. Out of the box Solr has an example schema which is huge and covers just about everything, but for practicality (and performance) a lighter, purpose-built schema is worth looking at first.

Solr Schema

In my case I really only want to be searching text with the standard parts of a web page. Meta fields are often ignored by search engines because they can contain misleading information; however, as I keep mine clean (and as above, why would I want to game my own search?) I'll be including them in my index. I've heavily pruned /etc/solr/conf/schema.xml and ended up with the fields:

....

In many cases, this one included, it's actually best to search multiple fields together. This can be accomplished in multiple ways, but the most sensible is normally to use "copyField". In this case we can use one of our fields (or create a new one) to hold the values of several other fields for searching. This is more efficient than searching multiple fields individually. We end up with something like:

....

This copies these fields into the SearchText field, which is indexed but not stored (we don't need to retrieve the data back). Also note that this field is multiValued - without this you can't copy all the different values into the field.

Other Config

Firstly, look at /etc/solr/conf/elevate.xml and comment out the example <query> blocks, leaving the <elevate> block, unless you want them, in which case update them to match your schema. If you don't, you might get errors relating to mismatches of these fields.

Secondly, look at /etc/solr/conf/solrconfig.xml and find this section:

<requestHandler name="/select" class="solr.SearchHandler">

This section sets the default field that will be queried; if you don't have a field named "Text" (case sensitive), or you need a different field, then you had better change this.
Also note that in the example above using copyField, we would actually query the "SearchText" field, which aggregates all the fields we want to search together.

At that point you are good to restart Solr (via Jetty) with:

# systemctl restart jetty8.service
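As a rough illustration of the configuration changes described above, here is a sketch of what a pruned schema and the default-field setting might look like. This is not the schema from this post: SearchText and LastUpdated are the only field names taken from the text, while ID, Title, Description, Keywords and Content are hypothetical names chosen for the example. It also assumes the string, long and text_general field types from the stock example schema have been kept.

```xml
<!-- schema.xml (fragment). Hypothetical field names except SearchText and LastUpdated. -->
<fields>
  <field name="ID"          type="string"       indexed="true" stored="true" required="true"/>
  <field name="Title"       type="text_general" indexed="true" stored="true"/>
  <field name="Description" type="text_general" indexed="true" stored="true"/>
  <field name="Keywords"    type="text_general" indexed="true" stored="true"/>
  <field name="Content"     type="text_general" indexed="true" stored="false"/>
  <!-- Unix epoch of the last (re)index, used to find the oldest entries -->
  <field name="LastUpdated" type="long"         indexed="true" stored="true"/>
  <!-- Aggregate search field: indexed, not stored, and multiValued so it can take several copies -->
  <field name="SearchText"  type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>ID</uniqueKey>

<copyField source="Title"       dest="SearchText"/>
<copyField source="Description" dest="SearchText"/>
<copyField source="Keywords"    dest="SearchText"/>
<copyField source="Content"     dest="SearchText"/>
```

```xml
<!-- solrconfig.xml (fragment): point the default query field at the aggregate field -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">SearchText</str>
  </lst>
</requestHandler>
```

With something like this in place, a plain query with no field prefix is matched against SearchText, and so against every field copied into it.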
PHP Application

Getting setup

I'm using php5-solr, but after installing it you will need to enable it to make the classes available to applications:

# php5enmod solr

OR

# php5enmod solr

... depending on how you run your PHP. In my case I run both for test & dev environments to ensure compatibility and coding discipline. At this point we should be able to start coding.

Maintaining an Index

It's worth putting some thought into this. I have a load of existing data to index, and I've chosen to have the regular periodic maintenance routine automatically update the index like a conventional search engine. In reality, I will be adding calls to add/remove articles from the index when I use the editor, so that on publishing/updating an article it's automatically reindexed. This will make the auto-indexing code obsolete, but it serves its purpose as a transitional tool.

There are two parts to keeping the index up to date:

Updating the index for changes or new articles in the database. This is accomplished with a column in a SQL table for the Last Index Update time, stored for convenience as a Unix Epoch so it's easy to do numeric comparisons.

Updating the index for removed / changed articles in the database. This is a bit trickier, as if an article is removed the index entry remains (for now anyway, until I get editor based updating sorted). I'm then querying Solr for IDs based on the oldest LastUpdated field in Solr, which again is a Unix Epoch.

PHP Code - Adding/Updating Data

Adding data with the same value in the field used as the "uniqueKey" replaces the index entry that is already there.

function solradd ( $id, $info ) {

Basically you addField everything you want to add to a SolrInputDocument and then addDocument it into Solr. If you need authentication etc. then you need to put that in the SolrClient line.
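The add/update step described above can be sketched roughly as follows. This is not the original function from this post, just a minimal illustration under some assumptions: the connection settings (localhost, port 8080, path "solr"), the uniqueKey field name ID and the shape of $info (field name => value, or array of values) are all hypothetical. LastUpdated is the epoch field mentioned earlier.

```php
<?php
// Hypothetical connection settings; adjust to match your own Jetty/Solr install.
function solr_client_sketch()
{
    return new SolrClient(array(
        'hostname' => 'localhost',
        'port'     => 8080,
        'path'     => 'solr',
        // Authentication options ('login', 'password') would also go here if needed.
    ));
}

// Pure helper: flatten an article into (field, value) pairs.
// "ID" is an assumed uniqueKey name; LastUpdated is stored as a Unix epoch.
function solr_field_pairs($id, array $info, $now)
{
    $pairs = array(
        array('ID', (string) $id),
        array('LastUpdated', (string) $now),
    );
    foreach ($info as $field => $value) {
        // An array value becomes several values of the same field.
        foreach ((array) $value as $v) {
            $pairs[] = array($field, (string) $v);
        }
    }
    return $pairs;
}

// Add an article, or replace it if a document with the same ID already exists.
function solradd_sketch($id, array $info)
{
    $doc = new SolrInputDocument();
    foreach (solr_field_pairs($id, $info, time()) as $pair) {
        $doc->addField($pair[0], $pair[1]);
    }
    $client = solr_client_sketch();
    $client->addDocument($doc);
    $client->commit(); // make the change visible to searches
}
```

A call such as solradd_sketch(42, array('Title' => 'Solr on Debian with PHP', 'Keywords' => array('solr', 'debian'))) would then index (or re-index) article 42. Committing after every document is simple but slow for bulk loads; for a large initial import it may be better to commit once at the end.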
PHP Code - Removing Data

Much simpler - we can remove data by ID like this:

function solrdel ( $id ) {

PHP Code - Querying oldest indexes

This allows us to check for orphaned articles in the index:

function solrgetoldest () {

In this case we return the 10 oldest indexed articles. If they are being regularly re-indexed then any orphans will eventually end up as the oldest indexed items.

PHP Code - Searching

There's not a lot of point in a search engine without the ability to search. This is a basic search, returning the top 10 hits:

function solrsearch ( $search, $page=1 ) {

You can addField more fields if you want more info returned, plus a load of other information.
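For illustration, the three operations above (remove by ID, fetch the oldest entries, search) might be sketched like this with the php5-solr classes. Again, this is not the original code from this post: the field names ID and Title are hypothetical, the functions take an already constructed SolrClient as a parameter, and the search term is escaped wholesale, which is an assumption that trades away Lucene query syntax for safety with raw user input.

```php
<?php
// Pure helper: convert a 1-based page number into a Solr "start" offset.
function solr_page_start($page, $rows = 10)
{
    return (max(1, (int) $page) - 1) * $rows;
}

// Pull the matching documents out of a query response (empty array if none).
function solr_docs(SolrQueryResponse $result)
{
    $response = $result->getResponse();
    $docs = $response->response->docs;
    return is_array($docs) ? $docs : array();
}

// Remove one article from the index by its uniqueKey value.
function solrdel_sketch(SolrClient $client, $id)
{
    $client->deleteById((string) $id);
    $client->commit();
}

// Return the 10 entries with the smallest LastUpdated epoch (orphan candidates).
function solrgetoldest_sketch(SolrClient $client)
{
    $query = new SolrQuery('*:*');
    $query->addSortField('LastUpdated', SolrQuery::ORDER_ASC);
    $query->addField('ID');
    $query->addField('LastUpdated');
    $query->setRows(10);
    return solr_docs($client->query($query));
}

// Basic search against the default field, 10 hits per page.
function solrsearch_sketch(SolrClient $client, $search, $page = 1)
{
    $query = new SolrQuery();
    $query->setQuery(SolrUtils::escapeQueryChars($search));
    $query->setStart(solr_page_start($page));
    $query->setRows(10);
    $query->addField('ID');
    $query->addField('Title');
    return solr_docs($client->query($query));
}
```

Each oldest entry can then be checked against the database: if the article no longer exists, its ID goes to solrdel_sketch(); otherwise it is simply re-indexed, which moves it to the back of the queue. An empty search string should be rejected before calling solrsearch_sketch().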
Copyright Glen Pitt-Pladdy 2008-2023