Glen Pitt-Pladdy :: Blog

Solr on Debian with PHP

For a long time I've been using Google Custom Search Engine (CSE) on this blog, but it has a number of annoyances. Besides being yet another bunch of things (mainly Javascript) that has to run on page load and slows it down a lot, the search is not always as sharp as I would like, and I find it often triggers a load of warnings in browser privacy add-ons.

Sites having their own search is also useful in that it can be more tailored to the site without the need to be paranoid within a closed system as public search engines have to be. There is really no point in trying to game the search on your own blog :-)

This is a quick run-through of how I got things working, which will hopefully be useful to others with similar applications.

Solr vs Elasticsearch

These are two relatives (both Lucene derived), but in most cases people go with Elasticsearch due to scalability and generally following the herd. For me the practical thing is that my scale is not (and likely never will be) large enough to warrant Elasticsearch, and Solr is a lot easier to work with on a small scale: the php5-solr package in Debian means no messing about trying to keep libraries up to date.

Solr on Debian

I've opted for the Jetty option (solr-jetty) as opposed to Tomcat (solr-tomcat) as it seems to be the preferred way of running Solr.

I've installed the solr-jetty package and removed all the unnecessary "Recommended" stuff like GTK/Gnome related libraries which Java wants. These are not going to be used in this case.

After install edit /etc/default/jetty8 and enable it to start:

NO_START=0

Then restart Jetty (systemd thinks it's already running so just starting isn't enough) with:

# systemctl restart jetty8.service

At that point you should be able to point a browser (a text one like w3m, elinks, lynx etc. should do) at http://localhost:8080/solr/ and get to the Solr admin page.

(Screenshot: Solr Admin Page in w3m)


Configuration / Application side of things

This is easy - just install php5-solr to get the PHP classes. Before we actually start coding, we need to look at the schema. Out of the box Solr has an example schema which is huge and covers just about everything, but for practicality (and performance) a lighter, purpose-built schema is worth looking at first.

Solr Schema

In my case I really only want to be searching text with the standard parts of a web page. Meta fields are often ignored by search engines because they can contain misleading information. However, as I keep mine clean (and as above, why would I want to game my own search?) I'll be including them in my index.

I've heavily pruned /etc/solr/conf/schema.xml and ended up with the fields:

....
 <fields>
   <field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="Title" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="Description" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="Keywords" type="text_en" indexed="true" required="false" multiValued="true" />
   <field name="Text" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="LastIndexed" type="int" indexed="true" stored="true" required="true" multiValued="false" />
 </fields>
 <uniqueKey>id</uniqueKey>
....

In many cases, this included, it's actually best to search multiple fields together. This can be accomplished in multiple ways, but the most sensible is normally to use "copyField". In this case we can use one of our fields (or create a new one) for holding multiple values of other fields for searching. This is more efficient than searching multiple fields individually. We end up with something like:

....
 <fields>
   <field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="Title" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="Description" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="Keywords" type="text_en" indexed="true" required="false" multiValued="true" />
   <field name="Text" type="text_en" indexed="true" required="true" multiValued="false" />
   <field name="LastIndexed" type="int" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="SearchText" type="text_en" stored="false" indexed="true" required="true" multiValued="true" />
</fields>
 <copyField source="Title" dest="SearchText" />
 <copyField source="Keywords" dest="SearchText" />
 <copyField source="Description" dest="SearchText" />
 <copyField source="Text" dest="SearchText" />
 <uniqueKey>id</uniqueKey>
....

This copies these fields into the SearchText field which is indexed but not stored (we don't want to retrieve the data back). Also note that this field is multiValued - without this you can't copy all the different values into the field.

Other Config

Firstly, look at /etc/solr/conf/elevate.xml and comment out the example <query> blocks, leaving the <elevate> block in place. If you do want them, update them to match your schema instead. Otherwise you might get errors relating to mismatches of these fields.

Secondly, look at /etc/solr/conf/solrconfig.xml and find this section:

  <requestHandler name="/select" class="solr.SearchHandler">
....
       <str name="df">Text</str>
....

This is the default field that will be queried. If you don't have a field named "Text" (case sensitive), or you need a different field, then change this.

Also note that in the example above using copyField, we would actually query the "SearchText" field which aggregates all the fields we want to search together.

At that point you are good to restart Solr (via Jetty) with:

# systemctl restart jetty8.service

 

PHP Application

Getting set up

I'm using php5-solr, but after installing it you will need to enable it to make the classes available to applications:

# php5enmod solr
# systemctl reload apache2.service

OR

# php5enmod solr
# systemctl reload php5-fpm.service

... depending on how you run your PHP. In my case I run both for test & dev environments to ensure compatibility and coding discipline.

At this point we should be able to start coding.

Maintaining an Index

It's worth putting some thought into this. I have a load of existing data to index, and I've chosen to have the regular periodic maintenance routine automatically update the index like a conventional search engine.

In reality, I will be adding calls to add/remove articles from the index when I use the editor, so that on publishing/updating an article it's automatically reindexed. This will make the auto-indexing code obsolete, but it serves its purpose as a transitional tool.

There are two parts to keeping the index up to date:

Updating the index for changes or new articles in the database. This is accomplished with a column in a SQL table holding the Last Index Update time, stored as a Unix Epoch so that numeric comparisons are easy.

Updating the index for removed / changed articles in the database. This is a bit trickier, as when an article is removed its entry remains in the index (for now anyway, until I get editor-based updating sorted). To catch these I query Solr for the IDs with the oldest LastIndexed values, which again is a Unix Epoch.
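The first part comes down to comparing two epoch values per article. This is a minimal sketch, not the code this blog runs: the row keys 'Modified' and 'LastIndexUpdate' are hypothetical names for the article's modification time and the Last Index Update column, and the rows are assumed to have been fetched from SQL already.

```php
<?php
// Sketch only: 'Modified' and 'LastIndexUpdate' are hypothetical column names,
// both holding Unix Epoch timestamps.
function articles_needing_reindex ( array $rows ) {
    $stale = array ();
    foreach ( $rows as $row ) {
        // never indexed (NULL/0) or changed since the last index update
        if ( empty ( $row['LastIndexUpdate'] ) || $row['Modified'] > $row['LastIndexUpdate'] ) {
            $stale[] = $row['id'];
        }
    }
    return $stale;
}

// sample rows standing in for a SQL result set
$rows = array (
    array ( 'id' => 1, 'Modified' => 1490000000, 'LastIndexUpdate' => 1490000500 ),
    array ( 'id' => 2, 'Modified' => 1491000000, 'LastIndexUpdate' => 1490000500 ),
    array ( 'id' => 3, 'Modified' => 1492000000, 'LastIndexUpdate' => null ),
);
$stale = articles_needing_reindex ( $rows ); // articles 2 and 3 need reindexing
```

Each returned ID would then be re-added to Solr and its Last Index Update time set in SQL.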

PHP Code - Adding/Updating Data

Adding a document whose "uniqueKey" field matches one already in the index replaces the existing document.

    function solradd ( $id, $info ) {
        $solrclient = new SolrClient ( array ( 'hostname' => 'localhost', 'port' => 8080 ) );
        $doc = new SolrInputDocument ();
        $doc->addField ( 'id', $id );
        $doc->addField ( 'Title', $info['Title'] );
        $doc->addField ( 'Description', $info['Description'] );
        // Keywords is multiValued, so add one value per keyword
        foreach ( $info['Keywords'] as $keyword ) {
            $doc->addField ( 'Keywords', $keyword );
        }
        $doc->addField ( 'Text', $info['Text'] );
        $doc->addField ( 'LastIndexed', time() );
        // $solrclient is a local variable here, not an object property
        $solrclient->addDocument ( $doc )->getResponse ();
        $solrclient->commit ();
        // maybe do something with returns from the above lines to verify success
        // if maintaining a Last Updated time in the SQL database then put it here
    }

Basically you addfield everything you want to add to a SolrInputDocument and then addDocument it into Solr. If you need authentication etc. then you need to put that in the SolrClient line.
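For authentication, the SolrClient constructor accepts further keys in the same options array, including 'login' and 'password' for HTTP authentication, plus 'path' and 'secure'. A minimal sketch of building that array follows; the helper name solroptions and the credentials are placeholders for illustration only.

```php
<?php
// Sketch only: solroptions() is a hypothetical helper, credentials are placeholders.
function solroptions ( $login = null, $password = null ) {
    $options = array (
        'hostname' => 'localhost',
        'port'     => 8080,
        'path'     => 'solr',   // matches http://localhost:8080/solr/
    );
    if ( $login !== null ) {
        // HTTP authentication credentials
        $options['login']    = $login;
        $options['password'] = $password;
    }
    return $options;
}

$options = solroptions ( 'indexer', 'secret' );
$plain = solroptions ();
// With php5-solr enabled: $solrclient = new SolrClient ( $options );
```

Keeping the options in one place also avoids repeating the hostname and port in every function.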

PHP Code - Removing Data

Much simpler - we can remove data by ID like this:

    function solrdel ( $id ) {
        $solrclient = new SolrClient ( array ( 'hostname' => 'localhost', 'port' => 8080 ) );
        $solrclient->deleteById ( $id );
        // $solrclient is a local variable here, not an object property
        $solrclient->commit ();
        // maybe do something with returns from the above lines to verify success
    }

PHP Code - Querying oldest indexes

This allows us to check for orphaned articles in the index:

    function solrgetoldest () {
        $solrclient = new SolrClient ( array ( 'hostname' => 'localhost', 'port' => 8080 ) );
        $query = new SolrQuery ();
        $query->setQuery ( '*:*' ); // match every document regardless of the default field
        $query->setStart ( 0 );
        $query->setRows ( 10 ); // we will only take 10 oldest records per run
        $query->addField ( 'id' )->addField ( 'LastIndexed' );
        $query->addSortField ( 'LastIndexed', SolrQuery::ORDER_ASC );
        $response = $solrclient->query($query)->getResponse();
        return $response['response']['docs'];
    }

In this case we return the 10 oldest indexed articles. If they are being regularly re-indexed then any orphans will eventually end up as the oldest indexed items.
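Deciding which of those documents are orphans then comes down to comparing their IDs with the IDs that still exist in the database. A minimal sketch, assuming the existing IDs have already been fetched from SQL into a plain array; solrorphans is a hypothetical helper, and the sample documents stand in for what solrgetoldest() returns.

```php
<?php
// Sketch only: solrorphans() is a hypothetical helper name.
// $docs: documents as returned by solrgetoldest(), each with an 'id' field
// $existingids: article IDs still present in the database
function solrorphans ( $docs, array $existingids ) {
    $known = array_flip ( $existingids ); // ID => position, for fast isset() lookups
    $orphans = array ();
    foreach ( $docs as $doc ) {
        if ( ! isset ( $known[$doc['id']] ) ) {
            $orphans[] = $doc['id'];
        }
    }
    return $orphans;
}

// sample data standing in for a Solr response and a SQL query
$docs = array (
    array ( 'id' => 12, 'LastIndexed' => 1490000000 ),
    array ( 'id' => 15, 'LastIndexed' => 1490000100 ),
);
$orphans = solrorphans ( $docs, array ( 12, 13, 14 ) ); // article 15 no longer exists
```

Each orphan could then be passed to solrdel(), while the remaining documents are re-indexed so they move away from the oldest end of the list.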

PHP Code - Searching

There's not a lot of point in a search engine without the ability to search. This is a basic search, returning the top 10 hits:

    function solrsearch ( $search, $page=1 ) {
        $solrclient = new SolrClient ( array ( 'hostname' => 'localhost', 'port' => 8080 ) );
        $query = new SolrQuery ();
        $query->setQuery ( $search );
        $query->setStart ( ( max ( 1, (int) $page ) - 1 ) * 10 ); // skip earlier pages, 10 rows per page
        $query->setRows ( 10 );
        $query->addField('Title')->addField('id');
        $response = $solrclient->query($query)->getResponse();
        return $response;
    }

You can addField more fields if you want more info returned. The response also carries other information, such as the total number of matches (numFound) alongside the returned docs.
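As a sketch of what can be done with the response, the function below turns the returned documents into an HTML list. It is illustrative only: solrresultlist is a hypothetical helper and the /article?id= link pattern is a placeholder rather than this blog's real URL scheme. The sample array mimics the 'response' part of what solrsearch() returns.

```php
<?php
// Sketch only: solrresultlist() and the '/article?id=' URL are placeholders.
function solrresultlist ( $result ) {
    $html = '<p>' . (int) $result['numFound'] . " matches</p>\n<ul>\n";
    foreach ( $result['docs'] as $doc ) {
        // escape everything that ends up in the page
        $html .= '<li><a href="/article?id=' . (int) $doc['id'] . '">'
            . htmlspecialchars ( $doc['Title'] ) . "</a></li>\n";
    }
    return $html . "</ul>\n";
}

// stand-in for $response['response'] from solrsearch()
$sample = array (
    'numFound' => 1,
    'docs' => array ( array ( 'id' => 7, 'Title' => 'Solr & PHP' ) ),
);
$list = solrresultlist ( $sample );
```

Casting the ID to an integer and escaping the title keeps indexed content from injecting markup into the results page.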

 
