Saturday, February 26, 2011

Solr: a custom Search RequestHandler

As you know, I've been playing with Solr lately, trying to see how feasible it would be to customize it for our needs. We have been a Lucene shop for a while, and we've built our own search framework around it, which has served us well so far. The rationale for moving to Solr is driven primarily by the need to expose our search tier as a service for our internal applications. While it would have been relatively simple (probably simpler) to slap on an HTTP interface over our current search tier, we also want to use the other Solr features such as incremental indexing and replication.

One of our challenges to using Solr is that the way we do search is quite different from the way Solr does search. A query string passed to the default Solr search handler is parsed into a Lucene query and a single search call is made on the underlying index. In our case, the query string is passed to our taxonomy, and depending on the type of query (as identified by the taxonomy), it is sent through one or more sub-handlers. Each sub-handler converts the query into a (different) Lucene query and executes the search against the underlying index. The results from each sub-handler are then layered together to present the final search result.
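To make this flow concrete, here is a minimal, self-contained sketch of the dispatch idea. The query types, classification rules and sub-handler names below are all hypothetical stand-ins for our proprietary taxonomy; the real classification is far richer.

```java
import java.util.*;

public class Dispatch {

  // Hypothetical query categories the taxonomy might identify.
  enum QueryType { DISEASE, DRUG, GENERAL }

  // The taxonomy maps a raw query string to a type; stubbed here
  // with trivial keyword checks.
  static QueryType classify(String q) {
    if (q.contains("aspirin")) return QueryType.DRUG;
    if (q.contains("diabetes")) return QueryType.DISEASE;
    return QueryType.GENERAL;
  }

  // Each type triggers a different ordered list of sub-handlers,
  // whose results are later layered together.
  static List<String> subHandlersFor(QueryType type) {
    switch (type) {
      case DRUG:    return Arrays.asList("drugSearch", "generalSearch");
      case DISEASE: return Arrays.asList("diseaseSearch", "generalSearch");
      default:      return Collections.singletonList("generalSearch");
    }
  }

  public static void main(String[] args) {
    System.out.println(subHandlersFor(classify("aspirin dosage")));
    // [drugSearch, generalSearch]
  }
}
```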

Conceptually, the customization is quite simple: create a custom subclass of RequestHandlerBase (as advised on this wiki page) and override the handleRequestBody(SolrQueryRequest, SolrQueryResponse) method. In reality, I had quite a tough time doing this, caused (at least partly, I admit) by my ignorance of Solr internals. However, I did succeed, so in this post I outline my solution, along with some advice that I think would be useful to others embarking on a similar route.

Configuration and Code

The handler is configured to trigger in response to a /solr/mysearch request. Here is the (rewritten for readability) XML snippet from my solrconfig.xml file. I used the "invariants" block to pass in configuration parameters for the handler.

  ...
  <requestHandler name="/mysearch" 
      class="org.apache.solr.handler.ext.MyRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score</str>
      <str name="wt">xml</str>
    </lst>
    <lst name="invariants">
      <str name="prop1">value1</str>
      <int name="prop2">value2</int>
      <!-- ... more config items here ... -->
    </lst>
  </requestHandler>
  ...

And here is the (also rewritten for readability) code for the custom handler. I used the SearchHandler and MoreLikeThisHandler as my templates, but diverged from them in several ways in order to accommodate my requirements. I describe the differences below.

package org.apache.solr.handler.ext;

// imports omitted

public class MyRequestHandler extends RequestHandlerBase {

  private String prop1;
  private int prop2;
  ...
  private TaxoService taxoService;

  @Override
  public void init(NamedList args) {
    super.init(args);
    this.prop1 = invariants.get("prop1");
    this.prop2 = Integer.valueOf(invariants.get("prop2"));
    ...
    this.taxoService = new TaxoService(prop1);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {

    // extract params from request
    SolrParams params = req.getParams();
    String q = params.get(CommonParams.Q);
    String[] fqs = params.getParams(CommonParams.FQ);
    int start = 0;
    try { start = Integer.parseInt(params.get(CommonParams.START)); } 
    catch (Exception e) { /* default */ }
    int rows = 0;
    try { rows = Integer.parseInt(params.get(CommonParams.ROWS)); } 
    catch (Exception e) { /* default */ }
    SolrPluginUtils.setReturnFields(req, rsp);

    // build initial data structures
    TaxoResult taxoResult = taxoService.getResult(q);
    SolrDocumentList results = new SolrDocumentList();
    SolrIndexSearcher searcher = req.getSearcher();
    Map<String,SchemaField> fields = req.getSchema().getFields();
    int ndocs = start + rows;
    Filter filter = buildFilter(fqs, req);
    Set<Integer> alreadyFound = new HashSet<Integer>();

    // invoke the various sub-handlers in turn and return results
    doSearch1(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    doSearch2(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    // ... more sub-handler calls here ...

    // build and write response
    float maxScore = 0.0F;
    int numFound = 0;
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      Float score = (Float) sdoc.getFieldValue("score");
      if (maxScore < score) {
        maxScore = score;
      }
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    results.clear();
    results.addAll(slice);
    results.setNumFound(numFound);
    results.setMaxScore(maxScore);
    results.setStart(start);
    rsp.add("response", results);

  }

  private Filter buildFilter(String[] fqs, SolrQueryRequest req) 
      throws IOException, ParseException {
    if (fqs != null && fqs.length > 0) {
      BooleanQuery fquery = new BooleanQuery();
      for (int i = 0; i < fqs.length; i++) {
        QParser parser = QParser.getParser(fqs[i], null, req);
        fquery.add(parser.getQuery(), Occur.MUST);
      }
      return new CachingWrapperFilter(new QueryWrapperFilter(fquery));
    }
    return null;
  }

  private void doSearch1(SolrDocumentList results,
      SolrIndexSearcher searcher, String q, Filter filter, 
      TaxoResult taxoResult, int ndocs, SolrQueryRequest req,
      Map<String,SchemaField> fields, Set<Integer> alreadyFound) 
      throws IOException {
    // check entry condition
    if (! canEnterSearch1(q, filter, taxoResult)) {
      return;
    }
    // build custom query and extra fields
    Query query = buildCustomQuery1(q, taxoResult);
    Map<String,Object> extraFields = new HashMap<String,Object>();
    extraFields.put("search_type", "search1");
    boolean includeScore = 
      req.getParams().get(CommonParams.FL).contains("score");
    append(results, searcher.search(
      query, filter, maxDocsPerSearcherType).scoreDocs,
      alreadyFound, fields, extraFields, maprelScoreCutoff, 
      searcher.getReader(), includeScore);
  }

  // ... more doSearchXXX() calls here ...

  private void append(SolrDocumentList results, ScoreDoc[] more, 
      Set<Integer> alreadyFound, Map<String,SchemaField> fields,
      Map<String,Object> extraFields, float scoreCutoff, 
      SolrIndexReader reader, boolean includeScore) throws IOException {
    for (ScoreDoc hit : more) {
      if (alreadyFound.contains(hit.doc)) {
        continue;
      }
      Document doc = reader.document(hit.doc);
      SolrDocument sdoc = new SolrDocument();
      for (String fieldname : fields.keySet()) {
        SchemaField sf = fields.get(fieldname);
        if (sf.stored()) {
          sdoc.addField(fieldname, doc.get(fieldname));
        }
      }
      for (String extraField : extraFields.keySet()) {
        sdoc.addField(extraField, extraFields.get(extraField));
      }
      if (includeScore) {
        sdoc.addField("score", hit.score);
      }
      results.add(sdoc);
      alreadyFound.add(hit.doc);
    }
  }
  
  //////////////////////// SolrInfoMBeans methods //////////////////////

  @Override
  public String getDescription() {
    return "My Search Handler";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Configuration Parameters - I started out baking most of my "configuration" parameters as constants within the handler code, but later moved them into the invariants block of the XML declaration. This is still not ideal, since we need to touch the solrconfig.xml file (which is regarded as application code in our environment) to change behavior. The ideal solution, given the circumstances, would probably be to hold the configuration parameters in JNDI and have the handler pull the properties it needs from there.
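If we do go the JNDI route, the parameters could live in the servlet container rather than in solrconfig.xml. A sketch of what that might look like as env-entry declarations in the Solr webapp's web.xml (the names and values are made up for illustration); the handler would then read them with new InitialContext().lookup("java:comp/env/mysearch/prop1"):

```xml
<!-- web.xml of the Solr webapp: container-managed configuration -->
<env-entry>
  <env-entry-name>mysearch/prop1</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>value1</env-entry-value>
</env-entry>
<env-entry>
  <env-entry-name>mysearch/prop2</env-entry-name>
  <env-entry-type>java.lang.Integer</env-entry-type>
  <env-entry-value>42</env-entry-value>
</env-entry>
```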

Using Filter - The MoreLikeThis handler converts the fq (filter query) parameter into a List of Query objects, because that is what needs to be passed into searcher.getDocList(). In my case, I couldn't use DocListAndSet because DocList is unmodifiable (i.e., DocList.add() throws an UnsupportedOperationException). So I fell back to the pattern I am used to, which is getting the ScoreDoc[] array from a standard searcher.search(Query, Filter, numDocs) call. That is why buildFilter() above returns a Filter and not a List<Query>.

Connect to external services - My handler needs to connect to the taxonomy service. Our taxonomy exposes an RMI service with a very rich and fine-grained API. I tried to use this at first, but ran into problems because it needs access to configuration files on the local system, and Jetty couldn't see these files because they were not within its context. I ended up solving this by exposing a coarse-grained JSON service over HTTP on the taxonomy side. The handler calls it once per query and gets back all the information it needs in a single call. Probably not ideal, since the logic is now spread across two places - I will probably revisit the RMI client integration in the future.
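On the handler side, the taxonomy lookup is just one HTTP GET per query. A sketch of the request construction (the host, endpoint and parameter name here are invented; the real service is internal - the handler would open this URL, e.g. with HttpURLConnection, and parse the JSON response into a TaxoResult):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class TaxoClient {

  // Hypothetical endpoint; the real service URL is internal.
  static final String TAXO_URL = "http://taxo-host:8080/taxo/lookup";

  // Build the GET URL for the single coarse-grained lookup per query.
  static String buildLookupUrl(String q) {
    try {
      return TAXO_URL + "?q=" + URLEncoder.encode(q, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is guaranteed to be supported, so this cannot happen
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(buildLookupUrl("heart attack"));
    // http://taxo-host:8080/taxo/lookup?q=heart+attack
  }
}
```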

Layer multiple resultsets - This is the main reason for writing the custom handler. Most of the work happens in the append() method above. Each sub-handler calls SolrSearcher.search(Query, Filter, numDocs) and populates its resulting ScoreDocs array into a List<SolrDocument>. Since previous sub-handlers may have already returned a result, subsequent sub-handlers check against a Set of docIds.
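Stripped of the Solr types, the layering logic is just an ordered, de-duplicated merge. A self-contained sketch, with plain docIds standing in for the ScoreDoc/SolrDocument machinery:

```java
import java.util.*;

public class Layering {

  // Append hits from one sub-handler, skipping docIds already
  // claimed by earlier (higher-priority) sub-handlers.
  static void append(List<Integer> results, Set<Integer> alreadyFound,
      int[] hits) {
    for (int docId : hits) {
      if (alreadyFound.contains(docId)) {
        continue;
      }
      results.add(docId);
      alreadyFound.add(docId);
    }
  }

  public static void main(String[] args) {
    List<Integer> results = new ArrayList<Integer>();
    Set<Integer> seen = new HashSet<Integer>();
    append(results, seen, new int[] {3, 1, 4});    // sub-handler 1
    append(results, seen, new int[] {1, 5, 3, 9}); // sub-handler 2
    System.out.println(results); // [3, 1, 4, 5, 9]
  }
}
```

Note that the order of sub-handler calls is the ranking: a document found by an earlier layer keeps its position there, no matter how well it scores in a later layer.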

Add a pseudo-field to the Document - There are currently two competing initiatives in Solr (SOLR-1566 and SOLR-1298) on how to handle this situation. Since I was populating SolrDocument objects (this was one of the reasons I started using SolrDocumentList), it was relatively simple for me to pass in a Map of extra fields which are just tacked on to the end of the SolrDocument.

Some Miscellaneous advice

Here is some advice and tips which I wish someone had told me before I started out on this.

For your own sanity, standardize on a Solr release. I chose 1.4.1, which is the latest at the time of writing. Prior to that, I was developing against the Solr trunk. One day (after about 60-70% of my code was working), I decided to do an svn update, and all of a sudden there were a huge number of compile failures (in my code as well as the Solr code). Some of them were probably caused by missing or out-of-date JARs in my .classpath. But the point is that Solr is actively developed, there is quite a bit of code churn, and if you really want to work on the trunk (or a pre-release branch), you should be ready to deal with these situations.

Solr is well designed (so the flow is fairly intuitive) and reasonably well documented, but there are some places where you will probably need to step through the code in a debugger to figure out what's going on. I am still using the Jetty container in the example subdirectory. This page on Lucid Imagination outlines the steps you need to run Solr within Eclipse using the Jetty plugin, but thanks to the information on this StackOverflow page, all I did was add some command-line parameters to the java call, like so:

sujit@cyclone:example$ java -Dsolr.solr.home=my_schema \
  -agentlib:jdwp=transport=dt_socket,server=y,address=8883,suspend=n \
  -jar start.jar

and then set up an external debug configuration for localhost:8883 in Eclipse, and I could step through the code just fine.

Solr caches very aggressively (which is great for a production environment), but for development you need to disable caching. I did this by commenting out the filterCache, queryResultCache and documentCache declarations in solrconfig.xml, and changing the httpCaching element to use never304="true".
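For reference, the only non-comment change is to the httpCaching element; the development setup in solrconfig.xml ends up looking something like this (element names as in the 1.4.x example config):

```xml
<!-- development only: the filterCache, queryResultCache and
     documentCache declarations are commented out entirely -->

<!-- always regenerate the response instead of answering
     with 304 Not Modified -->
<httpCaching never304="true"/>
```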

Conclusion

The approach I described here is not as performant as the "standard" flow. Because I have to do multiple searches in a single request, I am doing more I/O. I am also consuming more CPU cycles since I have to dedup documents across each layer. I am also consuming more memory per request because I populate the SolrDocument inline rather than just pass the DocListAndSet to the ResponseBuilder. I don't see a way around it, though, given the nature of my requirements.

If you are a Solr expert, or someone who is familiar with the internals, I would appreciate hearing your thoughts about this approach - criticisms and suggestions are welcome.

14 comments (moderated to prevent spam):

roar said...

We have a similar requirement: the first five results come from a slightly different query and with a different sort order than the rest.

Currently we are using Lucene directly. For each external search, we generate sub-queries internally whose results are then appended.

We are now considering converting to Solr (since it now solves another of our requirements out of the box: grouping results).

It seems that your solution would fit quite well, but I'm a bit concerned about performance. During conversion to SolrDocument you are doing:

Document doc = reader.document(hit.doc);

...for every document in the original search results. From earlier experiences I have found this to be very time consuming. Do you have some thoughts on this?

In our current lucene solution we are doing reader.document after slicing to avoid reading documents that will not be returned anyway. I believe that could be implemented in your solution as well.

Also we are adding already found document ids to subsequent query filters and thereby avoiding the alreadyFound set and finding same document multiple times. Not sure how much that affects performance though and in what direction.

Do you see other approaches we could use to meet the sub-query/sort requirement? (I have considered making a custom FunctionQuery, but am not sure if that would work.)

Sujit Pal said...

Hi roar, thank you for the suggestions, they are good ones, I will take a look and see what I can implement.

Like you, we came to Solr from a Lucene environment. One of the reasons we are doing the reader.document(hit.doc) thing is that we needed to merge (editorially chosen, available in database table, not necessarily available in the index) records into the search results. So we cannot use the docId to dedup out subsequent records. However, with Solr, we don't have the same limitations as we did with Lucene, so we could just shove these editorially selected records into the same index with a flag. So definitely, something for me to look at, thanks for the pointer.

I also like your idea of replacing the alreadyFound set with a docId filter, need to check it out and see if its going to be feasible performance wise.

I don't know much about FunctionQuery, but depending on what the difference in your two queries are, would it make sense to use a single query with a custom Sort instead?

Anonymous said...

Hi,

Nice post! I have a few questions regarding this:

1) The prototype

doSearch1(results, searcher, q, filter, taxoResult, ndocs, req,
fields, alreadyFound);

is q the query string which needs to be parsed to create a complex query?

2) The results are appended to the SolrQueryResponse rsp. Besides just showing the results as they are, if we would like to have control over how these results are shown, how do we handle that?

Sujit Pal said...

Thanks. For #1, yes "q" is the input q value, ie the contents of CommonParams.Q. For #2, my use case is to just deduplicate and append results from each custom query, but if you wanted to interleave them for example, you could collect the results from each doSearchXXX() method and then interleave them.
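A minimal, self-contained sketch of the round-robin interleaving I have in mind, with plain lists standing in for the per-sub-handler results:

```java
import java.util.*;

public class Interleave {

  // Round-robin interleave of the per-sub-handler result lists,
  // preserving the order within each list.
  static <T> List<T> interleave(List<List<T>> lists) {
    List<T> out = new ArrayList<T>();
    int maxLen = 0;
    for (List<T> list : lists) {
      maxLen = Math.max(maxLen, list.size());
    }
    for (int i = 0; i < maxLen; i++) {
      for (List<T> list : lists) {
        if (i < list.size()) {
          out.add(list.get(i));
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> results = Arrays.asList(
      Arrays.asList("a1", "a2", "a3"),  // from doSearch1
      Arrays.asList("b1", "b2"));       // from doSearch2
    System.out.println(interleave(results)); // [a1, b1, a2, b2, a3]
  }
}
```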

Anupam said...

Hi Sujit,

I have a similar requirement, more precisely giving ontology/taxonomy support to an existing search engine built upon Nutch and Solr. But I am not able to understand how to index the incoming data on the basis of the ontology, and how to search using the same ontology. There had been ontology support in Nutch 1.2 as an ontology plugin, but since I am using Solr as the indexer and searcher, the existing plugin for Nutch 1.2 won't help me. So I need your help - maybe you can guide me in the right direction.

Sujit Pal said...

Hi Anupam, if you have used the Nutch Ontology Plugin in the past, I am guessing your ontology support is limited to subclass/superclass relations? If so, would it make sense to build up a server using code similar to the Ontology Plugin that will take a search term and return the sub/super values for it if they exist? You could then call the service as part of your search. In our case, the ontology support is slightly more involved and we split it up between a heavyweight process during indexing and a lightweight process during searching.

Anupam said...

Hello Sujit,

My ontology right now has subclass/superclass relationships. For the time being, I have created a search component in Solr which returns the subclasses of a term found in the query, which can be used for query refinement (similar to the ontology plugin in Nutch 1.2). This is fine if we want query refinement on an ontology basis, though it has its limitations when we encounter multiphrase queries (which need complex query processing). But here the problem statement is how to search unstructured data (like data crawled by Nutch) using an ontology. You suggested splitting ontology support between indexing and searching - that is where I am facing problems: how to index data on the basis of the ontology, what kind of fields to use (dynamic or fixed), how to tag the data, etc. I have around 20 ontologies for different domains. I hope you've understood my problem statement.

Sujit Pal said...

Hi Anupam, during indexing, we do NER on our documents, where the entities are items in our ontology. The ontology itself is a graph, with each entity being represented as a node (with an ID) and connected to other entities by weighted edges. We write our IDs along with the text into the index. During search time, we do NER on the query also, then compare IDs. This takes care of synonymy. We also do some fancy stuff using the relationship of an entity with other entities. We have two ways of tagging text with IDs - the first one is to have a sequence of key-value pairs representing ID,score for each entity found in the text at the document level - we implement this as a multi-field payload field (you can find some descriptions elsewhere on my blog). The second is where we drop the IDs into the body of the text as synonyms (using the strategy described for synonyms in the LIA book). In the first case, we need a special Similarity class and special query handlers that wrap a Payload query. In the second all we need to do is to decompose the query into IDs and text and just use a standard query.

Dean said...

Hi, thanks for the post, it's been very useful!

I wonder if it would be better if this class extended the solr SearchHandler (which also extends RequestHandlerBase). Then you've got multi-core support out of the box...

Will be trying to do this myself today.

Sujit Pal said...

Thanks Dean, glad it helped. And yes, your suggestion makes a lot of sense, it's better to extend SearchHandler.

Anonymous said...

Hi Sujit,

I basically want to customize Solr's results into a dynamic bucketing format. The query would decide how many buckets (groupings of data for display) need to be created. I am quite new at this, so can you please help me with how I should proceed?

Similar to what is mentioned in the blog, do I need to create a structure for the response, like the TaxoService in the blog? Say a ResultConcatenation?

What would the TaxoService class be composed of? Will the ResultConcatenation class have code to retrieve the response and transform it into the desired structure? Can you provide any link for the same?

Sujit Pal said...

Hi, the case I described takes the input query, rewrites it in different ways (optionally with the help of our taxonomy service to replace search strings with node IDs in our in-memory taxonomy graph), queries Solr with these rewritten queries, and stacks the results.

The handler takes care of pulling parameters out of the request and writing it back into the response. Each sub-handler is (or should be) relatively independent, rewriting the query and sending a search request to Solr, and returning a SolrDocumentList (we later moved to a list of docIDs for performance) and making sure docIDs which are already seen by a previous sub-handler are filtered out.

The TaxoService I refer to is a proprietary component which exposes an interface to look up a node in our graph by name and navigate the graph.

In your case, it looks like you would need to replace the linear sequence of calls to doSearchXX() in the code with some logic that uses the query to determine which searches need to be made?

Anonymous said...

hi,
we have a similar requirement. We want to send two queries to Solr, merge the responses into one, and return the final response. I want to know how to get the search response in my custom search handler using this method.

QueryResult org.apache.solr.search.SolrIndexSearcher.search(QueryResult qr, QueryCommand cmd)

Please help.

Sujit Pal said...

I've never used QueryCommand and QueryResult, but from what I see in the docs, it would probably go like this (assuming 2 queries).

QueryResult qr = emptyQueryResult;
qr = IndexSearcher.search(
       IndexSearcher.search(qr, QueryCommand(q1)),
       QueryCommand(q2));
DocListAndSet dls = qr.getDocListAndSet();

ie, chain the two search commands to populate qr, and then get the results using getDocListAndSet(). Not sure how the merge happens though. May be worth experimenting with this, thanks for the pointer.

Also found an example of QueryResult/Command use, hopefully this helps.