I have spent the last few days catching up on discussions concerning federated searches across museum collections. As mentioned by Seb, creating OpenSearch feeds for collections is trivial provided you already have an existing database and the means to do a bit of coding to create the feeds. Several institutions have now managed to expose their collections via OpenSearch feeds, such as the Powerhouse Museum, Collections Australia Network and the National Maritime Museum to name a few. People like Terry Makewell are also in the process of scraping and aggregating existing collections web pages to create standard OpenSearch feeds.
Well and good. But what exactly is the point of OpenSearch if you already have your collections in a database? Why limit yourself to OpenSearch standards when you can have a rich online catalogue like PHM’s OPAC collection? Some would suggest an OpenSearch feed is more like an API so people can query and display records on their own site with very little effort. This is certainly true. CAN is doing just that. Not only can you search for objects in the CAN database, but you can also run the same query against PHM, Picture Australia and Libraries Australia.
The advantage of using OpenSearch in this context is obvious. There’s no need to replicate data locally; the owner of the database is always in control of the records and can update, add and remove them remotely; and neither the feed providers nor the aggregation site need to spend time and money creating and implementing complicated APIs.
But is this how we normally search for things on the internet? Let me give you an example. Earlier today I wanted to read up on one of my favourite sociologists, Anthony Giddens. I know Giddens works at the LSE in London so I could go to the LSE site and look for his staff page there. Or I could go to Wikipedia and do a search there. OR I could just do a Google search. I want the most detailed and relevant pages and I don’t want my results presented individually. I really don’t care where the result comes from because I obviously don’t know which site has the “best” page for my query.
This is my main concern with OpenSearch when it comes to working with multiple feeds. It is certainly a step in the right direction but showing individual result-sets doesn’t seem like the right approach. If I do a search for “chair” on the CAN search engine, I see a list of chairs for CAN, then I can see a separate list for PHM and so on. What I really want is one big list of chairs, ranked and sorted by importance and relevance. Sure, I should also be able to just search a specific collection database, but I can most likely do that by visiting the collection’s own website.
What we need here is a federated search across museums using disparate OpenSearch feeds. But this is easier said than done. It would be one thing if you had a centralised database that would collate all the feeds but we don’t want to do that. Why? Because we would be creating duplicate records, they would be out of date and the database would be too big to manage and update.
Experiment 1: Collate, Index, Query

Goal: Send a query to two different OpenSearch collections and list all the objects combined and ranked by relevance.
To achieve this goal, I begin by grabbing the two OpenSearch result-sets and combining them into a single array. This list is then fed into Lucene, which creates a temporary index directory on the server.
Note: there is no local database used in this experiment.
Once the items are indexed we can run the same query string against the index and get a ranked and sorted list to display.
We are merging, indexing and ranking all in real-time and the entire process only takes 2-5 seconds on average. Naturally the response time depends on the speed at which we can download the OpenSearch feeds, but the indexing itself is relatively quick and querying the index is lightning fast.
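As a rough illustration of the collate, index and query steps, here is a minimal Python sketch using only the standard library. It is not the Lucene-based prototype: a small in-memory inverted index stands in for Lucene's temporary index directory, the two sample feeds are invented, and the names parse_feed, build_index and search are hypothetical. The ranking is a plain count of matching terms, which is far cruder than Lucene's scoring.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two made-up OpenSearch RSS responses standing in for real feeds (assumption).
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Brass compass</title><link>http://a.example/1</link>
<description>Mariner's compass in a brass case</description></item>
<item><title>Oak chair</title><link>http://a.example/2</link>
<description>Carved oak dining chair</description></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Pocket compass</title><link>http://b.example/9</link>
<description>Small compass with compass card and sundial</description></item>
</channel></rss>"""


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parse_feed(xml_text, source):
    """Turn one RSS result-set into a list of dicts tagged with its source."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": source,
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def build_index(records):
    """Map each token to {record position: occurrences in title + description}."""
    index = defaultdict(lambda: defaultdict(int))
    for pos, rec in enumerate(records):
        for token in tokenize(rec["title"] + " " + rec["description"]):
            index[token][pos] += 1
    return index


def search(query, records, index):
    """Return matching records, highest total term count first."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for pos, count in index.get(token, {}).items():
            scores[pos] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [records[pos] for pos, _ in ranked]


# Collate both result-sets into one list, index it, then query the index.
records = parse_feed(FEED_A, "A") + parse_feed(FEED_B, "B")
index = build_index(records)
results = search("compass", records, index)
for rec in results:
    print(rec["source"], rec["title"])
```

In a live setting the XML strings would come from the feeds themselves, for example via urllib.request.urlopen(url).read(), and that download is where most of the response time goes.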

Lucene is a heavyweight indexing engine that lets you perform complex queries against the OpenSearch collection objects. It supports wildcards, proximity searches and phrase queries grouped with quotes.
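To make those three query types concrete, the sketch below mimics them in plain Python on a list of tokens. It does not use Lucene or its query parser. The helper names and the sample description are made up, and the proximity check is a simplified two-term version rather than Lucene's own slop calculation. In Lucene's query syntax the same ideas are written roughly as comp*, "brass compass"~6 and "brass case".

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def wildcard_match(pattern, tokens):
    """True if any token matches a pattern such as 'comp*' or 'te?t'."""
    return any(fnmatchcase(tok, pattern.lower()) for tok in tokens)


def phrase_match(phrase, tokens):
    """True if the words of the phrase appear next to each other, in order."""
    words = tokenize(phrase)
    n = len(words)
    return n > 0 and any(
        tokens[i:i + n] == words for i in range(len(tokens) - n + 1)
    )


def proximity_match(term_a, term_b, max_gap, tokens):
    """True if the two terms occur with at most max_gap tokens between them."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(
        abs(a - b) - 1 <= max_gap for a in pos_a for b in pos_b if a != b
    )


# Hypothetical object description used only for this demonstration.
tokens = tokenize("Brass case holding a gimballed mariner's compass")

print(wildcard_match("comp*", tokens))
print(phrase_match("brass case", tokens))
print(phrase_match("brass compass", tokens))
print(proximity_match("brass", "compass", 6, tokens))
```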
For this particular example, I indexed the object title and description fields and Lucene uses these values to work out which record is more important based on the text in those fields. For the ultra nerdy, here’s the actual formula we are running against the records to obtain a relevance score:
score = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
where:
score: score for record
sum_t: sum for all terms t
tf_q: the square root of the frequency of t in the query
tf_d: the square root of the frequency of t in d
idf_t: log(numDocs/(docFreq_t+1)) + 1.0
numDocs: number of documents in index
docFreq_t: number of documents containing t
norm_q: sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t: square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of terms in query
Fun! Isn’t it?
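For anyone who would rather read the formula as code, here is a small Python sketch of it. It is a hand-rolled approximation, not Lucene's implementation. It assumes each record is a single text field, it reads idf_t as log(numDocs / (docFreq_t + 1)) + 1.0, and the function name score_documents and the sample records are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_documents(query, docs, boosts=None):
    """Score every document against the query using the formula above."""
    if not docs:
        return []
    boosts = boosts or {}  # boost_t per term, defaulting to 1.0
    doc_tokens = [tokenize(d) for d in docs]
    num_docs = len(docs)
    q_counts = Counter(tokenize(query))

    def idf(term):
        doc_freq = sum(1 for toks in doc_tokens if term in toks)
        return math.log(num_docs / (doc_freq + 1)) + 1.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(
        sum((math.sqrt(c) * idf(t)) ** 2 for t, c in q_counts.items())
    )

    scores = []
    for toks in doc_tokens:
        d_counts = Counter(toks)
        # norm_d_t = sqrt(number of tokens in the field)
        norm_d = math.sqrt(len(toks)) if toks else 1.0
        total = 0.0
        matched = 0
        for term, q_count in q_counts.items():
            if d_counts[term] == 0:
                continue
            matched += 1
            tf_q = math.sqrt(q_count)
            tf_d = math.sqrt(d_counts[term])
            total += (
                (tf_q * idf(term) / norm_q)
                * (tf_d * idf(term) / norm_d)
                * boosts.get(term, 1.0)
            )
        # coord_q_d = query terms found in the document / terms in the query
        coord = matched / len(q_counts) if q_counts else 0.0
        scores.append(total * coord)
    return scores


docs = ["Brass compass", "Small compass with compass card", "Oak chair"]
scores = score_documents("compass", docs)
for doc, s in zip(docs, scores):
    print(round(s, 4), doc)
```

Note that the shorter record scores higher here even though the longer one mentions "compass" twice. That is the norm_d_t term at work: matches in short fields count for more.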
A single ranked list of objects from two different OpenSearch feeds (PHM & CAN). The search term was “compass”.
I will keep working on this prototype so you can try it out yourself in the future. I will also provide some performance benchmarks.
Conclusion: This experiment shows an efficient way to merge disparate data sources into a single meaningful result-set, much the same way Google searches and displays results. The accuracy of the indexing engine is only as good as the data fed into it, and this can be left up to the data provider. For example, if you have tags associated with each collection object in your database, it would probably be a good idea to include them at the end of your object descriptions so that the index has more information to work with.
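The tag suggestion could look something like this on the data provider's side. This is only a sketch: the function name and the "Tags:" label are my own choices, not part of OpenSearch or any existing feed.

```python
def description_with_tags(description, tags):
    """Append tags to a description so the indexer sees them as extra text."""
    clean = [t.strip() for t in tags if t and t.strip()]
    if not clean:
        return description
    return description.rstrip() + " Tags: " + ", ".join(clean)


# Hypothetical record: blank tags are dropped and padding is trimmed.
enriched = description_with_tags(
    "Brass compass.", ["navigation", " maritime ", ""]
)
print(enriched)
```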
I should also mention that this method doesn’t have to be limited to OpenSearch. It can be used with any data source like OAI or even your own local database.
Finally, there have been a lot of discussions on the merits of creating custom search engines vs using Google. I personally don’t have any strong views for either solution and I’ll leave that for other forums. But I do believe a solution like the one mentioned here makes much more sense in the context of sites like CAN or anyone else who wants more control over what is searched and how the results are displayed.
I applaud the experiment: that’s a very clever take on combining and ranking disparate result sets. I think speed would be an issue for any real-life usage, but overall this sounds similar to what Terry Makewell described for their system - although I don’t think they were using Lucene?
I’m in the middle of a project at the Walker that will eventually (soon!) expose an OAI-PMH repository of our collection metadata. I’ll be sure to send you a link to play with once it’s up. Hopefully something cool can come out of that!
Great start to the blog!
Hi Giv
Picked your site up via Nate’s blog. I thought you might be interested in the stuff I’ve put together over on http://www.mashedmuseum.org.uk - I’m gently encouraging a “community of practice” around museum programmatic data access - lightweight, RESTful, RSS-feedy stuff. There is a very initial list of data sources there plus a Google Group and so on.
tfn
Mike
Thanks, Nate. Yes, I realise there is a scalability issue with doing real-time indexing. In future experiments I will work on doing the indexing separately, so the only thing the UI interacts with is the index, and that should be super fast. The main challenges there will be updating the index on a regular basis and doing so quickly.
Thanks for your inspiring experiment. Can I use the image for my slides with reference to this article? The only thing that bothers me is that you call “federated search” what is in fact a meta search with additional result merging. In federated search, all participating sources have agreed on a common standard for how to index documents, so you can easily combine the results without re-ranking. In your example there is no federation among the RSS feeds. This use of the term “federated search” always confuses people, and in the end you have to explain in detail what kind of search you do, because people tend to use trendy terms without a common definition.
Hi Jakob. I understand what you mean about the use of “federated” in this context and please feel free to use the material on this blog.