
We have OpenSearch, now what?

I have spent the last few days catching up on discussions concerning federated searches across museum collections. As mentioned by Seb, creating OpenSearch feeds for collections is trivial provided you already have an existing database and the means to do a bit of coding to create the feeds. Several institutions have now managed to expose their collections via OpenSearch feeds, including the Powerhouse Museum (PHM), Collections Australia Network (CAN) and the National Maritime Museum. People like Terry Makewell are also in the process of scraping and aggregating existing collections web pages to create standard OpenSearch feeds.

Well and good. But what exactly is the point of OpenSearch if you already have your collections in a database? Why limit yourself to OpenSearch standards when you can have a rich online catalogue like PHM’s OPAC collection? Some would suggest an OpenSearch feed is more like an API so people can query and display records on their own site with very little effort. This is certainly true. CAN is doing just that. Not only can you search for objects in the CAN database, but you can also run the same query against PHM, Picture Australia and Libraries Australia.

The advantage of using OpenSearch in this context is obvious. There’s no need to replicate data locally. The owner of the database is always in control of the records and can update, add and remove them remotely. And neither the feed providers nor the aggregation site need to spend time and money creating and implementing complicated APIs.
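To show how little effort querying a feed takes, here is a minimal Python sketch of filling in an OpenSearch 1.1 URL template. The template URL is a made-up placeholder (not the real endpoint of PHM, CAN or anyone else), and build_search_url is a hypothetical helper name of my own.

```python
import re
from urllib.parse import quote_plus


def build_search_url(template, search_terms, **params):
    """Fill an OpenSearch 1.1 URL template with a query and any extra parameters."""
    values = {"searchTerms": search_terms, **params}

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote_plus(str(values[name]))
        if optional:
            return ""  # optional parameters ({name?}) may be left empty
        raise KeyError(f"missing required template parameter: {name}")

    # Matches {name} and {name?}; prefixed names such as {geo:box} are allowed.
    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


# Hypothetical template for illustration only.
TEMPLATE = "https://collection.example.org/search?q={searchTerms}&page={startPage?}&format=rss"

print(build_search_url(TEMPLATE, "brass compass"))
print(build_search_url(TEMPLATE, "chair", startPage=2))
```

In practice the template would come from the provider's OpenSearch description document rather than being hard-coded.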

But is this how we normally search for things on the internet? Let me give you an example. Earlier today I wanted to read up on one of my favourite sociologists, Anthony Giddens. I know Giddens works at the LSE in London so I could go to the LSE site and look for his staff page there. Or I could go to Wikipedia and do a search there. OR I could just do a Google search. I want the most detailed and relevant pages and I don’t want my results presented separately, site by site. I really don’t care where the result comes from because I obviously don’t know which site has the “best” page for my query.

This is my main concern with OpenSearch when it comes to working with multiple feeds. It is certainly a step in the right direction but showing individual result-sets doesn’t seem like the right approach. If I do a search for “chair” on the CAN search engine, I see a list of chairs for CAN, then I can see a separate list for PHM and so on. What I really want is one big list of chairs, ranked and sorted by importance and relevance. Sure, I should also be able to just search a specific collection database, but I can most likely do that by visiting the collection’s own website.

What we need here is a federated search across museums using disparate OpenSearch feeds. But this is easier said than done. It would be one thing if you had a centralised database that would collate all the feeds but we don’t want to do that. Why? Because we would be creating duplicate records, they would be out of date and the database would be too big to manage and update.

Experiment 1: Collate, Index, Query

[Figure: federated OpenSearch]

Goal: Send a query to two different OpenSearch collections and list all the objects combined, ranked by relevance.

To achieve this goal, I begin by grabbing the two OpenSearch result-sets and combining them into a single array object. This list is then fed into Lucene, which creates a temporary index directory on the server.
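The merge step can be sketched roughly as follows. This is an illustration in Python rather than the code behind the prototype. It assumes the feeds come back as RSS 2.0, and the two sample documents and the parse_feed and merge_feeds helpers are invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text, source):
    """Turn one RSS-style OpenSearch response into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "source": source,  # remember which collection the record came from
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return records


def merge_feeds(feeds):
    """Combine several (source, xml_text) pairs into one flat list of records."""
    merged = []
    for source, xml_text in feeds:
        merged.extend(parse_feed(xml_text, source))
    return merged


# Made-up sample responses, standing in for two real feeds.
FEED_A = """<rss version="2.0"><channel><title>Sample feed A</title>
<item><title>Ship's compass</title><description>Brass compass in a wooden box.</description><link>https://a.example.org/1</link></item>
<item><title>Dining chair</title><description>Oak chair.</description><link>https://a.example.org/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Sample feed B</title>
<item><title>Pocket compass</title><description>Small surveyor's compass.</description><link>https://b.example.org/9</link></item>
</channel></rss>"""

merged = merge_feeds([("PHM", FEED_A), ("CAN", FEED_B)])
for record in merged:
    print(record["source"], "|", record["title"])
```

Each record keeps its source label, so the combined list can still show where an object came from after ranking.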

Note: there is no local database used in this experiment.

Once the items are indexed we can run the same query string against the index and get a ranked and sorted list to display.

We are merging, indexing and ranking all in real-time and the entire process takes only 2-5 seconds on average. Naturally the response time depends on how quickly we can download the OpenSearch feeds, but the indexing itself is relatively quick and querying the index is lightning fast.
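Since downloading dominates the response time, one option is to fetch the feeds concurrently so the wait is roughly that of the slowest feed rather than the sum of all of them. I am not claiming the prototype does this. It is a sketch using only the Python standard library, and download and fetch_all are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url, timeout=10):
    """Fetch one feed over HTTP and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=download):
    """Fetch every feed in parallel; a failed feed maps to None instead of aborting the search."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # treat an unreachable feed as "no results"

    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        # pool.map preserves input order, so zip pairs each URL with its own result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

The fetch argument is injectable so the function can be tried with a stub instead of real network calls.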


Lucene is a heavyweight indexing engine that lets you perform complex queries against the OpenSearch collection objects. It supports wildcards, proximity searches and phrase queries grouped with quotes.
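For reference, this is what those three query types look like in the classic Lucene query syntax. The small Python helpers below (phrase, proximity and prefix are names I made up) only build the query strings. They do not talk to Lucene.

```python
def phrase(text):
    """Exact phrase: wrap the words in double quotes, escaping any embedded quote."""
    return '"' + text.replace('"', '\\"') + '"'


def proximity(text, distance):
    """Words within `distance` positions of each other: "a b"~N."""
    return f"{phrase(text)}~{int(distance)}"


def prefix(term):
    """Trailing wildcard: matches any term that starts with `term`."""
    return f"{term}*"


print(phrase("ship compass"))         # "ship compass"
print(proximity("brass compass", 5))  # "brass compass"~5
print(prefix("compas"))               # compas*
```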

For this particular example, I indexed the object title and description fields and Lucene uses these values to work out which record is more important based on the text in those fields. For the ultra nerdy, here’s the actual formula we are running against the records to obtain a relevance score:


score = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where:

score: score for record
sum_t: sum for all terms t
tf_q: the square root of the frequency of t in the query
tf_d: the square root of the frequency of t in d
idf_t: log(numDocs/(docFreq_t+1)) + 1.0
numDocs: number of documents in index
docFreq_t: number of documents containing t
norm_q: sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t: square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of terms in query

Fun! Isn’t it?
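If you would rather poke at the formula than read it, here is a small Python transcription. It is a sketch, not Lucene's actual implementation. It assumes a single text field, naive whitespace tokenisation and a boost of 1 unless one is given, and score_documents is a hypothetical name.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokeniser: lowercase and split on whitespace (Lucene's analysers do more)."""
    return text.lower().split()


def score_documents(query, docs, boosts=None):
    """Score each document against the query using the formula above."""
    boosts = boosts or {}
    q_counts = Counter(tokenize(query))
    if not q_counts or not docs:
        return [0.0] * len(docs)

    doc_tokens = [tokenize(d) for d in docs]
    num_docs = len(docs)

    def idf(term):
        doc_freq = sum(1 for tokens in doc_tokens if term in tokens)
        return math.log(num_docs / (doc_freq + 1)) + 1.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(c) * idf(t)) ** 2 for t, c in q_counts.items()))

    scores = []
    for tokens in doc_tokens:
        d_counts = Counter(tokens)
        norm_d = math.sqrt(len(tokens)) if tokens else 1.0  # norm_d_t for the single field
        matched = [t for t in q_counts if t in d_counts]
        total = 0.0
        for t in matched:
            tf_q = math.sqrt(q_counts[t])
            tf_d = math.sqrt(d_counts[t])
            total += (tf_q * idf(t) / norm_q) * (tf_d * idf(t) / norm_d) * boosts.get(t, 1.0)
        coord = len(matched) / len(q_counts)  # share of query terms found in the document
        scores.append(total * coord)
    return scores


# Made-up records for illustration.
docs = ["brass ship compass", "wooden chair", "compass and chair"]
scores = score_documents("ship compass", docs)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The record containing both query terms ranks first, the one with only "compass" is penalised by coord_q_d, and the chair scores zero.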

And here’s the final result

A single ranked list of objects from two different OpenSearch feeds (PHM & CAN). The search term was “compass”.

I will keep working on this prototype so you can try it out yourself in the future. I will also provide some performance benchmarks.

Conclusion: This experiment shows an efficient way to merge disparate data sources into a single meaningful result-set, in much the same way that Google searches and displays results. The accuracy of the indexing engine is only as good as the data fed into it, and this can be left up to the data provider. For example, if you have tags associated with each collection object in your database, it would probably be a good idea to include them at the end of your object descriptions so that the index has more information to work with.
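On the provider side, appending tags could be as simple as the following Python sketch. The "Tags:" label and the description_with_tags helper are my own assumptions. Any format that gets the tag words into the description text would do.

```python
def description_with_tags(description, tags):
    """Append de-duplicated tags to a description so the indexer sees them as extra text."""
    unique = []
    for tag in tags:
        tag = tag.strip()
        # Skip blanks and case-insensitive repeats.
        if tag and tag.lower() not in (t.lower() for t in unique):
            unique.append(tag)
    if not unique:
        return description
    return f"{description.rstrip()} Tags: {', '.join(unique)}"


print(description_with_tags("Brass compass in a wooden box.", ["navigation", "maritime"]))
```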

I should also mention that this method doesn’t have to be limited to OpenSearch. It can be used with any data source like OAI or even your own local database.

Finally, there have been a lot of discussions on the merits of creating custom search engines vs using Google. I personally don’t have any strong views for either solution and I’ll leave that for other forums. But I do believe a solution like the one mentioned here makes much more sense in the context of sites like CAN or anyone else who wants more control over what is searched and how the results are displayed.