Author Archive for Giv P

Facebook Connect & Museums

I have mixed feelings about Facebook Connect, but let’s talk about the pros first.

In case you haven’t heard, Facebook officially released their Connect feature this week. It allows any external site to take advantage of FB’s authentication system and full API. In the past this was only possible through FB applications, but now you can open up your own site to over 120 million active FB users. They can log in without having to go through a registration process, and you will be able to pull in almost all the information available on their FB account, such as email address, avatar, status message, photos and a list of all their friends.

The potential is obvious. But how easy or hard is it to apply this to an existing application on your site, and how useful is it to the museum community? So I got my hands dirty and dug straight into the documentation. I used an existing framework (Symfony) that I have used to develop a collections application. The first challenge was to integrate Facebook Connect on top of Symfony’s own authentication system, and the second was to use FB’s API to extract information that I can use on a museum collection website.

And voilà! It took 10 minutes to set up the authentication system and another 20 minutes to do the rest. FB Connect is ridiculously easy to implement, and once you’re authenticated you can use their PHP classes to make calls to their API and get virtually any information you want about the user.

Here’s my information on a museum collections site after I have been authenticated:

FB Connect

The possibilities are endless. You have the user’s email address, so for example, if they wanted to purchase an image, you could take them through an e-commerce system or do something fun like let them add the object as a favourite and publish the image and object details on their Facebook profile or update their status automatically.

What makes me uncomfortable about FB Connect is that you are depending on a single site. It’s not like OpenID, and it’s hard to know whether FB will be relevant in 5 years’ time. So it may be best to keep your own sign-up process and add FB on top as a feature.

The other factor that worries me is that, unlike OpenID, you don’t even have to go to Facebook to log in. You are presented with a DHTML login window directly on your own site. So what’s stopping people from making a bogus login window and storing your login details? Surely I’ve missed something there.

But overall, I have to admit I am really impressed :)

Yahoo Search Monkey for Collections

I have to admit I can’t remember the last time I used Yahoo as a search engine. But I recently looked into Yahoo’s Search Monkey and was very impressed. Normally, you create your web page, Google picks it up and provided you’ve got a proper HTML title, meta tags etc, Google will display the title and description in its search results. Well and good. But wouldn’t it be nice if you had some control over how the search results were displayed?

Have a look at the search results for “The Smith” in Yahoo’s search engine. Notice how the Last.fm results display extra information such as genre, similar artists and even a thumbnail:

I think this is pretty amazing and it would be perfect for museum collection objects.

Here’s a result from the British Museum with the default search engine format:

It doesn’t really look any different from any other search record. So let’s beautify it a bit by showing the object thumbnail, the dimensions and the owner:

Much nicer. This took me just a few minutes to achieve and you can pretty much put anything you like in these result blocks. It doesn’t require any extra set-up from your end and all you need to do is to tell Yahoo which elements on the page you want to display using XPath.

There’s obviously a lot more you can do with Search Monkey. You can set up patterns to apply the same format to all your pages and you can even use your own custom data feeds to display information that doesn’t exist on the page.

I’m hoping Google will do something similar very soon.

Content analysis and auto-tagging

When I posted my last entry I didn’t think the next one would be 4 months later! I have been extremely busy with work and haven’t had much time to experiment with anything during my spare time. I have been involved with the National Museums Online Learning Project over at the V&A in London and I am in the process of creating a federated search component across 9 national museums. I have been fortunate enough to also be involved with helping some of the project partners develop/improve their existing collection search pages on their own site.

Currently I am experimenting with content analysis or auto-tagging. I initially decided to follow in the footsteps of PHM’s collection search and use Open Calais to see what it believes to be significant in the object description. I have to admit I was a bit disappointed. I don’t think OC is suitable for museum content since it mostly looks for news-related keywords:

Entities
Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalDisaster, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL

Events and Facts
Acquisition, Alliance, AnalystEarningsEstimate, AnalystRecommendation, Bankruptcy, BonusShares, BusinessRelation, Buybacks, CompanyAffiliates, CompanyCustomer, CompanyEarningsAnnouncement, CompanyEarningsGuidance, CompanyInvestment, CompanyLegalIssues, CompanyLocation, CompanyMeeting, CompanyReorganization, CompanyTechnology, CompanyTicker, ConferenceCall, CreditRating, FamilyRelation, IPO, JointVenture, ManagementChange, Merger, MovieRelease, MusicAlbumRelease, PersonAttributes, PersonCommunication, PersonEducation, PersonPolitical, PersonPoliticalPast, PersonProfessional, PersonProfessionalPast, PersonTravel, Quotation, StockSplit

When I attempted to extract the significant keywords from the following object, I only got 2 tags back:

There are clearly other words in the description that are meaningful. What about the most obvious keyword, “dress”?!

I then tried using Yahoo’s Content Extraction service and I was much happier with the results:

Of course, the advantage of Open Calais is that it neatly groups your tags into specific categories (see the list above) but again, this being a Reuters project, it is very much news-centric and it ignores a lot of important semantic metadata we want to see with museum content.

Image recognition and museum collections

I have been trying to think of a clever but relatively simple way to use characteristics derived from photos to search for similar photos/objects in a database. I wanted this process to be completely automated the same way Open Calais uses keywords to tag and connect blocks of text.

One obvious way is to use a tool similar to Flickr’s notes and allow users to select portions of the photo and tag or label them. We can then search the database for similar tags in photos and establish a network. But this requires a fair bit of work and I’m not sure a lot of users would bother using this tool.

So how do you automate this process without using complex image recognition software and an MIT degree? Simple answer, you probably can’t. But we can do something else that doesn’t require a lot of work, it is completely automated, the tools are free and it allows you to connect your collection objects through a medium other than the standard tags and text search.

Colour!
I got this idea from an experimental tool created by Jim Bumgardner a while ago and decided to create my own version to use against museum collections.

Take this image from the Walker Art Center database for example:

It would be nice to display say 4 of the most commonly used colours in this photo and then allow users to search for other objects with the same colours. Here’s what I did:

1. Dither image down to 256 colours
2. Cycle through every single pixel, and extract RGB values
3. Omit white colours (since most objects will have the colour white)
4. Omit duplicate colours from the spectrum
5. Get an average of all the remaining colours and grab the top 4 colours
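
As a rough illustration of the steps above, here is a minimal Python sketch. It is not the code used for this experiment, and it makes some assumptions: the pixels are already available as a list of (r, g, b) tuples, dithering is swapped for simply snapping each channel to a coarse grid (steps of 51, which gives at most 216 colours), and the "top 4" are picked by counting how often each snapped colour occurs. The names `dominant_colours`, `to_hex`, `step` and `white_cutoff` are hypothetical.

```python
from collections import Counter


def dominant_colours(pixels, top_n=4, step=51, white_cutoff=230):
    """Return the top_n most frequent colours as (r, g, b) tuples.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    """
    counts = Counter()
    for r, g, b in pixels:
        # Skip near-white pixels, which are usually just the background.
        if min(r, g, b) >= white_cutoff:
            continue
        # Snap each channel to a coarse grid so similar shades share a bucket.
        bucket = tuple(min(255, round(c / step) * step) for c in (r, g, b))
        counts[bucket] += 1
    return [colour for colour, _ in counts.most_common(top_n)]


def to_hex(colour):
    """Format an (r, g, b) tuple as a hex string such as '#CC3399'."""
    return "#{:02X}{:02X}{:02X}".format(*colour)


# Tiny made-up example: 5 blue pixels, 10 white pixels, 3 red pixels.
pixels = [(10, 20, 200)] * 5 + [(250, 250, 250)] * 10 + [(200, 30, 30)] * 3
print([to_hex(c) for c in dominant_colours(pixels)])
```

Counting snapped colours also takes care of the duplicates, since identical shades simply land in the same bucket.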

Now we have 4 of the most prominent colours in this photo and here they are:

We can now store these colour values in the database along with other information about the object.

How do we use the colour information?
We can store the hex values of the image and then allow users to search by colour, e.g. “show me all objects that contain the hex colour #CC3399”.

Or we can have more intelligent colour searches by storing the RGB values and performing a proximity search, selecting objects that contain colours close to #CC3399. For example, we search for the colour R: 100, G: 100, B: 100. But instead of requiring an exact colour match, we can allow the database to use an approximate value of ±10 for each colour channel so it finds similar shades of the requested colour.
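
A proximity search like this could look roughly as follows. This is only a sketch using Python's built-in sqlite3 module with an in-memory database; the table `object_colours`, its columns and the sample rows are assumptions made up for the example.

```python
import sqlite3


def hex_to_rgb(hex_colour):
    """Convert a hex string such as '#CC3399' to an (r, g, b) tuple."""
    value = hex_colour.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def find_by_colour(conn, hex_colour, tolerance=10):
    """Return ids of objects with a stored colour within +/- tolerance per channel."""
    r, g, b = hex_to_rgb(hex_colour)
    rows = conn.execute(
        """
        SELECT DISTINCT object_id FROM object_colours
        WHERE r BETWEEN ? AND ?
          AND g BETWEEN ? AND ?
          AND b BETWEEN ? AND ?
        ORDER BY object_id
        """,
        (r - tolerance, r + tolerance,
         g - tolerance, g + tolerance,
         b - tolerance, b + tolerance),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical table: one row per prominent colour of each object.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_colours (object_id INTEGER, r INTEGER, g INTEGER, b INTEGER)"
)
conn.executemany(
    "INSERT INTO object_colours VALUES (?, ?, ?, ?)",
    [(1, 204, 51, 153), (2, 198, 60, 150), (3, 20, 20, 200)],
)
print(find_by_colour(conn, "#CC3399"))  # objects 1 and 2 are close enough
```

Setting `tolerance=0` turns this back into the exact hex match described first.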

For the sake of experimentation, I wanted to go one step further. It’s nice having the option to display objects by a single colour, but what if I wanted to search for all objects that have a dark blue in the middle only? Now this is starting to look more serious. The example image above has a lot of blue in it but the strongest shades of blue appear horizontally in the middle. So we should be able to search for objects with a similar pattern. Here’s how I tackled this challenge.

1. Slice the image into 16 segments. This is just an arbitrary number. 4 didn’t seem like it was enough and 32 was too much.
2. Cycle through the pixels of each segment and store a SINGLE average RGB value for that segment i.e. the most commonly used colour in that segment.
3. Store the value in the database, associating the colour with its segment number (e.g. segment #1).
4. Repeat the process for all 16 segments and store in database.
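
The segment step can be sketched in a few lines of Python. Again, this is an illustration under assumptions rather than the original code: the image is assumed to be a list of rows of (r, g, b) tuples, the 16 segments are taken to be a 4x4 grid, and the "single value" is computed as the plain mean of each channel. `segment_averages` is a hypothetical name.

```python
def segment_averages(image, grid=4):
    """Return grid*grid average colours, row by row from the top left.

    image: list of rows, each row a list of (r, g, b) tuples.
    Assumes the image is at least grid x grid pixels.
    """
    height, width = len(image), len(image[0])
    averages = []
    for row in range(grid):
        top, bottom = row * height // grid, (row + 1) * height // grid
        for col in range(grid):
            left, right = col * width // grid, (col + 1) * width // grid
            block = [image[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            # Integer mean of each channel across the segment.
            averages.append(
                tuple(sum(p[i] for p in block) // len(block) for i in range(3))
            )
    return averages


# Made-up 8x8 image: dark blue top half, light grey bottom half.
image = [[(0, 0, 100)] * 8 for _ in range(4)] + [[(200, 200, 200)] * 8 for _ in range(4)]
for number, colour in enumerate(segment_averages(image), start=1):
    print(number, colour)
```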

Once we are finished, this is the information we end up with:

It looks like a low-resolution version of the original image and that’s exactly what it is. Except we now know which colour is used in which part of the image. Now using the proximity search method mentioned earlier and with some fancy SQL queries, we should be able to not only find colour matches but also patterns. The colour data here should return all objects that are predominantly blue where the darkest shades are in the middle of the image.
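
One possible shape for such a pattern query, sketched with sqlite3: ask for objects where every listed segment is within a tolerance of the wanted colour. The table `segment_colours`, the function `find_by_pattern` and the sample data are all hypothetical, and segments are assumed to be numbered 1 to 16 row by row, so the middle of a 4x4 grid is segments 6, 7, 10 and 11.

```python
import sqlite3


def find_by_pattern(conn, pattern, tolerance=10):
    """Return ids of objects whose segments match every entry in pattern.

    pattern: dict mapping segment number -> wanted (r, g, b) colour.
    """
    if not pattern:
        return []
    clauses, params = [], []
    for segment, (r, g, b) in pattern.items():
        clauses.append(
            "(segment = ? AND r BETWEEN ? AND ? "
            "AND g BETWEEN ? AND ? AND b BETWEEN ? AND ?)"
        )
        params += [segment,
                   r - tolerance, r + tolerance,
                   g - tolerance, g + tolerance,
                   b - tolerance, b + tolerance]
    # An object qualifies only if all requested segments matched.
    sql = (
        "SELECT object_id FROM segment_colours WHERE "
        + " OR ".join(clauses)
        + " GROUP BY object_id HAVING COUNT(DISTINCT segment) = ?"
        + " ORDER BY object_id"
    )
    rows = conn.execute(sql, params + [len(pattern)]).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment_colours "
    "(object_id INTEGER, segment INTEGER, r INTEGER, g INTEGER, b INTEGER)"
)
conn.executemany(
    "INSERT INTO segment_colours VALUES (?, ?, ?, ?, ?)",
    [(1, 6, 12, 8, 85), (1, 7, 5, 15, 78),       # dark blue in both middle segments
     (2, 6, 10, 10, 80), (2, 7, 180, 200, 240)],  # dark blue in segment 6 only
)
dark_blue = (10, 10, 80)
print(find_by_pattern(conn, {6: dark_blue, 7: dark_blue}))
```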

Conclusion
The processes above may look like a lot of work for the CPU, but they really aren’t. I have tested this process on several images and the entire analysis for each image took less than 1 second.

It may not be necessary to use the second method at all. Doing a basic colour search may be sufficient in most cases. Having the ability to search a collection database by colour is pretty powerful in my opinion and adding this extra metadata to your objects won’t require much work at all!

We have OpenSearch, now what?

I have spent the last few days catching up on discussions concerning federated searches across museum collections. As mentioned by Seb, creating OpenSearch feeds for collections is trivial provided you already have an existing database and the means to do a bit of coding to create the feeds. Several institutions have now managed to expose their collections via OpenSearch feeds, such as the Powerhouse Museum, Collections Australia Network and the National Maritime Museum to name a few. People like Terry Makewell are also in the process of scraping and aggregating existing collections web pages to create standard OpenSearch feeds.

Well and good. But what exactly is the point of OpenSearch if you already have your collections in a database? Why limit yourself to OpenSearch standards when you can have a rich online catalogue like PHM’s OPAC collection? Some would suggest an OpenSearch feed is more like an API so people can query and display records on their own site with very little effort. This is certainly true. CAN is doing just that. Not only can you search for objects in the CAN database, but you can also run the same query against PHM, Picture Australia and Libraries Australia.

The advantage of using OpenSearch in this context is obvious. There’s no need to replicate data locally, the owner of the database is always in control of the records and can update, add and remove them remotely and neither the feed providers nor the aggregation site need to spend time and money creating and implementing complicated APIs.

But is this how we normally search for things on the internet? Let me give you an example. Earlier today I wanted to read up on one of my favourite sociologists, Anthony Giddens. I know Giddens works at the LSE in London so I could go to the LSE site and look for his staff page there. Or I could go to Wikipedia and do a search there. OR I could just do a Google search. I want the most detailed and relevant pages and I don’t want my results presented individually. I really don’t care where the result comes from because I obviously don’t know which site has the “best” page for my query.

This is my main concern with OpenSearch when it comes to working with multiple feeds. It is certainly a step in the right direction but showing individual result-sets doesn’t seem like the right approach. If I do a search for “chair” on the CAN search engine, I see a list of chairs for CAN, then I can see a separate list for PHM and so on. What I really want is one big list of chairs, ranked and sorted by importance and relevance. Sure, I should also be able to just search a specific collection database, but I can most likely do that by visiting the collection’s own website.

What we need here is a federated search across museums using disparate OpenSearch feeds. But this is easier said than done. It would be one thing if you had a centralised database that would collate all the feeds but we don’t want to do that. Why? Because we would be creating duplicate records, they would be out of date and the database would be too big to manage and update.

Experiment 1: Collate, Index, Query

federated OpenSearch

Goal: Send a query to 2 different OpenSearch collections and list all the objects combined and ranked by relevance.

To achieve this goal, I begin by grabbing the two OpenSearch result-sets and combining them into a single array object. This list is then fed into Lucene and a temporary index directory is created on the server.
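
The "grab and combine" step could be sketched like this in Python, assuming the feeds are OpenSearch responses in RSS 2.0 form with plain `item`, `title`, `description` and `link` elements. The feed contents and example.org links below are made up, the feeds are given as strings rather than downloaded, and the Lucene indexing step is not shown.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text, source):
    """Turn one RSS result-set into a list of record dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return records


def merge_feeds(feeds):
    """feeds: dict of source name -> RSS text. Returns one combined list."""
    merged = []
    for source, xml_text in feeds.items():
        merged.extend(parse_feed(xml_text, source))
    return merged


# Made-up result-sets standing in for two downloaded feeds.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Mariner's compass</title><description>Brass compass in wooden box</description><link>http://example.org/a/1</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Pocket compass</title><description>Small compass with lid</description><link>http://example.org/b/7</link></item>
<item><title>Compass rose print</title><description>Printed chart detail</description><link>http://example.org/b/9</link></item>
</channel></rss>"""

records = merge_feeds({"A": FEED_A, "B": FEED_B})
for record in records:
    print(record["source"], record["title"])
```

Keeping the source name on each record means the final ranked list can still show where an object came from.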

Note: there is no local database used in this experiment.

Once the items are indexed we can run the same query string against the index and get a ranked and sorted list to display.

We are merging, indexing and ranking all in real-time and the entire process only takes 2-5 seconds on average. Naturally the response time is dependent on the speed at which we can download the OpenSearch feeds, but the indexing itself is relatively quick and querying the index is lightning fast.

federated OpenSearch

Lucene is a heavyweight indexing engine and allows you to perform complex queries against the OpenSearch collection objects. You can use wild-cards, run proximity searches and group query strings using quotes.

For this particular example, I indexed the object title and description fields and Lucene uses these values to work out which record is more important based on the text in those fields. For the ultra nerdy, here’s the actual formula we are running against the records to obtain a relevance score:


score = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where:

score: score for record
sum_t: sum for all terms t
tf_q: the square root of the frequency of t in the query
tf_d: the square root of the frequency of t in d
idf_t: log(numDocs / (docFreq_t + 1)) + 1.0
numDocs: number of documents in index
docFreq_t: number of documents containing t
norm_q: sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t: square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of terms in query
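
To make the formula a little more concrete, here is a small pure-Python sketch that follows the definitions above term by term. It is not Lucene itself: tokenising is just lower-casing and splitting on whitespace, each record is treated as one field of text, the idf is read as log(numDocs / (docFreq_t + 1)) + 1.0, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def score(query, doc, docs, boosts=None):
    """Score one document against a query using the formula above.

    docs is the whole collection (needed for numDocs and docFreq_t).
    """
    boosts = boosts or {}
    q_terms = Counter(tokenize(query))
    d_terms = Counter(tokenize(doc))
    if not q_terms or not d_terms:
        return 0.0
    num_docs = len(docs)

    def idf(term):
        doc_freq = sum(1 for d in docs if term in tokenize(d))
        return math.log(num_docs / (doc_freq + 1)) + 1.0

    # norm_q: sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(f) * idf(t)) ** 2 for t, f in q_terms.items()))
    # norm_d_t: square root of the number of tokens in the document field
    norm_d = math.sqrt(sum(d_terms.values()))

    total, matched = 0.0, 0
    for term, freq in q_terms.items():
        if term not in d_terms:
            continue
        matched += 1
        tf_q = math.sqrt(freq)
        tf_d = math.sqrt(d_terms[term])
        total += (tf_q * idf(term) / norm_q
                  * tf_d * idf(term) / norm_d
                  * boosts.get(term, 1.0))
    coord = matched / len(q_terms)  # coord_q_d
    return total * coord


docs = ["brass compass in wooden box", "wooden chair", "compass compass rose"]
ranked = sorted(docs, key=lambda d: score("compass", d, docs), reverse=True)
print(ranked)
```

In this toy collection the short record that mentions "compass" twice ranks first, and the chair scores zero because it shares no terms with the query.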

Fun! Isn’t it?

And here’s the final result

A single ranked list of objects from two different OpenSearch feeds (PHM & CAN). The search term was “compass”.

I will keep working on this prototype so you can try it out yourself in the future. I will also provide some performance benchmarks.

Conclusion: This experiment shows an efficient way to combine disparate data sources into a single meaningful result-set, much the same way Google searches and displays results. The accuracy of the indexing engine is only as good as the data fed into it and this can be left up to the data provider. For example, if you have tags associated with each collection object in your database, it would probably be a good idea to include them at the end of your object descriptions so that the index has more information to work with.

I should also mention that this method doesn’t have to be limited to OpenSearch. It can be used with any data source like OAI or even your own local database.

Finally, there have been a lot of discussions on the merits of creating custom search engines vs using Google. I personally don’t have any strong views for either solution and I’ll leave that for other forums. But I do believe a solution like the one mentioned here makes much more sense in the context of sites like CAN or anyone else who wants more control over what is searched and how the results are displayed.

Introductions

I should begin this first post by providing an explanation of the purpose of this site and a brief background into what prompted me to start this project and what I plan to achieve. I have outlined a brief context in the About section.

I didn’t want to create a new blog discussing new initiatives in the cultural sector. These topics are already being discussed by experts in the arena, e.g. the MW08 conference and excellent blogs such as the New Media Initiative and Sebastian Chan’s Fresh + New. Instead, what I would like to do here is to draw on existing research and provide a mostly technical perspective on tackling some of these issues, such as aggregation, scraping, federated search, speed, scalability, UI and use of social data.

I will be experimenting with existing museum data sources and different technologies to hopefully start a creative discussion between developers currently working on similar issues. I will attempt to describe everything in detail and I hope readers will contribute to my posts by providing queries and feedback.