When I posted my last entry I didn’t think the next one would be 4 months later! I have been extremely busy with work and haven’t had much time to experiment with anything in my spare time. I have been involved with the National Museums Online Learning Project over at the V&A in London, and I am in the process of creating a federated search component across 9 national museums. I have also been fortunate enough to help some of the project partners develop and improve the existing collection search pages on their own sites.
Currently I am experimenting with content analysis or auto-tagging. I initially decided to follow in the footsteps of PHM’s collection search and use Open Calais to see what it believes to be significant in the object description. I have to admit I was a bit disappointed. I don’t think OC is suitable for museum content since it mostly looks for news-related keywords:
Entities
Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalDisaster, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
Events and Facts
Acquisition, Alliance, AnalystEarningsEstimate, AnalystRecommendation, Bankruptcy, BonusShares, BusinessRelation, Buybacks, CompanyAffiliates, CompanyCustomer, CompanyEarningsAnnouncement, CompanyEarningsGuidance, CompanyInvestment, CompanyLegalIssues, CompanyLocation, CompanyMeeting, CompanyReorganization, CompanyTechnology, CompanyTicker, ConferenceCall, CreditRating, FamilyRelation, IPO, JointVenture, ManagementChange, Merger, MovieRelease, MusicAlbumRelease, PersonAttributes, PersonCommunication, PersonEducation, PersonPolitical, PersonPoliticalPast, PersonProfessional, PersonProfessionalPast, PersonTravel, Quotation, StockSplit
When I attempted to extract the significant keywords from the following object, I only got 2 tags back:
There are clearly other words in the description that are meaningful. What about the most obvious keyword, “dress”?!
I then tried using Yahoo’s Content Extraction service and I was much happier with the results:
Of course, the advantage of Open Calais is that it neatly groups your tags into specific categories (see the list above), but again, this being a Reuters project, it is very much news-centric and ignores a lot of the important semantic metadata we want to see with museum content.


Giv:
Tom Tague from Calais here.
First – thanks for giving Calais a shot. We’re the last people that would claim that it’s the best solution for all cases.
There may be a case of competing goals here. Our intention with Calais is to provide you with named entities (as well as facts and events). Rather than attempting to just pull out key phrases – we want to tell you whether those phrases are people, companies, technologies, geographies, etc.
This degree of extraction may be overkill for your current application – if what you are looking for is just some simple tags that can be exposed as search terms you may be better off with a general term extraction service like Yahoo’s.
On the other hand – you do give something up. If, in the future, you might want to be able to run queries / ask questions like “who are the top 100 people that appear in conjunction with the country of Italy?” – simple tags will never allow you to accomplish that.
So – my suggestion: don’t choose. Use both. Use Calais to extract named entities, use Yahoo to extract general terms, mash the two together, de-duplicate and store the results. This would give you the best of both worlds without losing the future opportunities for more sophisticated usages beyond keyword search.
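As a rough illustration of the "mash the two together, de-duplicate" step, here is a minimal Python sketch. It makes no calls to Calais or Yahoo: it assumes you have already fetched the results and reshaped them into (text, type) pairs for the named entities and plain strings for the general terms. The function names `normalize` and `merge_tags`, the output shape and the sample data are all made up for illustration.

```python
def normalize(tag):
    """Collapse whitespace and case so 'Silk  Dress' and 'silk dress' match."""
    return " ".join(tag.split()).casefold()


def merge_tags(entities, terms):
    """Merge typed entities with untyped terms, dropping duplicates.

    entities: iterable of (text, entity_type) pairs, e.g. ("London", "City")
    terms:    iterable of plain strings from a general term extractor
    Returns a list of dicts in first-seen order; a typed entity wins over
    an untyped term with the same normalized text.
    """
    merged = {}
    # Entities go in first so their type information is never overwritten.
    for text, entity_type in entities:
        key = normalize(text)
        if key and key not in merged:
            merged[key] = {"tag": text.strip(), "type": entity_type}
    # General terms only fill in what the entity extractor did not return.
    for text in terms:
        key = normalize(text)
        if key and key not in merged:
            merged[key] = {"tag": text.strip(), "type": None}
    return list(merged.values())


if __name__ == "__main__":
    # Hand-written sample data, not real API output.
    sample_entities = [("London", "City"), ("Italy", "Country")]
    sample_terms = ["dress", "london", "silk", "Dress "]
    for row in merge_tags(sample_entities, sample_terms):
        print(row)
```

Keeping the entity type alongside each tag is what leaves the door open for the more structured queries mentioned above, while the untyped terms still cover plain keyword search.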
You can also expect Calais’ museum collection relevancy to increase significantly over the next year or so. We’re forging a partnership with a major museum that will give us access to significant collections-relevant lexicons and entities.
Regards
Hi Tom,
Many thanks for the feedback. I completely agree and I wasn’t dismissing Calais as a viable solution. In fact, I’m planning to use it for locations, since I am unable to extract clean locations from the Yahoo API to pass on to a mapping API like Google Maps.
Thanks again.
Giv
Giv:
Great - you might be interested in two Calais-based geo-location apps that popped up this weekend. One has great code samples and documentation, so it could be a good jumpstart. Both are discussed here: http://contextforge.com/2008/10/developers-developers-developers/
Tom