Making Your Spider Outperform Google: Enhancing Search Precision through Manually-Selected Best Bets
  Richard Wiggins
  Michigan State University
  [email protected]
  May 2003

Poll: Who owns a piece of luggage with built-in wheels?
  The wheel was invented thousands of years ago
  We just figured out how to add wheels to luggage about 20 years ago
  Also needed the enabling technology of the inline skate wheel

Innovations Are Obvious: After You Innovate
  The Accidental Thesaurus: a thesaurus driven purely from log analysis
  The accidental thesaurus represents the wheels you should add to your search engine

Thesis
  By analyzing search logs, you engage in a conversation with your customers
  At best, it's a two-way conversation:
    Your users tell you what they seek
    You tune your search engine (and your site) to give them what they seek the most
  Search is too important to leave in the hands of robots

Access 98 Conference Proposal: The Accidental Thesaurus

A Modest Proposal: The Accidental Thesaurus
  For intranet, online product catalog, newspaper, campus sites
  Build a thesaurus based on what people look for
  Don't even try to be comprehensive
  Use your search logs to find what people look for -- and how they actually search
  Fuzzy matching of user searches against thesaurus, a la AskJeeves

Agenda: The Accidental Thesaurus
  Why We Needed It
  How We Built It
  Why It Works So Well
  Fellow Travelers: Best Bets
  Conclusions

Why We Needed It

The Wonderful Things Search Engines Do
  Help harness massive amounts of content
    Thousands, millions, billions of URLs
  Cut across barriers
    Document structure
    Topical structure
    Institutional structure

The Horrible Things Search Engines Do
  Confuse low-value content with vital content
    And obsolete content
    And humorous content
    And draft, internal, duplicative content
  Rank leaf pages ahead of starting points
  Rank popular or personal pages ahead of official content

How People Approach a Home Page
  Some people just start clicking -- browsers
  Others look for a search box and immediately type a search word or phrase -- searchers
  Browsers convert to searchers when:
    Too many mouse clicks fail to yield a result
    Information architecture is poor
  Therefore, search engines serve two groups of users:
    Those who prefer to search in the first place
    Those who have tried browsing, and are now somewhat frustrated

What Users Expect of Search Engines
  They type in the word(s) they think of, not the labels we assign
    "Jobs" instead of "Employment Office"
  They expect official sites to float to the top of the hit list
    If it's not in the first 10 hits, often the user gives up
  They expect complete coverage
  They expect disambiguation
    "Human Resources" finds the right office at MSU or degree programs
  They do not iterate or use complicated syntax

Searching Webspace at Michigan State University
  www.msu.edu is a big, complicated space
  MSU Webspace in general is much, much more complicated
    Hundreds of official servers
    Perhaps 2,000,000 URLs including many personal pages
  Result: browsing to find the page you want is ever more futile
  Both browsers and searchers increasingly frustrated

One user view of search.msu.edu: Academics
  application for graduation, overseas study, ordering catalog, School of Music, Computer Science, human ecology department, psychology 101

Another user view of search.msu.edu: Virtual Library
  DNA sequencing, climate change, beam theory, feline brain tumor, PRL and sequencing

Another user view of search.msu.edu: Extension
  livestock pavilion, wildlife fisheries, bathtub removal and installation, Round Bale Storage

Google Versus AltaVista
  MSU AltaVista launched in 1996
  Google indexed MSU in 2000
  Google is now the default search engine at hundreds of universities
  Google exploits simplicity
    Heavy weighting on link popularity
    Assumed "and" between search terms
    Result: desired item much more likely to appear at top of hit list

MSU AltaVista vs. Google's MSU Index
  Search phrase (entered as-is,    MSU AltaVista        Google.com/msu
  no quotes, no caps)              hit list position    hit list position
  human resources                  2                    1
  breslin center                   12                   1
  admissions                       1                    1
  sky calendar                     3                    2
  manual of business procedures    6                    1
  anatomy                          35                   1
  blackboard                       8                    5
  orientation                      Not in index.        2
  remote sensing                   Not in index.        2

MSU AltaVista Search: Grades
  Student wants to find semester grades
  None of these relevant!

MSU Google: grades
  Even the mighty Google can't find the best grades site

Hell Hath No Fury Like a Content Provider Scorned
  You think your users get upset when they can't find things
  Your content providers are really full of rage
  We decided we needed to do something
    To help our users
    And to mollify our content providers

How About a Registered Keywords Approach?
  Student needs to find the Registrar's home page
  Registrar wants students to find the home page
  Let's just map a search for "registrar" to a known, hand-picked URL
  A la AOL Keywords

AOL Keywords Example:
  Search for "survivor" and you get

ESPN Keywords Example:

Why Not Pick the Most Popular Keywords First?
  Number of Searches: 15905 4718 3804 3700 1528 1501 1417 1360 1270 1209 1142 1105 1094 950 950 895 874 849 822
  Key Words: twig pilot blackboard twig login employment human resources stuinfo Study abroad email pilot email jobs olin housing capa tuition monstertrak transcripts Library Breslin Center
  Notes: Hard to find e-mail service from MSU home page; Course management system; student, staff, faculty, grad; Health Center; Online testing system; Online jobs searches; basketball arena
  Note: These searches are for starting points!

How We Did It

MSU Keywords Features
  We store popular search phrases in a database:
    We map key words to the best URLs
    For each popular search phrase, we look up the best URL and enter it into the database
  We manage it all through a Web interface

What the User Sees
  When a user searches, we query the database, then query the search engine
  Present results for both on one screen
  What we hand-pick is always at the top of the hit list

User Searches for grades
  Stuinfo is the place where a student finds grades

MSU Keywords: Functional View
  Diagram components: Search Handler, MSU Search Logger, keywords.msu.edu (example search: business library)
  User enters search phrase at search.msu.edu or www.msu.edu.
  Search Handler sends phrase to MSU Keywords for database search; if match, hits returned to handler.
  Search Handler queries MSU AltaVista via Applications Programming Interface.
  Results from MSU Keywords merged with MSU AltaVista hits and sent to user.
  Search phrase is sent asynchronously via XML to MSU Search Logger database.

Web-Based Management

Why It Works So Well

Evidence that MSU Keywords Helps
  Fewer complaints from users
    Far fewer complaints of "I can't find how to apply for a job"
  Fewer complaints from content providers
  Positive feedback from both
  Testing confirms that people do use MSU Keywords

Backwards Scientific Method
  First build the thing
  Find out it works well
  Now form a hypothesis as to why

What Search Logs Reveal
  Chart: Michigan State University: Unique Searches by Rank (vertical axis 0 to 3500; horizontal axis: Rank, 1 to 961)
  A Classic Zipf Distribution
    Most commonly-used search phrases
    Least commonly-used search phrases

Why the Approach Works So Well
  To understand success, you must understand the Zipf curve
  A small number of unique search phrases account for a large number of all searches performed
  Out of 200,000 searches:
    The top 500 account for 40%!
    The top 1000 account for 50%!
  A database with only 1000 entries can assist your customers with 50% of their searches
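
A table that small can be as simple as a dictionary keyed on normalized search phrases. The following is a minimal Python sketch of the lookup-then-merge flow described in the Functional View, under stated assumptions: an in-memory dictionary stands in for the database, and the names `BEST_BETS`, `normalize`, `search`, and `fake_engine`, along with every phrase-to-URL mapping, are hypothetical placeholders rather than the actual MSU Keywords implementation.

```python
import re

# Hypothetical hand-picked entries. The phrases and example.edu URLs are
# placeholders, not the contents of the real MSU Keywords database.
BEST_BETS = {
    "grades": ["https://example.edu/stuinfo"],
    "jobs": ["https://example.edu/employment"],
    "registrar": ["https://example.edu/registrar"],
}


def normalize(phrase):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", phrase.strip().lower())


def search(phrase, engine):
    """Return hand-picked hits first, then engine hits without duplicates."""
    merged = list(BEST_BETS.get(normalize(phrase), []))
    for url in engine(phrase):
        if url not in merged:
            merged.append(url)
    return merged


def fake_engine(phrase):
    """Stand-in for the crawler-based engine (assumption for this sketch)."""
    return [
        "https://example.edu/grading-policy",
        "https://example.edu/stuinfo",
    ]


if __name__ == "__main__":
    # The hand-picked URL leads the list; the engine's duplicate is dropped.
    print(search("  Grades ", fake_engine))
    # No best bet for this phrase, so the engine's ranking is used unchanged.
    print(search("beam theory", fake_engine))
```

Keeping the hand-picked hits first and deduplicating the engine's hits is one possible design choice; it preserves the rule that what is hand-picked is always at the top of the hit list while still letting the engine answer uncommon searches.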

How Big Should Your Accidental Thesaurus Be?
  Out of 200,000 searches at msu.edu:
  Percent Coverage
  Unique Key Words Needed
  10 14
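
One way to answer the sizing question for your own site is to count normalized phrases in the search log and walk down the ranked list. The sketch below is a rough illustration, assuming the log is simply one search phrase per line; the function names and the toy log are hypothetical and do not reproduce the MSU figures.

```python
from collections import Counter


def phrase_counts(log_lines):
    """Count searches per phrase after lowercasing and collapsing whitespace."""
    return Counter(
        " ".join(line.lower().split()) for line in log_lines if line.strip()
    )


def coverage_of_top(counts, n):
    """Fraction of all searches accounted for by the n most common phrases."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(n)) / total


def phrases_needed(counts, target):
    """Smallest number of top-ranked phrases covering `target` of all searches."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Toy log (assumption): ten searches, four unique phrases.
    toy_log = ["grades"] * 5 + ["Jobs", "jobs ", "jobs"] + ["library", "round bale storage"]
    counts = phrase_counts(toy_log)
    print(coverage_of_top(counts, 1))   # share covered by the top phrase
    print(phrases_needed(counts, 0.8))  # phrases needed for 80% coverage
```

Running `phrases_needed` for several targets against a real log gives the percent-coverage table for that site, and `coverage_of_top` checks claims of the form "the top 1000 account for 50%".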

Google and Best Bets
  50% of unique searches are rarely entered
  MSU Keywords works great for popular searches
  Google's relevancy works great for the uncommon search
    calculate GPA
    student change of address
    guest policy

Fellow Travelers

Bristol-Myers Squibb
  Built a best bets service at the same time we did
  Served intranet needs
  Federated search with two existing intranet search engines
  Very successful
  Inspired by work information architect Vivian Bliss has done at Microsoft
  Work by Mike Rogers, Lydia Bauer, et al.

BBC Best Links

Techstreet
  Ann Arbor, Michigan based company
  Sells engineering standards online
  90% of site visitors enter a search immediately
    A technical standard
    A Techstreet document number
    An area of interest

Techstreet Log Analysis
  Chart: Distribution of Unique Search Terms - Techstreet.com (vertical axis 0 to 350; horizontal axis: Rank, 1 to 334)
  Number    Key Words
  315       astm
  188       standards
  161       api
  141       steel
  138       water
  126       test
  125       standard
  117       code
  116       concrete
  99        design
  91        systems
  84        electrical
  81        handbook
  79        power

Other Universities
  Chart: Distribution of Unique Search Phrases, Northwestern University, September 2002
  Compared with Ohio State and Northwestern
    Same curve
    Similar search terms
    Similar rankings!
    But vastly different Web sites

Conclusions

Every Web Site Can Benefit
  At least any non-trivial site
  Listen to your customers
  Tune your search engine to deliver the right results for the high-frequency part of your Zipf curve

Best Bets Also Serves What the Institution Wants to Convey
  Breaking news
  New product
  Management of bad news
  Dealing with controversy
  Example: graduate students want to form a union
    We added the University's position to MSU Keywords
    Also the union's position
    We drove traffic to their Web site!

Pro-Active Best Bets
  Think Pro-Actively
  When breaking news occurs, people will come to your Web site and search for information
  Think the way they would
  Every press release should be examined for Best Bets material

University of Alabama: Coach Fired, Google Still Doesn't Know
  Coach was fired May 5
  Today is May 7
  Google/ua points to story from May 1

Challenges
  Guard against overpopulating
    Keywords can scale indefinitely
    A-Z index can only scale so far
    But the "good" words need to map to what users want, not what MSU webmasters want!
    .com analogy
    Judicious use of non-public aliases helps
  Too many cooks
    Need single editorial "voice", style sheet
    Need consistency

New Interface: Integrate with Google

Question: Is This Rational Behavior?

Remember:
  Search is too important to leave in the hands of robots
  The search experience you deliver is part of your information architecture
  You can control the top of the hit list
  Netfact.com/present

Credits
  MSU Keywords conceived by Richard Wiggins
  Implemented by Mathew Schuster
  Recent revisions by Ryan Simmons and Mike Zakhem
  Other Best Bets projects conceived and implemented by wise people at many places