CMU trio develops Internet search tool that sorts results in helpful clusters

Monday, June 23, 2003

By Karen Hoffmann, Post-Gazette Staff Writer

Google has muscled its way to the top of the heap among Internet search engines by ranking its results according to more than 100 factors.

THE CREATORS: From left, Carnegie Mellon University graduate student Chris Palmer, Raul Valdes-Perez and postdoctoral student Jerome Pesenti. (Andy Starnes, Post-Gazette)

But the popular service still ends up producing a single, long list of Web sites that may not be topped by the results that are most useful to someone searching the net.

Now, a local company has produced a new kind of Internet tool that it thinks might be the next step in sorting through the profusion of data that exists on the Web.

The company, Vivisimo, has developed a "clustering engine" that takes the top results from several common search engines and puts them in categories that can help people find the kind of information that is most relevant to them.

For example, a Google search on the word "cell" produces nearly 23 million results. Of the first 30, all but four relate to biological cells, and those rankings are based on how many other Web sites are linked to the Web sites listed, the content of the pages and other factors.

Vivisimo, on the other hand, takes up to 500 top sites from several Internet search engines and puts them in such categories as cell biology, cell phone, fuel cell, stem cell and "Splinter Cell review," which turns out to be analyses of a new video game called Splinter Cell.

Similarly, the Google search for "Pittsburgh" yields 6.89 million hits, and the top 10 are scattered among tourist, sports team and newspaper sites. Vivisimo breaks the "Pittsburgh" search into such categories as "hotels," "music," and "real estate," making it easier for Web users to find what they're looking for.
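The first step described above, pooling the top results of several search engines before any categories are formed, can be illustrated with a short Python sketch. This is an assumption-laden illustration, not Vivisimo's code: the function name `merge_results`, the round-robin interleaving and the use of URLs as the duplicate key are all hypothetical choices. Only the cap of 500 comes from the article.

```python
from itertools import zip_longest


def merge_results(result_lists, limit=500):
    """Interleave ranked URL lists from several engines, dropping duplicates.

    Hypothetical helper: takes the first result of every engine, then the
    second of every engine, and so on, until `limit` unique URLs are collected.
    """
    merged = []
    seen = set()
    for tier in zip_longest(*result_lists):  # pads shorter lists with None
        for url in tier:
            if url is None or url in seen:
                continue
            seen.add(url)
            merged.append(url)
            if len(merged) == limit:
                return merged
    return merged
```

Interleaving keeps each engine's best results near the top of the pool; a real metasearch tool would presumably also have to reconcile different spellings of the same address.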

Raul Valdes-Perez, co-founder of Vivisimo, compared the Web to a bookstore.

With a traditional search engine, the books are piled haphazardly on the floor, he said. With Vivisimo, they are neatly stacked on the shelves.

"We're going to change the way people see masses of information," said Valdes-Perez. "Why should people be satisfied with the inefficiencies of seeing information in a disorganized way?"

Others are beginning to notice the Pittsburgh invention.

In January, SearchEngineWatch.com, a site that analyzes and offers advice on search engines, named Vivisimo the "Best Meta Search Engine" of 2002. PC Magazine calls it a "cool alternative to Google" in this month's issue.

"Clustering of results is the next step up from ranking of documents, which almost all search engines do now," said Jim Jansen, assistant professor of information sciences and technology at Penn State University.

He predicted that search engines will eventually move toward some type of clustering. "The rub in this is clustering the documents correctly," he said.

Valdes-Perez agrees: "How to do that wisely is the biggie challenge."

In Vivisimo's case, he said, the developers have used four criteria for the titles of cluster categories: They should be concise, accurate, distinctive and "humanlike" -- in other words, not something that looks like it was generated by a machine.

Vivisimo's mathematical algorithms try to do two things in creating the categories. First, they look at the title of a Web site or a summary of what's on it to figure out what the content is about. Next, they use a knowledge database of synonyms, abbreviations and different forms of words to put sites in the most relevant category.

In Vivisimo's clustering of results for the word "cell," for instance, the algorithms grouped together sites that referred to "cell phones" and "cellular phones" by recognizing the equivalence of the words cell and cellular.
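The "cell" and "cellular" example can be sketched in a few lines of Python. Vivisimo's actual algorithms were never published, so everything here is an assumption made for illustration: the `EQUIVALENTS` table stands in for the knowledge database of synonyms and word forms, and the labeling rule (the query word plus the word that follows it) is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical equivalence table standing in for the "knowledge database"
# of synonyms, abbreviations and word forms mentioned in the article.
EQUIVALENTS = {"cellular": "cell", "cells": "cell", "phones": "phone"}


def normalize(text):
    """Lowercase a title, strip punctuation and map words to a common form."""
    words = [w.strip('.,:;!?"()').lower() for w in text.split()]
    return [EQUIVALENTS.get(w, w) for w in words if w]


def group_by_phrase(titles, query):
    """Group titles under the two-word phrase that starts with the query word.

    Simplified rule for illustration: titles in which the query word is
    missing or comes last fall into an "other" group.
    """
    groups = defaultdict(list)
    for title in titles:
        words = normalize(title)
        label = "other"
        for i, word in enumerate(words[:-1]):
            if word == query:
                label = f"{word} {words[i + 1]}"
                break
        groups[label].append(title)
    return dict(groups)
```

Given the titles "Cell phones on sale" and "Cellular phone plans", both normalize to a phrase beginning "cell phone" and land in the same group, while "Cell biology basics" is filed under "cell biology".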

Clustering is not a brand-new technique.

It has been around for about a decade, but Valdes-Perez said previous clustering engines tended to produce titles that were too long and complicated to be of much use. Those engines would create groups of Web sites and then choose titles, using such techniques as the seven most popular words in the group of documents.
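The older labeling approach described above, titling a finished group with its most popular words, might look roughly like the following Python sketch. The function name `frequent_word_label` and the small stopword list are assumptions made for illustration; the article says only that such engines used techniques like taking the seven most popular words in a group of documents.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "to"}


def frequent_word_label(documents, n=7):
    """Label an already-formed group with its n most frequent words."""
    counts = Counter()
    for doc in documents:
        for word in doc.lower().split():
            word = word.strip('.,:;!?"()')
            if word and word not in STOPWORDS:
                counts[word] += 1
    return " ".join(word for word, _ in counts.most_common(n))
```

A label built this way is a string of up to seven loosely related words rather than a readable phrase, which illustrates the length and complexity problem Valdes-Perez described.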

The Vivisimo group instead tried to develop concise, to-the-point titles at the same time the engine was grouping the Web sites into categories, rather than choosing titles afterward.

The company is a spinoff from research begun in the Carnegie Mellon University computer science department, from which Valdes-Perez is now on leave. He developed the Vivisimo algorithms at Carnegie Mellon through his work and the efforts of postdoctoral student Jerome Pesenti and graduate student Chris Palmer.

When Valdes-Perez and Pesenti came up with a way to create better categories, they were so excited that "we never published the algorithms" in scientific journals, said Valdes-Perez. "We went straight to tech transfer and said we wanted to start a company."

The researchers founded the firm in their apartments in June 2000.

The company makes its money by licensing the clustering engine to organizations for use on their sites or databases. Among those that have signed up so far are NASA and Stanford University's HighWire Press.

Valdes-Perez called HighWire "hypercool" because it contains most of the world's biomedical information, which Vivisimo sorts into categories. The Journal of the American Medical Association recently signed up for Vivisimo through HighWire, as have 10 biomedical societies.

One Vivisimo client found the clustering engine helped him personally.

"I was actually using it myself to do a literature search after I was diagnosed with cancer, and I found an article using Vivisimo that my oncologist had not seen," said Michael Clarke, senior managing editor of the Division of Medical Journals and Professional Periodicals of the American Academy of Pediatrics. "That ended up helping us come up with a treatment decision that most likely would have been different had we not found that article."

Clarke said the academy is about to install Vivisimo on its journal Pediatrics. It "allows one to look at things in a different way, and sometimes you come across articles you wouldn't otherwise find," he said.

As befits a product whose name means "very lively" in Spanish and Italian, Vivisimo can search in most of the European languages, as well as in Korean, Chinese and Arabic.

Vivisimo's Web site, vivisimo.com, is intended only as an online demonstration of its clustering capabilities, but word of mouth has caused a constant increase in visitors, "even though we don't actively try to attract traffic to the site," said Valdes-Perez.

On June 12, Vivisimo announced that it had been awarded $350,000 in two Small Business Innovation Research grants from the National Science Foundation to develop homeland security applications, raising its total of research grants received to about $1 million.

Industry analysts predict that Google and other major search engines will need to make use of clustering technology to stay competitive.

Vivisimo may provide the answer to that challenge. Asked whether any of the industry heavyweights have shown interest in his company, Valdes-Perez would say only that "right now we have a pilot with a very major search engine going on."


Karen Hoffmann can be reached at khoffmann@post-gazette.com or 412-263-1994.
