Archive for the ‘Data analysis’ Category

Recorded Future analyzes streaming Web data to predict the future

Saturday, October 30th, 2010

Recorded Future, a Boston-based startup with backing from Google and In-Q-Tel, uses sophisticated linguistic and statistical algorithms to extract time-related information about entities and events from streams of Web data. Their goal is to help clients understand how the relationships between entities and events of interest change over time, and to make predictions about the future.

[Figure: Recorded Future system architecture]

A recent Technology Review article, See the Future with a Search, describes it this way:

“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
   “That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
   A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
   That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”
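To make that extraction step concrete, here is a minimal sketch in the same spirit: it tags entities and dates in a sentence, then pairs each date with the organizations mentioned alongside it. It uses spaCy's off-the-shelf English model purely as a stand-in; Recorded Future's actual algorithms are proprietary and surely far more sophisticated.

    # A toy version of entity/event/date extraction, assuming spaCy and its
    # small English model (en_core_web_sm) are installed. This is only an
    # illustration, not Recorded Future's method.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = ("Merck said on Tuesday that its Phase III trial is "
            "expected to conclude in March 2012.")
    doc = nlp(text)

    # spaCy's NER tags people, organizations, places, and dates out of the box.
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. Merck/ORG, March 2012/DATE

    # Pair each DATE with the organizations in the same sentence, a crude
    # stand-in for anchoring an event to the time it is expected to occur.
    for sent in doc.sents:
        dates = [e for e in sent.ents if e.label_ == "DATE"]
        orgs = [e for e in sent.ents if e.label_ == "ORG"]
        for d in dates:
            for o in orgs:
                print(f"event involving {o.text}, anchored to {d.text}")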

Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.

“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”

Their blog offers some insights into the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition, and sentiment analysis. A short paper, A White Paper on Temporal Analytics, has some details on their overall approach.
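For the sentiment piece, here is an equally rough sketch of classifying coverage as good or bad, using NLTK's off-the-shelf VADER analyzer on invented headlines; again, this illustrates the general technique, not the company's model.

    # A rough sketch of sentiment classification over news headlines,
    # assuming NLTK is installed; the headlines are invented examples.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    headlines = [
        "Merck beats earnings expectations as new drug sales surge",
        "Merck shares fall after disappointing trial results",
    ]

    for h in headlines:
        score = sia.polarity_scores(h)["compound"]  # ranges from -1 to +1
        label = "good" if score >= 0 else "bad"
        print(f"{label:>4}  {score:+.2f}  {h}")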

Papers with more references are cited more often

Sunday, August 15th, 2010

The number of citations a paper receives is generally thought to be a good and relatively objective measure of its significance and impact.

Researchers are naturally interested in knowing how to attract more citations to their papers. Publishing the results of good work helps, of course, but everyone knows there are many other factors. Nature News reports on research by Gregory Webster that analyzed the 53,894 articles and review articles published in Science between 1901 and 2000.

The advice the study supports is “cite and you shall be cited”.

A long reference list at the end of a research paper may be the key to ensuring that it is well cited, according to an analysis of 100 years’ worth of papers published in the journal Science.
     The research suggests that scientists who reference the work of their peers are more likely to find their own work referenced in turn, and the effect is on the rise, with a single extra reference in an article now producing, on average, a whole additional citation for the referencing paper.
     “There is a ridiculously strong relationship between the number of citations a paper receives and its number of references,” Gregory Webster, a psychologist at the University of Florida in Gainesville who conducted the research, told Nature. “If you want to get more cited, the answer could be to cite more people.”

A plot of the number of references listed in each article against the number of citations it eventually received reveals that almost half of the variation in citation rates among the Science papers can be attributed to the number of references they include. And, contrary to what many might predict, the relationship is not driven by review articles, which could be expected, on average, to be heavier on references and to garner more citations than standard papers.
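In regression terms, "almost half of the variation" corresponds to an R² near 0.5 when citation counts are regressed on reference counts. The sketch below shows how that statistic is computed, on synthetic data invented for illustration, not Webster's dataset.

    # Illustrating the statistic with scipy: R^2 from a simple linear
    # regression of citations on reference-list length. The data are
    # synthetic, generated only to demonstrate the computation.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_refs = rng.integers(5, 80, size=1000)             # references per paper
    citations = 0.9 * n_refs + rng.normal(0, 25, 1000)  # noisy linear relation
    citations = np.clip(citations, 0, None)             # counts can't be negative

    fit = stats.linregress(n_refs, citations)
    print(f"slope = {fit.slope:.2f} extra citations per added reference")
    print(f"R^2   = {fit.rvalue ** 2:.2f} (share of variance explained)")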

Han on Mining Knowledge from Databases

Monday, May 24th, 2010

AISL co-PI Jiawei Han of the University of Illinois gave an invited talk at Microsoft Research on “Mining Knowledge from Databases: An Information Network Analysis Approach”. View the video of the talk and slides from the link below. The talk runs about an hour and fifteen minutes; here is its abstract.

“Most people consider a database to be merely a data repository that supports data storage and retrieval. Actually, a database contains rich, inter-related, multi-typed data and information, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived from such information networks if we systematically develop effective and scalable database-oriented information network analysis technology.
   In this talk, we introduce database-oriented information network analysis methods and demonstrate how information networks can be used to improve data quality and consistency, facilitate data integration, and generate interesting knowledge. Moreover, we present interesting case studies on real datasets, including DBLP and Flickr, and show how interesting and organized knowledge can be generated from database-oriented information networks.”
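The core idea, treating linked database records as one typed, heterogeneous graph, is easy to illustrate. Here is a toy sketch that loads DBLP-style author/paper/venue rows into a network and runs a simple meta-path-style query; the rows and schema are invented for the example, and the analysis methods in the talk go far beyond this.

    # A toy "database as heterogeneous information network", assuming the
    # networkx library; the records below are made up for illustration.
    import networkx as nx

    G = nx.Graph()

    # DBLP-style rows: (paper id, title, venue, authors).
    papers = [
        ("p1", "Mining Heterogeneous Networks", "KDD", ["J. Han", "Y. Sun"]),
        ("p2", "Ranking-Based Clustering", "KDD", ["Y. Sun"]),
    ]

    for pid, title, venue, authors in papers:
        G.add_node(pid, kind="paper", title=title)
        G.add_node(venue, kind="venue")
        G.add_edge(pid, venue, rel="published_in")
        for a in authors:
            G.add_node(a, kind="author")
            G.add_edge(a, pid, rel="writes")

    # A simple meta-path query (author - paper - venue): every author
    # reachable from the KDD node through a paper node.
    kdd_authors = {a for p in G["KDD"] for a in G[p]
                   if G.nodes[a].get("kind") == "author"}
    print(kdd_authors)  # {'J. Han', 'Y. Sun'}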