OVERVIEW
For the last few months I've been working on a side project called TwitRank with two colleagues, Maggie Neuwald and Quentin Swain. I wanted to present our work on finding relevant tweets for current news events, which is meant to help users find real-time information about unfolding events. Our aim is to return more relevant tweets about news events than Twitter’s native search, which ranks them by time in descending order. We create a set of queries for Twitter’s search API using authoritative sources like nytimes.com, cnn.com and wsj.com, and use them to extract a set of news-related tweets from Twitter. We then employ an editor to judge the Twitter results for relevance. The judged results, along with metadata about the tweet and its author, train an optimization engine to generate a model with which to rank future Twitter results. In this paper, we describe our motivation for ranking results from Twitter queries regarding current news events, similarities and differences between our work and related work in this area, our architecture and approach to the problem, and our solutions and results thus far.
INTRODUCTION & MOTIVATION
Micro-blogging has become a popular way for many people to share data. Twitter is one such platform that offers users the ability to share information in the form of tweets. Tweets are updates that users can post about themselves and are limited to 140 characters in length. Twitter offers a medium for users to broadcast updates quickly at any time of day. Even though many tweets may refer to users’ personal lives, some tweets contain valuable information about late-breaking news and trends. Trends are identified as topics that a significant number of users are referencing in their tweets. This makes Twitter a data gold-mine because it allows users to consume information about events as they are occurring, and users may be able to find information about late-breaking news that has not yet been published on news websites, RSS feeds, or other sources of information.
With the rise of the internet, user-generated content, social networks and citizen journalism have become increasingly prominent sources of information. If people are looking for information related to an event that is unfolding, time and relevance become more important. Twitter as a source of information adds to the existing pool of data sources and significantly increases the information available at any given time.
Micro-blogging has been cited in some recent events as a source of information (for instance, in protests in Iran). In the case of late-breaking news, citizen accounts from micro-blogs may provide information at a quicker rate than traditional news sources. [4] suggests that breaking news events and mass-convergence events cause shifts in the usage patterns of social networking sites. Their work primarily focuses on Twitter, but presents results showing that such events prompt users to both disseminate and broadcast information related to late-breaking or important events, and even prompt some users to adopt long-term usage patterns for social networks. Even sites like Wikipedia, which contain detailed information about events, depend on authors to update them. Services like Twitter, however, are often updated while events unfold in front of people, providing a platform for real-time witness accounts from a variety of users.
However, with the sheer number of users on Twitter and the fact that many tweets may be less relevant or informative about an event, it becomes difficult to mine for the most relevant information regarding an event. We aim to rank Twitter search results about news events so that the most topically relevant tweets for an event appear first.
RELATED WORK
Twitter, as well as blogging, is a popular topic for research in relation to ranking and information retrieval. However, although a few commercial systems, such as Microsoft Bing and Google, are starting to include micro-blogging in search results, there are fewer academically published works containing empirical evidence on relevance results. Tweets provide an interesting subject for ranking due to their constrained length, high volume, variety of uses, and the presence of noise and spam. As a result of these attributes, different signals can be selected to determine tweet relevance and rank.
There has been exploration into which signals and features of tweets can help rank useful URLs when Twitter is leveraged in commercial search systems as an aid to recency ranking for URLs that are feature-impoverished [5]. Some of the same signals are also investigated in ranking approaches that focus specifically on Twitter [1,6]. [1], [2], and [6] examine different groups of features that can be employed when developing ranking models: content features such as length and similarity to the query; user features such as authority based on the number of followers and the number of users followed; and tweet-specific features such as retweet count and embedded URLs.
The authority of the author, for example, becomes an important and viable signal for ranking, due to the availability of metadata about authors provided by Twitter. Nagmoti et al. in [1] also discuss using author-related metadata from Twitter to judge authority and relevance in ranking results. Gonçalves et al. in [3] use popularity measures to rank blogs. However, [3] focuses more on traditional, full-length blogs and relies on published blog popularity lists to determine credibility and authority. Their work does not extend to Twitter specifically, or to the problem of finding relevant tweets as opposed to relevant blogs.
[1] and [2] suggest the importance of the role that URL and social features could play in blogging and micro-blogging environments. [2] performs an in-depth analysis of URLs embedded in tweets. It seeks to determine whether the number of tweets containing a URL, as well as the URL's CTR (click-through rate), correlate with quality and the presence of spam. We do not aim to determine the most relevant URLs present in tweets. However, we do take an approach similar to [1] by using the presence of a URL as an attribute in determining the overall relevance of a tweet. We may employ some simple filtering rules to detect spam, such as filtering out URLs that have been retweeted by the same user a number of times, but only if we find that it is necessary to obtain good results.
Our approach differs from [1] in other ways, however. For instance, [1] employs preference relevance judgments, asking users to choose between two results. Our editors instead assign a relevance score to every result to achieve a similar effect. Additionally, [1] identifies many useful features for ranking tweets in search, as well as weighting mechanisms to determine relevance. We instead apply an optimization approach to generate a ranking model for topically relevant tweets, based on queries generated for events. By defining our domain as news events, we are able to use authoritative news agencies as information sources from which to construct queries related to news events. These queries are then used to obtain training sets from Twitter search. Finally, we look to use Twitter as a source of information in and of itself, whereas many of the papers use signals from Twitter to improve search engines or other retrieval systems.
[6] defines a series of features used to develop a ranking system for tweets that is learned using the RankSVM algorithm. Their results imply that user features that calculate the authority of a Twitter user are more effective in producing an accurate ranking of tweets. The model we build is based on logistic regression. RankSVM also employs pair-wise relevance judgments for training the model, whereas our approach uses graded relevance. Also, our test data set for cross-validation is made up of a total of 50 queries from the news event domain.
PROBLEM STATEMENT
Searching for tweets relevant to events or topics can be extremely useful, depending on how the results are ranked and displayed to the user. However, many issues must be taken into account when deciding how to rank tweets, such as:
- Tweets are short (restricted to 140 characters).
- Tweets are posted at random, and large numbers of tweets are being posted at any given time.
- Many tweets may not be as informative or relevant about a news event as others.
- Many tweets contain spam or are noisy and unrelated to events or topics that are currently considered trends.
These issues affect how the tweets can be ranked effectively in order to provide results that are topically relevant to user queries relating to events and trends. This experiment proposes that machine-learning optimization approaches to ranking of topically relevant tweets could provide a model for producing a superior method of ranking. We aim to retrieve Twitter posts that are most relevant for a specific news event that is occurring.
DEFINING RELEVANCE
Definitions of relevance can vary from person to person and also differ depending on the objective. As such, we wanted to define exactly what constitutes a relevant result in the context of this experiment. We decided that the main purpose of the service would be to find facts and information regarding an event. Someone’s opinion may be informative, for instance, but is less useful than new information or facts in this context. In further refining our definitions of relevance, we made the following assumptions:
- Users are mostly interested in information about an event. They are only interested in opinions that include relevant information about an event.
- They are interested in both text and links. If one is subpar but the other is not, they will find some value in the result, although it may not be as valuable as a result that is completely relevant.
- “Relevant” means that it includes facts about the event itself and is topically relevant to the event.
Based on these assumptions, we devised judgment criteria for ranking tweets about a given query:
| Score | Criteria |
| 0 | • All information is topically irrelevant to the event. • The text and links are completely unrelated to the event • There is no information given about the event and any opinions are unrelated to the event itself • Keywords are present but are not about the event or the text is about something else topically |
| 1 | • There is some topically irrelevant information present, but there is also some relevant information. • Either the text or link is unrelated to the event or uninformative. • Keywords may match but no relevant information is given – the tweet does not provide much information about the event and any opinions do not contain information about the event. |
| 2 | • Although some information may not be correct or may be pure opinion, there is still mostly topically relevant information about the event. • Perhaps the link is incorrect or the text is mostly opinion, but the tweet is mostly about the event • Keywords match and the tweet is topically relevant, or the link is relevant although there is not as much relevant text. |
| 3 | • All information present is correct and relevant • The text and links are completely related to the event and new or interesting information is given |
Judging by these criteria, we plan to train a model using Twitter search results and the results’ metadata to find tweets for news events and rank them so that more relevant tweets appear higher in the result set.
Currently, Twitter’s native search engine sorts query results based on the date of creation, in descending order. This means that users can easily find the most recent tweets, but not necessarily the most relevant tweets for a given search query. We’ve limited the study of returning relevant tweets to the domain of news events to help refine our definition of relevance and set parameters around our test queries.
For the tweets that result from our initial queries, we normalize them to a set of common attributes to feed into the optimization engine. These attributes are:
- % of keywords in the tweet
- the number of retweets
- the location of the user in respect to the event
- the time of the tweet with respect to the event
- the number of followers of the user and the number of people the user follows
- the status count of the user
- the favorites count of the user
- the relevance score of the tweet
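As an illustration of how one of these attributes might be computed, the sketch below derives the keyword percentage for a tweet. The function name `keyword_percent` and the tokenization rule (lowercased alphanumeric tokens) are assumptions made for this example; the text does not specify how keywords were matched.

```python
import re


def keyword_percent(query: str, tweet: str) -> int:
    """Percentage of distinct query keywords that also occur in the tweet.

    Tokenization (lowercase, alphanumeric runs) is an assumption for this sketch.
    """
    def tokenize(text: str) -> set:
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    keywords = tokenize(query)
    if not keywords:
        return 0  # avoid dividing by zero for an empty query
    matched = keywords & tokenize(tweet)
    return round(100 * len(matched) / len(keywords))


if __name__ == "__main__":
    print(keyword_percent("iran election protests",
                          "Protests continue in Iran tonight"))
```

The other attributes (retweet count, follower count and so on) are plain numbers taken from the tweet metadata, so they would need no comparable processing beyond scaling.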
The optimization engine looks at the attributes of each result and applies a logistic regression model to define a function that can predict the relevance of a Twitter result. We then use that function for our testing and evaluation. Our statement becomes:
Given Qs, rank Q1…Qt so that the results are listed in descending order of r,
where Qs is the set of results from a Twitter query,
Q1…Qt are the individual results in the set,
and r is the relevance of a result.
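A minimal sketch of this statement, assuming a logistic model has already been trained: each result is scored with the logistic function over its attributes, and the set is sorted by that score in descending order. The weights below are invented for illustration and are not the values learned in our experiment; `predict_relevance` and `rank_results` are hypothetical names.

```python
import math
from typing import Dict, List


def predict_relevance(weights: Dict[int, float],
                      features: Dict[int, float],
                      bias: float = 0.0) -> float:
    """Logistic function applied to a weighted sum of attribute values."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rank_results(weights: Dict[int, float], results: List[dict]) -> List[dict]:
    """Order the results of one query by predicted relevance, highest first."""
    return sorted(results,
                  key=lambda r: predict_relevance(weights, r["features"]),
                  reverse=True)


if __name__ == "__main__":
    weights = {2: 0.05, 3: 0.5}  # hypothetical: keyword %, retweet count
    results = [
        {"id": "a", "features": {2: 66, 3: 0}},
        {"id": "b", "features": {2: 33, 3: 4}},
        {"id": "c", "features": {2: 0, 3: 0}},
    ]
    print([r["id"] for r in rank_results(weights, results)])
```

Because the logistic function is monotonic, sorting by the raw weighted sum would give the same order; the probability form is kept only because it is easier to read as a relevance estimate.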
SYSTEM ARCHITECTURE
We have built an online application called TwitRank, which the editors will use to judge tweets. Event queries can be added to the training set from TwitRank and can then be used to retrieve tweets from Twitter. The interface is simple and allows editors to select a relevance value for each saved tweet using a drop-down menu. When all tweets for a query have been judged, the editor has finished working on that query. TwitRank is built on Ruby on Rails, primarily due to its ease of use and its highly configurable deployment capabilities. We interface with the Twitter API using a Ruby wrapper library. This library allows the editor to request authorization to access and query Twitter with our hand-picked queries and return tweet results with metadata. We have created a simple workflow mechanism allowing editors to judge tweets and assign them appropriate relevancy flags without knowledge of how to use TwitRank. Once the judging phase is complete, we will optimize the n-tuples and train the system. Finally, we have to build an automated suite that runs cross-validation iterations. We evaluate and discuss the results from the cross-validation experiments.
Our system is composed of five main components:
- Building a Query Set
We are creating test queries from authoritative news websites like nytimes.com, cnn.com and wsj.com. Once we have built 50 queries, we use them as input to the next component.
- Querying Twitter
Using the queries selected manually in the previous step, we will use a Ruby library to interface with the Twitter API and retrieve tweets returned by searching each query. Each tweet will contain the tweet body as well as metadata associated with it including a timestamp, retweet count, and location. We will use some of this metadata as ranking signals in the following steps.
- Judging Phase
Editors judge each tweet to determine whether it is relevant to its associated query. We will create n-tuples, each containing ranking signals such as percent of query keywords matched, tweet age, user location and retweet count. Given this metadata, the editor has to judge whether the tweet is relevant to the query. For this experiment, we assume that an editor’s judgment is satisfactorily accurate based on our relevance criteria.
- Optimization and Training
To generalize the process of judging tweets, we optimize our n-tuples using sofia-ml [7], which is a “suite of fast incremental algorithms for machine learning.” Sofia-ml can be used to train models for ranking and is highly configurable. We use sofia-ml to generate an optimization function using a logistic regression technique, which we use to train our system iteratively. Once the function outputs satisfactory results, we test and evaluate the system.
- Test and Evaluate
To test our system, we divide the data set into 10 mutually exclusive segments and run cross-validation experiments. Eight segments are assigned to the training set and we evaluate on the remaining segments. This is repeated until each segment has been held out for evaluation once.
EXPERIMENTAL RESULTS
In order to evaluate this methodology, we employed a k-fold cross-validation plan. We first split our judged query results into 12 different buckets and transformed them into rows of tuples in the format below so that Sofia could read the file:
S<class-label> <aid> 1:<value> 2:<value> 3:<value> 4:<value> 5:<value> 6:<value> 7:<value> 8:<value> 9:<value>
Our tuples had 9 signals, which were mapped to ids in the following way:
| Tuple ID | Signal |
| 1 | Tweet length |
| 2 | % of keywords in tweet |
| 3 | Number of retweets |
| 4 | Location of user with respect to event |
| 5 | Time difference of tweet with respect to event |
| 6 | Number of followers of user |
| 7 | Number of people the user follows |
| 8 | Status count of user |
| 9 | Favorites count of user |
The tuples contained the dependent variable (our judgment) as well as each signal. Below is an example tuple from our testing:
S3 1:3 2:66 3:0 4:100 5:6 6:101 7:65 8:1189 9:37 // this one was given a relevance of 3
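The mapping from a judged result to such a row can be sketched as follows. This only mirrors the example row shown above (label, then `id:value` pairs in id order); the signal names in `SIGNAL_IDS` and the helper `to_tuple_line` are hypothetical, and the exact input format that sofia-ml expects should be checked against its own documentation.

```python
# Hypothetical signal names mapped to the tuple ids from the table above.
SIGNAL_IDS = {
    "tweet_length": 1,
    "keyword_percent": 2,
    "retweets": 3,
    "user_location": 4,
    "time_difference": 5,
    "followers": 6,
    "following": 7,
    "status_count": 8,
    "favorites_count": 9,
}


def to_tuple_line(label: int, signals: dict) -> str:
    """Serialize one judged result as 'S<label> 1:<v> 2:<v> ...' in id order."""
    ordered = sorted(signals, key=lambda name: SIGNAL_IDS[name])
    pairs = [f"{SIGNAL_IDS[name]}:{signals[name]}" for name in ordered]
    return f"S{label} " + " ".join(pairs)


if __name__ == "__main__":
    example = {
        "tweet_length": 3, "keyword_percent": 66, "retweets": 0,
        "user_location": 100, "time_difference": 6, "followers": 101,
        "following": 65, "status_count": 1189, "favorites_count": 37,
    }
    print(to_tuple_line(3, example))
```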
Then we set up 6 tests, where in each test we used 10 buckets to train Sofia and two buckets to test it. We iterated through the buckets so that each bucket was used to test the model once and used to train the model the rest of the time. Sofia-ml provides a results file after each test that can be used to further evaluate the results. This results file lists Sofia’s predicted value for each query result and the original relevance score assigned to that result.
In order to evaluate the results, we grouped the results by their original query. As this was a ranking exercise, we decided to use the Discounted Cumulative Gain (DCG) metric [8] and precision to evaluate the success of the model.
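The bucket rotation can be sketched as below. The helper `rotate_buckets` is a hypothetical name, and pairing adjacent buckets for each test is an assumption; any pairing in which every bucket is tested exactly once would serve.

```python
from typing import Iterator, List, Sequence, Tuple


def rotate_buckets(buckets: Sequence, test_size: int = 2) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) splits so that every bucket is tested exactly once."""
    if test_size <= 0 or len(buckets) % test_size != 0:
        raise ValueError("bucket count must be a positive multiple of test_size")
    for start in range(0, len(buckets), test_size):
        test = list(buckets[start:start + test_size])
        train = list(buckets[:start]) + list(buckets[start + test_size:])
        yield train, test


if __name__ == "__main__":
    for train, test in rotate_buckets(list(range(1, 13))):
        print("train on", train, "test on", test)
```

With 12 buckets and two test buckets per run, this yields the 6 tests described, each training on the other 10 buckets.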
COMPARING DISCOUNTED CUMULATIVE GAIN
In order to compare the discounted cumulative gain, we sorted the results for each query in three ways: by the Sofia-ml results, by the same method as Twitter’s search, and by the ideal. To sort by Sofia-ml’s ranking, we merged the results file with the original query ID and sorted by Sofia-ml’s number for each query. To recreate Twitter’s ranking, we sorted each query’s results by the time difference (so that more recent posts would be ranked first). For popular queries, many tweets had been entered at the same time, causing some possible variation in this ranking. However, it should be a reasonable proxy for Twitter’s ranking for almost all queries. To sort by the ideal, we simply listed each query’s results sorted by our own relevance judgments.
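The three orderings can be sketched for a single query as follows. The field names `predicted`, `time_difference` and `relevance` are hypothetical, and the sketch assumes that a smaller time difference means a more recent tweet.

```python
from typing import Dict, List


def three_rankings(results: List[dict]) -> Dict[str, List[dict]]:
    """Return one query's results in model order, recency order and ideal order."""
    return {
        # Highest predicted value first.
        "model": sorted(results, key=lambda r: r["predicted"], reverse=True),
        # Assumption: smaller time difference means more recent. Python's sort
        # is stable, so tweets with equal times keep their input order.
        "recency": sorted(results, key=lambda r: r["time_difference"]),
        # Highest editor judgment first.
        "ideal": sorted(results, key=lambda r: r["relevance"], reverse=True),
    }


if __name__ == "__main__":
    results = [
        {"id": "a", "predicted": 0.2, "time_difference": 1, "relevance": 1},
        {"id": "b", "predicted": 0.9, "time_difference": 5, "relevance": 3},
        {"id": "c", "predicted": 0.5, "time_difference": 3, "relevance": 2},
    ]
    for name, ranking in three_rankings(results).items():
        print(name, [r["id"] for r in ranking])
```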
To calculate the DCG, we first looked at the average DCG across all queries for the three rankings at position 10. We discounted the cumulative gain using a logarithm of base 2 in this initial evaluation. The results showed that Sofia-ml scored slightly higher than Twitter in absolute terms, but that the difference was not statistically significant.
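For reference, a DCG computation at a cut-off position can be sketched as below. It assumes the formulation in which gains at ranks below the logarithm base are left undiscounted and the gain at rank i is otherwise divided by log_b(i); another common variant divides by log2(i + 1) instead, and the text does not state which one was used. The function name `dcg_at` is hypothetical.

```python
import math
from typing import Sequence


def dcg_at(relevances: Sequence[float], k: int = 10, base: float = 2.0) -> float:
    """Discounted cumulative gain over the first k judgments, in ranked order."""
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        # log_base(rank) is below 1 for ranks under the base, so clamp to 1
        # (no discount) for those positions.
        discount = max(1.0, math.log(rank, base))
        total += rel / discount
    return total


if __name__ == "__main__":
    model_order = [2, 3, 0, 3, 1]  # editor scores in the model's ranked order
    ideal_order = sorted(model_order, reverse=True)
    print(dcg_at(model_order), dcg_at(ideal_order))
```

Averaging `dcg_at` over all queries for each of the three orderings gives the per-method figures compared at position 10.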
[Figure: Results for all test buckets at position 10. p-value > 0.05.]
[Figure: Average results between ranking methods at position 10. 1 is Sofia, 2 is Twitter and 3 is ideal. p-value > 0.05.]
[Figure: Results for all rank positions at logarithmic scale at base 10. DCG is on the x-axis. p-value > 0.05.]
DISCUSSION AND FUTURE WORK
Although our results were not conclusive and did not show a significant gain over ordering by time, we feel that improvements in future work could produce stronger results. Our model may be improved by changing the signals we use for relevance, for instance, or by refining our relevance criteria. Additionally, a number of tweets were identical, which resulted in many judgments of 3. We feel that looking at a greater number of unique tweets may provide better data. In future work, we’d like to form hypotheses as to what may have affected the performance of the model and refine our methodology to produce better results through more predictive signals and better training data. As it stands, date seems to be highly correlated with the signals used for this particular test, given our judgment criteria, and ordering by most recent tweets may give a good approximation of relevance for many queries.
CONCLUSION
We aim to rank tweets based on their relevance to a user’s query, within the domain of current, late-breaking news events. To test this, we enlisted a machine learning algorithm to rank tweets based on predictive signals. To train the model, we used top news events gathered from authoritative news sources like nytimes.com to create a dataset of queries. We simulated late-breaking, current events by limiting our Twitter results to those occurring within a two-week period from query issuance. We used human editors to assign a relevance score to the Twitter results and trained our optimization model on the relevance score, the % of keywords, and a number of attributes related to the content of the tweet, the authority of the user, and the user's proximity to the event in question. We then tested and evaluated our model through test and control buckets covering our dataset. We hope that through continued work, we will be able to create a ranking algorithm for Twitter queries on late-breaking news that will supply users with more relevant tweets for their query than Twitter’s current search function allows.
REFERENCES
[1] Nagmoti, Rinkesh, Ankur Teredesai, and Martine De Cock. "Ranking Approaches for Microblog Search." 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. (2010): 153-57. Print.
[2] Kandylas, Vasileios, and Ali Dasdan. "The Utility of Tweeted URLs for Web Search." WWW 2010, ACM. (2010): 1127-28.
[3] Gonçalves, Marcos André, Jussara Almeida, Luiz dos Santos, Alberto Laender, and Virgílio Almeida. "On Popularity in the Blogosphere." 2010 IEEE. (2010): 1-16. Print.
[4] L. Hughes and L. Palen. “Twitter adoption and use in mass convergence and emergency events.” In Proceedings of the 6th International Conference on Information Systems for Crisis Response and Management, 2009.
[5] Dong, Anlei, Ruiqiang Zhang, Pranam Kolari, Jing Bai, Fernando Diaz, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. “Time is of the essence: improving recency ranking using Twitter data.” In Proceedings of the 19th International Conference on World Wide Web, 2010, pages 331-340.
[6] Duan, Jiang, Qin, Zhou, and Shum. “An empirical study on learning to rank of tweets.” COLING, 2010.
[7] D. Sculley. Large scale learning to rank. In NIPS '09 Workshop on Advances in Ranking, 2009.
[8] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446.