May 27, 2011

Chrome Extensions: Browser UI


Browser Actions
  • A browser action sits in the Chrome toolbar
  • Has an icon, a tooltip, a badge and a popup
    • Tooltip: Refers to the "default_title" in the "browser_action" attribute in the manifest file
    • Badge: A bit of text layered over the icon to show a small amount of information about the state of the extension
    • Popup: Contains HTML content
  • Browser actions are registered in the extension manifest file
  • Methods:
    • setBadgeBackgroundColor
    • setBadgeText
    • setIcon
    • setPopup
    • setTitle
  • Events:
    • onClicked

Context Menus
  • Add items to the Chrome context menu via this module
  • Items can be added for images, links and pages
  • Multiple context menu items can be created, but if more than one from the same extension is visible at once, Chrome collapses them into a single parent menu
  • Context Menu permissions must be requested in the manifest file
  • Methods:
    • create
    • remove
    • removeAll
    • update
  • Types:
    • OnClickData

Desktop Notifications
  • Used to notify users about events
  • Notifications are external to the browser and look different depending on the platform
  • The notification window can be created using HTML/JavaScript
  • Desktop Notification permissions must be requested in the manifest file
  • Notifications can also interact with other views in the extension

Omnibox
  • Keywords can be registered with the Chrome address bar (a.k.a. omnibox)
  • Acts like an autocomplete plugin, sending each keystroke to the extension so that results can be suggested
  • When a suggestion is selected the extension is notified
  • Omnibox keywords have to be specified in the manifest file
  • Methods:
    • setDefaultSuggestion
  • Events:
    • onInputCancelled
    • onInputChanged
    • onInputEntered
    • onInputStarted
  • Types:
    • SuggestResult

Options Pages
  • Allows users to customize extension behavior
  • Linked to from the Chrome extensions management page
  • Declared in the manifest file under "options_page"

Override Pages
  • Override Chrome default pages with an HTML page from the extension
  • Pages that can be replaced are:
    • New tab page
    • Bookmark pages
    • Pages from the history
  • Check documentation for incognito pages
  • Page overrides have to be declared in the manifest file
  • The omnibox always gets focus when a new tab is opened, so do not rely on the page having keyboard focus

Page Actions
  • Actions that can be taken to interact with only the current page
  • Page Actions must be registered in the manifest file
  • Can have:
    • Icons
    • Tooltips
    • Popups
  • Can appear and disappear unlike browser actions, which are always visible
  • Methods:
    • hide
    • setIcon
    • setPopup
    • setTitle
    • show
  • Events:
    • onClicked

May 25, 2011

My Bookshelf

In the past 6 months I have bought over 10 books, ranging from Rails development to clean coding to architecture. I just wanted to put them down somewhere, and hopefully I'll get to import this list to my Scribd account soon. So, without further ado, here it is:

  1. Achieving Service-Oriented Architecture: Applying an Enterprise Architecture Approach, Sweeney
  2. Rails Antipatterns, Pytel & Saleh
  3. Distributed Programming with Ruby, Bates
  4. Clean Code, Martin
  5. The Art of SEO, Enge, Spencer, Fishkin & Stricchiola
  6. Java Web Services: Up and Running, Kalin
  7. Learning Rails, St. Laurent & Dumbill
  8. Deploying Rails Applications, Zygmuntowicz, Tate & Begin
  9. Agile Web Development with Rails, Ruby, Thomas & Heinemeier-Hansson
  10. Metaprogramming Ruby, Perrotta
  11. Pro CSS Techniques, Croft, Lloyd & Rubin
  12. Scripting Intelligence, Watson
  13. Sexy Web Design, Stocks
  14. OOP Demystified, Keogh & Giannini
  15. Advanced Rails Recipes, Clark

May 24, 2011

eCommerce - Highest Order of the Day

The Problem
Find the highest costing order of the day at any given time during the day.

Restriction
An order may be cancelled, but only if every order placed after it has already been cancelled.

Example
Orders coming in: A B C D E F G H
Time of day:         ------------------------>

Let's assume the following prices:
Order Price
A        $10
B        $20
C        $15
D        $33
E        $97
F        $56
G        $9
H        $22

Order F can only be cancelled if orders G and H are cancelled before it.
In this case, the highest costing order is E at $97.

Approach (using above example)

  1. Since orders can be cancelled, and because of the restriction, let's use a stack as the data structure to organize the information.
  2. We only want to retrieve the highest order at any given time.
  3. If data is stored in a stack, we can only access the top of the stack.
  4. So, let's only insert an order if its value is higher than the value of the order at the top of the stack.
  5. At the end of the day, our stack will contain (from bottom to top): A, B, D and E.
  6. E is the top of the stack and can easily be retrieved as the highest order of the day.
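
The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class and method names are made up, order IDs are assumed to be unique, and a cancellation is assumed to arrive only when the restriction allows it. Under that restriction, a cancelled order that is on the stack must be the top of the stack, so cancelling is a single comparison.

```python
class HighestOrderTracker:
    """Tracks the highest-priced active order under last-in, first-out cancellation."""

    def __init__(self):
        # Each entry is (order_id, price); prices strictly increase from bottom to top.
        self._stack = []

    def place(self, order_id, price):
        # Push only if the new order is higher than the current top of the stack.
        if not self._stack or price > self._stack[-1][1]:
            self._stack.append((order_id, price))

    def cancel(self, order_id):
        # Later orders are always cancelled first, so a cancelled order that was
        # pushed must now be at the top; any other order was never pushed.
        if self._stack and self._stack[-1][0] == order_id:
            self._stack.pop()

    def highest(self):
        # Returns (order_id, price) of the highest active order, or None.
        return self._stack[-1] if self._stack else None

    def contents(self):
        # Order IDs from the bottom of the stack to the top.
        return [order_id for order_id, _ in self._stack]


tracker = HighestOrderTracker()
for order_id, price in [("A", 10), ("B", 20), ("C", 15), ("D", 33),
                        ("E", 97), ("F", 56), ("G", 9), ("H", 22)]:
    tracker.place(order_id, price)

print(tracker.contents())  # ['A', 'B', 'D', 'E']
print(tracker.highest())   # ('E', 97)
```

Cancelling H, G and F leaves the stack untouched because none of them was pushed. Cancelling E afterwards pops it, which makes D the highest order again.
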
Conclusion
Try your own scenarios to practice such conditions because they are common themes in eCommerce applications. Stacks, queues, linked lists, etc. can seem unnecessary or too theoretical to new programmers. Exploring real examples is the best way to understand why, how and where to use these data structures.

Twitrank: Planning the architecture

Overview
I recently worked on an application called Twitrank (a tweet re-ranking engine). It is not nearly complete, but I have set up the backbone on a shared server.

Who Should Use Twitrank?
Anyone who has an interest in mining Twitter for:
  • interesting events
  • hot topics
  • user statistics
  • general data mining

Why Use Twitrank?
To find non-redundant/relevant data.

Some Background
It has been quite a challenge to figure out how to set up the system. Understanding the motivation and logic behind Twitrank requires some knowledge of:
  • information extraction
  • data mining
  • machine learning
The details of how Twitrank works are available in my previous blog post. In this post, I want to show you how I planned the project. I only used a small whiteboard and a few blank sheets to plan the architecture and the UI. Let's have a look...

The architecture is split up into 3 phases: Collection, Judgements and Ranked Searching. Ignore the order of the phases shown above since this was from the planning stage. I like to make flowcharts showing the overall flow of data through the app. I make notes in blue to indicate intermediate actions that will be performed on the data. Boxes in red are the parts that need to be built into the system. On the right side, I show the approximate UI that I envision. This is very flexible and almost always ends up changing. On the bottom right I show the associations between the models. In the middle I make notes to remind myself of what needs to be done. All of this is flexible and changes from app to app. The point of whiteboarding is to keep things moving and to make easy updates to the plan before I start coding. I like to see what I'm doing ahead of time. This fleshes out the specifics of the app and helps find bottlenecks/hindrances before they can occur.
As I brainstormed, I replaced some outdated notes with new ideas. Here, I show a graph where each node is a tweeter on Twitter (I call Twitter posts tweets and Twitter users tweeters in the app). The edges between tweeters represent that they are connected. The idea was to rank tweeters by assigning them a score. This is sort of similar to the generic PageRank model. Being connected to a tweeter with a high rank could make another tweeter's rank higher. I distinguish 'following' connections in red. There is a need to distinguish between an edge that represents 'following' and one representing 'follower'. This data could be used in analysis to determine which tweeters generally post relevant tweets. Using this data, tweets from such tweeters receive a slight boost in ranking too.
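To make the scoring idea concrete, here is a rough Python sketch of a generic PageRank-style iteration over a 'following' graph. It is not Twitrank's actual scoring code. The function name, the damping value of 0.85, the iteration count and the three example tweeters are all assumptions, and I assume that score flows from a follower to the tweeters they follow.

```python
def rank_tweeters(follows, damping=0.85, iterations=50):
    """Generic PageRank-style scores; `follows` maps a tweeter to the tweeters they follow."""
    tweeters = set(follows)
    for followed in follows.values():
        tweeters.update(followed)
    n = len(tweeters)
    scores = {t: 1.0 / n for t in tweeters}

    for _ in range(iterations):
        # Every tweeter starts each round with the same small base score.
        new_scores = {t: (1.0 - damping) / n for t in tweeters}
        for t in tweeters:
            followed = follows.get(t, [])
            if followed:
                # A tweeter passes their score on to the tweeters they follow.
                share = damping * scores[t] / len(followed)
                for f in followed:
                    new_scores[f] += share
            else:
                # A tweeter who follows nobody spreads their score evenly.
                share = damping * scores[t] / n
                for other in tweeters:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical example: alice and bob follow carol, carol follows alice.
follows = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
scores = rank_tweeters(follows)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy graph carol ends up with the highest score because two tweeters follow her, and alice benefits from being followed by the highly ranked carol.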
Just a zoomed-in shot showing the models used. The 'user' model corresponds to a user of Twitrank, not Twitter. Twitter users are represented by 'tweeters'. 'tweeters' post 'tweets' containing 'urls'. 'users' create 'queries' for which 'tweets' are retrieved from Twitter. Simple enough!
This one shows the updated architecture. Queries are issued to Twitter via the API and the results are stored in the db. The Twitrank GUI allows users to judge tweets for relevance relative to the query offered. A model is generated/updated using the machine learning software Sofia-ml. The model is finalized, stored in the db and used when Twitrank is used for mining data.

May 15, 2011

Twitrank: Slides

I have uploaded a set of slides to summarize Twitrank. Have a look: Twitrank. Any feedback is welcome.

TwitRank: Extracting and Ranking Relevant Tweets For News Events

OVERVIEW

For the last few months I've been working on a side project called Twitrank with two colleagues, Maggie Neuwald and Quentin Swain. I wanted to present our work towards finding relevant tweets for current news events to help users find real-time information about unfolding events. Our aim is to be able to return more relevant tweets about news events than Twitter’s native search, which ranks them by time in descending order. We create a set of queries using authoritative sources like nytimes.com, cnn.com and wsj.com for Twitter’s search API and use them to extract a set of news-related tweets from Twitter. We then employ an editor to judge the Twitter results for relevance. The judged results, along with meta-data about the tweet and the author of the tweet, train an optimization engine to generate a model with which to rank Twitter results for future tweets. In this paper, we describe our motivation for ranking results from Twitter queries regarding current news events, similarities and differences between our work and related work in this area, our architecture and approach to the problem, and our solutions and results thus far.
 
INTRODUCTION & MOTIVATION

Micro-blogging has become a popular way for many people to share data. Twitter is one such platform that offers users the ability to share information in the form of tweets. Tweets are updates that users can post about themselves and are limited to 140 characters in length. Twitter offers a medium for users to broadcast updates quickly at any time of day. Even though many tweets may refer to users’ personal lives, some tweets contain valuable information about late-breaking news and trends. Trends are identified as topics that a significant number of users are referencing in their tweets. This makes Twitter a data gold-mine because it allows users to consume information about events as they are occurring, and users may be able to find information about late-breaking news that has not yet been published on news websites, RSS feeds, or other sources of information.

With the rise of the internet, user-generated content, social networks and citizen journalism have become increasingly prominent sources of information. If people are looking for information related to an event that is unfolding, time and relevance become more important. Twitter as a source of information adds to the existing pool of data sources and significantly increases the information available at any given time.

Micro-blogging has been cited in some recent events as a source of information (for instance, in protests in Iran). In the case of late-breaking news, citizen accounts from micro-blogs may provide information at a quicker rate than traditional news sources. [4] suggests that breaking news events and mass-convergence events cause shifts in the usage patterns of social networking sites. Their work primarily focuses on Twitter, but presents results showing that such events prompt users to both disseminate and broadcast information related to late-breaking or important events, and even prompt some users to adopt long-term usage patterns for social networks. Even sites like Wikipedia, which contain detailed information about events, are dependent on authors to update them. Services like Twitter, however, are often updated while events unfold in front of people, providing a platform for real-time witness accounts from a variety of users.

However, with the sheer number of users on Twitter and the fact that many tweets may be less relevant or informative about an event, it becomes difficult to mine for the most relevant information regarding an event. We aim to rank Twitter search results about news events so that the most topically relevant tweets for an event appear first.

RELATED WORK

Twitter, as well as blogging, is a popular topic for research in relation to ranking and information retrieval. However, although a few commercial systems, such as Microsoft Bing and Google, are starting to include micro-blogging in search results, there are fewer academically published works containing empirical evidence on relevance results. Tweets provide an interesting subject for ranking due to their constrained length, high volume, variety of uses, and the presence of noise and spam. As a result of these attributes, different signals can be selected to determine tweet relevance and for ranking.

There has been exploration into which signals and features of tweets can help rank useful URLs when Twitter is leveraged in commercial search systems as an aid to recency ranking for URLs that are feature-impoverished [5]. Some of the same signals are also investigated in relation to ranking approaches that specifically focus on Twitter [1, 6]. [1], [2], and [6] examine different groups of features that can be employed when developing ranking models: content features like length and similarity to the query, user features like authority based on the number of followers and the number of users followed, and tweet-specific features like retweet count and embedded URLs.

The authority of the author, for example, becomes an important and viable signal for ranking, due to the availability of meta-data about authors provided by Twitter. Nagmoti et al. [1] also discuss using author-related meta-data from Twitter to judge authority and relevance in ranking results. Gonçalves et al. [3] use popularity measures to rank blogs. However, [3] focuses more on traditional, full-length blogs and relies on published blog popularity lists to determine credibility and authority. Their work does not extend to Twitter specifically or to the problem of finding relevant tweets as opposed to relevant blogs.

[1] and [2] suggest the importance and the role that URL and social features could play when applied in the blogging and micro-blogging environments. [2] performs an in-depth analysis of URLs embedded in tweets. It seeks to determine whether the number of tweets containing a URL as well as the URLs’ CTR (click through rate) correlates with quality and the presence of spam. We do not aim to determine the most relevant URLs present in tweets. However, we do take a similar approach to [1] by using the presence of a URL as an attribute in determining the overall relevance of a tweet. We may employ some simple filtering rules to detect spam, such as filtering out URLs that have been retweeted by the same user a number of times, but only if we find that it is necessary to obtain good results.

Our approach differs from [1] in other ways, however. For instance, [1] employs preference relevance judgments, asking users to choose between two results to judge relevance. Our editors will instead apply a relevance score to every result to achieve a similar effect. Additionally, [1] identifies many useful features that could be used for effective ranking of tweets when searching, as well as weighting mechanisms to determine relevance. Our approach differs in that we seek to apply an optimization approach to generate a ranking model that can be applied to rank topically relevant tweets based on queries that are generated for events. By defining our domain as news events, we are able to construct a test data set using authoritative news agencies as our information sources in order to construct queries related to news events. These queries will then be used to obtain training sets from Twitter search. Additionally, we look to use Twitter as a source of information in and of itself, whereas many of the papers hope to use signals from Twitter to improve search engines or other retrieval systems.

[6] defines a series of features used to develop a ranking system for tweets that is learned using the RankSVM algorithm. Their results imply that user features that calculate the authority of a Twitter user are more effective in producing an accurate ranking of tweets. The model we build is a logistic regression based model. RankSVM also employs pair-wise relevance judgments for training the model. Our approach will use a graded relevance approach. Also, our test data set for cross-validation is made up of a total of 50 queries from the news event domain.

PROBLEM STATEMENT

Searching for tweets relevant to events or topics can be extremely useful depending on how the results are ranked and displayed to the user. However, many issues must be considered when deciding how to rank tweets, such as:
  1. The length of tweets, which is restricted to 140 characters.
  2. Tweets are posted at random and there are large numbers of tweets being posted at any given time.
  3. Many tweets may not be as informative or relevant about a news event as others.
  4. Many tweets contain spam or are noisy and unrelated to events or topics that are currently considered trends.

These issues affect how the tweets can be ranked effectively in order to provide results that are topically relevant to user queries relating to events and trends. This experiment proposes that machine-learning optimization approaches to ranking of topically relevant tweets could provide a model for producing a superior method of ranking. We aim to retrieve Twitter posts that are most relevant for a specific news event that is occurring.

DEFINING RELEVANCE

Definitions of relevance can vary from person to person and differ depending on your objective as well. As such, we wanted to define exactly what constitutes a relevant result in the context of this experiment. We decided that the main purpose of the service would be to find facts and information regarding an event. Someone’s opinion, for instance, may be informative but is less useful than new information or facts in this context. In further refining our definitions of relevance, we made the following assumptions:
  • Users are mostly interested in information about an event. They are only interested in opinions that include relevant information about an event.
  • They are interested in both text and links. If one is subpar but the other is not, they will find some value in the result, although it may not be as valuable as a result that is completely relevant.
  • “Relevant” means that it includes facts about the event itself and is topically relevant to the event. 

Based on these assumptions, we devised judgment criteria for ranking tweets about a given query:
Score 0
  • All information is topically irrelevant to the event.
  • The text and links are completely unrelated to the event.
  • There is no information given about the event and any opinions are unrelated to the event itself.
  • Keywords are present but are not about the event, or the text is about something else topically.
Score 1
  • There is some topically irrelevant information present, but there is also some relevant information.
  • Either the text or link is unrelated to the event or uninformative.
  • Keywords may match but no relevant information is given: the tweet does not provide much information about the event and any opinions do not contain information about the event.
Score 2
  • Although some information may not be correct or may be pure opinion, there is still mostly topically relevant information about the event.
  • Perhaps the link is incorrect or the text is mostly opinion, but the tweet is mostly about the event.
  • Keywords match and the tweet is topically relevant, or the link is relevant although there is not as much relevant text.
Score 3
  • All information present is correct and relevant.
  • The text and links are completely related to the event and new or interesting information is given.


Judging by these criteria, we plan to train a model using Twitter search results and the results’ metadata to find tweets for news events and rank them so that more relevant tweets appear higher in the result set.

Currently, Twitter’s native search engine sorts query results based on the date of creation, in descending order. This means that users can easily find the most recent tweets, but not necessarily the most relevant tweets for a given search query. We’ve limited the study of returning relevant tweets to the domain of news events to help refine our definition of relevance and set parameters around our test queries.

For the tweets that result from our initial queries, we normalize them to a set of common attributes to feed into the optimization engine. These attributes are:
  • % of keywords in the tweet
  • the number of retweets
  • the location of the user with respect to the event
  • the time of the tweet with respect to the event
  • the number of followers of the user and the number of people the user follows
  • the status count of the user
  • the favorites count of the user
  • the relevance score of the tweet

The optimization engine looks at the attributes of each result and applies a logistic regression model to define a function that can predict the relevance of a Twitter result. We then use that function for our testing and evaluation. Our statement becomes:

Given Qs, rank Q1…Qt so that the results are listed in descending order of r,
where Qs is the set of results from a Twitter query,
Q1…Qt are the individual results in the set,
and r is relevance.
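
As a rough illustration of this statement, the Python sketch below scores each result with a logistic function of its signals and sorts the results in descending order of predicted relevance. The weights, bias, feature values and tweet IDs are invented for the example and are not the trained Twitrank model. Only two signals are used to keep it short.

```python
import math


def predict_relevance(weights, bias, features):
    """Logistic function applied to a weighted sum of the signals."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rank_results(results, weights, bias):
    """Return the results sorted by predicted relevance, most relevant first."""
    return sorted(
        results,
        key=lambda r: predict_relevance(weights, bias, r["features"]),
        reverse=True,
    )


# Hypothetical signals per result: [fraction of query keywords matched, retweet count].
weights = [2.0, 0.5]
bias = -1.0
results = [
    {"id": "t1", "features": [0.2, 0]},
    {"id": "t2", "features": [0.9, 3]},
    {"id": "t3", "features": [0.5, 1]},
]

ranked = rank_results(results, weights, bias)
print([r["id"] for r in ranked])  # ['t2', 't3', 't1']
```
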

SYSTEM ARCHITECTURE

We have built an online application called TwitRank, which the editors will use to judge tweets. Event queries can be added to the training set from TwitRank and can then be used to retrieve tweets from Twitter. The interface is simple and allows editors to select a relevance value for each saved tweet using a drop-down menu. When all tweets for a query have been judged, the editor has finished working on that query. TwitRank is built on Ruby on Rails, primarily due to its ease of use and its highly configurable deployment capabilities. We interface with the Twitter API using a Ruby wrapper library. This library allows the editor to request authorization to access and query Twitter with our hand-picked queries and return tweet results with metadata. We have created a simple workflow mechanism allowing editors to judge tweets and assign them appropriate relevancy flags without knowledge of how to use TwitRank. Once the judging phase is complete we will optimize the n-tuples and train the system. Finally, we have to build an automated suite that runs cross-validation iterations. We evaluate and discuss the results from the cross-validation experiments.

Our system is composed of five main components:
  1. Building a Query Set
    We are creating test queries from authoritative news websites like nytimes.com, cnn.com and wsj.com. Once we have built 50 queries we use them as input to the next component.
  2. Querying Twitter
    Using the queries selected manually in the previous step, we will use a Ruby library to interface with the Twitter API and retrieve tweets returned by searching each query. Each tweet will contain the tweet body as well as metadata associated with it including a timestamp, retweet count, and location. We will use some of this metadata as ranking signals in the following steps.
  3. Judging Phase
    Editors judge each tweet to determine whether it is relevant to its associated query. We will create n-tuples, each containing ranking signals, such as percent query keywords matched, tweet age, user location and retweet count. Given this metadata, the editor has to judge whether the tweet is relevant to the query. For this experiment, we assume that an editor’s judgment is satisfactorily accurate based on our relevance criteria.
  4. Optimization and Training
    To generalize the process of judging tweets, we optimize our n-tuples using sofia-ml, which is a “suite of fast incremental algorithms for machine learning.” Sofia-ml can be used to train models for ranking and is highly configurable. We use sofia-ml to generate an optimization function using a logistic regression technique, which we use to train our system iteratively. Once the function outputs satisfactory results we test and evaluate the system.
  5. Test and Evaluate
    To test our system, we divide the data set into 10 mutually exclusive segments and run cross-validation experiments. Eight segments are assigned to be the training set and we evaluate on the other two segments. This is repeated until each segment has been part of the evaluation set once.

EXPERIMENTAL RESULTS

In order to evaluate this methodology, we employed a k-fold cross-validation plan. We first split our judged query results into 12 different buckets and transformed them into rows of tuples in the format below so that Sofia could read the file:
S<class-label> <aid> 1:<value> 2:<value> 3:<value> 4:<value> 5:<value> 6:<value> 7:<value> 8:<value> 9:<value>

Our tuples had 9 signals, which were mapped to ids in the following way:

Tuple ID   Signal
1          Tweet length
2          % of keywords in tweet
3          Number of retweets
4          Location of user with respect to event
5          Time difference of tweet with respect to event
6          Number of followers of user
7          Number of people the user follows
8          Status count of user
9          Favorites count of user


The tuples contained the dependent variable (our judgment) as well as each signal. Below is an example tuple from our testing:
S3 1:3 2:66 3:0 4:100 5:6 6:101 7:65 8:1189 9:37         // this one was given a relevance of 3
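
For readers who want to inspect such rows, here is a small Python sketch that parses the example tuple above into a relevance label and named signals. The snake_case signal names are my own shorthand for the table entries, and the parser simply skips any token without a colon (such as the `<aid>` field in the format line), which is an assumption about how that field appears.

```python
# Hypothetical short names for the nine signals listed in the table above.
SIGNAL_NAMES = {
    1: "tweet_length",
    2: "keyword_percent",
    3: "retweet_count",
    4: "user_location",
    5: "time_difference",
    6: "followers_count",
    7: "following_count",
    8: "status_count",
    9: "favorites_count",
}


def parse_tuple(line):
    """Split one row into (relevance label, {signal name: value})."""
    body = line.split("//", 1)[0]          # drop the trailing comment, if any
    tokens = body.split()
    label = int(tokens[0].lstrip("S"))     # "S3" -> 3, as in the example row
    signals = {}
    for token in tokens[1:]:
        if ":" not in token:
            continue                       # skip fields that are not id:value pairs
        signal_id, value = token.split(":", 1)
        signals[SIGNAL_NAMES[int(signal_id)]] = float(value)
    return label, signals


example = "S3 1:3 2:66 3:0 4:100 5:6 6:101 7:65 8:1189 9:37   // this one was given a relevance of 3"
label, signals = parse_tuple(example)
print(label)                     # 3
print(signals["status_count"])   # 1189.0
```
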

Then we set up 6 tests, where in each test we used 10 buckets to train Sofia and two buckets to test Sofia. We iterated through the buckets, so that each bucket was used to test the model one time and used to train the model the rest of the time. Sofia-ml provides a results file after each test that can be used to further evaluate the results. This results file lists Sofia’s predicted value for each query result, and the original relevance score assigned to that result.
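
The bucket rotation can be sketched in Python as follows. The function name is hypothetical, and pairing consecutive buckets as the test set is an assumption, since the text does not say which two buckets were held out together in each test.

```python
def bucket_rotation(num_buckets=12, test_size=2):
    """Yield (train_buckets, test_buckets) lists of bucket indices for each test."""
    buckets = list(range(num_buckets))
    for start in range(0, num_buckets, test_size):
        test = buckets[start:start + test_size]
        train = [b for b in buckets if b not in test]
        yield train, test


splits = list(bucket_rotation())
print(len(splits))  # 6
for train, test in splits:
    print("train:", train, "test:", test)
```

Each of the 12 buckets lands in exactly one test set and in the training set of the other five tests.
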

In order to evaluate the results, we grouped the results by their original query. As this was a ranking exercise, we decided to use the Discounted Cumulative Gain (DCG) metric and precision to evaluate the success of the model.


COMPARING DISCOUNTED CUMULATIVE GAIN

In order to compare the discounted cumulative gain, we sorted the results for each query in three ways: by the Sofia-ml results, by the same method as Twitter’s search, and by the ideal. To sort by Sofia-ml’s ranking, we merged the results file with the original query ID, and sorted by Sofia-ml’s number for each query. To recreate Twitter’s ranking, we sorted each query’s results by the time difference (so that more recent posts would be ranked first). For popular queries, there were many tweets that had been entered at the same time, causing some possible variation in this ranking. However, it should be a reasonable proxy for Twitter’s ranking for almost all queries. To sort by the ideal, we simply listed each query’s results, sorted by our own relevance ranking.

To calculate the DCG, we first looked at the average DCG across all queries for the three rankings at position 10. We discounted the cumulative gain using a logarithmic base of 2 in this initial evaluation. The results showed that Sofia-ml performed slightly better than Twitter in terms of absolute values, but that the difference was not statistically significant.
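
As a minimal sketch of the metric, the Python function below computes DCG at a given position with a base-2 logarithmic discount. It follows the formulation in which the first rank is not discounted and rank i (for i of 2 or more) is divided by log2(i). The text does not say which DCG variant was used, so treat this choice as an assumption. The relevance scores in the example are made up.

```python
import math


def dcg_at(relevances, position=10):
    """DCG with a base-2 discount: rank 1 is undiscounted, rank i >= 2 is divided by log2(i)."""
    total = 0.0
    for rank, relevance in enumerate(relevances[:position], start=1):
        discount = math.log2(rank) if rank >= 2 else 1.0
        total += relevance / discount
    return total


# Hypothetical judged relevance scores (0 to 3) in the order a ranking returned them.
model_order = [3, 2, 3, 0, 1]
ideal_order = sorted(model_order, reverse=True)

print(round(dcg_at(model_order), 3))
print(round(dcg_at(ideal_order), 3))
```

The ideal ordering always gives the largest DCG for a query, which is why it serves as the upper bound when comparing the Sofia-ml and time-based orderings.
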

Results: All test buckets at position 10. p-value > 0.05
Average Results between ranking methods at position 10. 1 is Sofia, 2 is Twitter and 3 is ideal. p-value > 0.05.
Results: All rank positions at logarithmic scale at base 10. DCG is on the x-axis. p-value > 0.05.

DISCUSSION AND FUTURE WORK

Although our results were not conclusive and did not provide a significant gain over an ordering by time, we feel that improvements in future work could provide stronger results. Our model may be improved by changing the signals we use for relevance, for instance, or by refining our relevance criteria. Additionally, a number of tweets were identical, which resulted in many judgments of 3. We feel that looking at a greater number of unique tweets may also provide better data. In future work, we’d like to come up with some hypotheses as to what may have affected the performance of the model and try to refine our methodology to produce better results through the usage of more predictive signals and better training data. As it stands, date seems to be highly correlated with the signals used for this particular test, given our judgment criteria. Perhaps ordering by most recent tweets gives a good approximation of relevance for many queries.

CONCLUSION

We aim to rank tweets based on their relevance to a user’s query, within the domain of current, late-breaking news events. To test this, we enlisted a machine learning algorithm to rank tweets based on predictive signals. To train the model, we used top news events gathered from authoritative news sources like nytimes.com to create a dataset of queries. We simulated late-breaking, current events by limiting our Twitter results to those occurring within a two-week time period from query issuance. We used human editors to assign a relevance score to the Twitter results and trained our optimization model based on the relevance score, % of keywords, and a number of attributes related to the content of the tweet, the authority of the user, and the user’s proximity to the event in question. We then tested and evaluated our model through test and control buckets covering our dataset. We hope that through continued work, we will be able to create a ranking algorithm for Twitter queries on late-breaking news that will supply users with more relevant tweets regarding their query than Twitter’s current search function allows.


REFERENCES

[1] Nagmoti, Rinkesh, Ankur Teredesai, and Martine De Cock. "Ranking Approaches for Microblog Search." 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. (2010): 153-57. Print.

[2] Kandylas, Vasileios, and Ali Dasdan. "The Utility of Tweeted URLs for Web Search." WWW 2010, ACM. (2010): 1127-28.

[3] Gonçalves, Marcos André, Jussara Almeida, Luiz dos Santos, Alberto Laender, and Virgílio Almeida. "On Popularity in the Blogosphere." 2010 IEEE. (2010): 1-16. Print.

[4] L. Hughes and L. Palen. “Twitter adoption and use in mass convergence and emergency events.” In Proceedings of the 6th International Conference on Information Systems for Crisis Response and Management, 2009.

[5] Dong, Anlei, Ruiqiang Zhang, Pranam Kolari, Jing Bai, Fernando Diaz, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. 2010. “Time is of the essence: improving recency ranking using Twitter data.” In Proceedings of the 19th International Conference on World Wide Web, pages 331-340.

[6] Duan, Jiang, Qin, Zhou, and Shum. “An empirical study on learning to rank of tweets.” COLING, 2010.

[7] D. Sculley. Large scale learning to rank. In NIPS '09 Workshop on Advances in Ranking, 2009.

[8] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446.

May 2, 2011

The Problem With Modern RDBMSs

There are two fundamental problems at work here. First, relational database systems are based on the “Closed World Assumption”: information that is not in the database is considered to be false or non-existent. Second, relational databases are extremely literal. They expect that data has been properly cleaned and validated before entry and do not natively tolerate inconsistencies in data or queries.

Michael Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011