A new approach to QoS driven service selection in service oriented architectures

Valeria Cardellini; Valerio Di Valerio; Vincenzo Grassi; Stefano Iannucci; Francesco Lo Presti

doi:10.1109/SOSE.2011.6139098

Abstract

This paper presents TweeVist, a geo-tweet visualization system that helps users grasp events happening over time and space from tweets while they browse web pages, based on spatio-temporal analysis. TweeVist presents a tag cloud of tweets from different time periods that are associated with web pages based on detected events. To detect events, the system extracts normal events (e.g., crowded restaurants, crowded facilities in Walt Disney World), which can happen at any time and anywhere, by using machine learning algorithms, and it extracts unusual or special events (e.g., time sales, disasters) by comparing the current situation with those normal regularities. TweeVist can thus visualize a summary of tweets in a tag cloud to give users a quick overview of the current situation or events across time and space while they browse a web page, and it can present a list of the most related tweets to help users obtain more detailed information. Furthermore, TweeVist provides a communication function that allows users to chat with other users who browse the same web pages, or with Twitter users who follow the TweeVist account.

I. Introduction

Microblogging services such as Twitter and Tumblr are frequently used to express personal opinions or to share large numbers of updates about events such as daily life activities and incidents. Because previous work usually assumed that tweets mainly describe “what's happening now,” tweets have also been a common target for detecting real-world events based on geographical areas or location mentions. One example is a real-world event detection system based on geographical pattern mining and content analysis ^[8],^[9]. Other examples of recent visualization applications on microblogs include a tool for aggregating and visualizing categorical attributes in Twitter data through interactive map-based visualization ^[2], and a system for querying, analyzing, and visualizing geo-tagged microblogs ^[5]. However, these works focused on location-based analysis of microblogs; they did not detect tweets by considering locations together with the height information of composite buildings. In fact, many events, such as crowded restaurants and popular shops, happen at any time on different floors of composite buildings in urban areas. While browsing web pages that are not updated in real time, users find it difficult to learn about the latest events or the current situation from Twitter users. Therefore, it is important to detect spatio-temporal events, including both normal and unusual events, to better understand tweets related to dense spaces over time. To achieve this, we need not only a map-based tweet mapping method but also a spatio-temporal method for mapping tweets to web pages.

In this paper, we propose TweeVist, a novel system for visualizing tweets from different time periods that are associated with web pages based on detected events. Although we studied several tweet mapping techniques in our previous work ^[10],^[12], that work focused on location-based mapping between the contents of tweets and web pages using latitude and longitude only; it did not address the issues of height and temporal information of tweets mentioned above. Fig. 1 illustrates an example of streaming tweets that describe different floors of a large shopping mall near Osaka station in Japan, called “LUCUA Osaka.” The tweet “Yummy!” comments on food on the restaurant floor, the tweet “Cute nails!” gives an impression of the lifestyle floor, and the tweet “Fit for me! I will buy it.” reports a shopping activity on the fashion floor.

To this end, we first propose a tweet acquisition method that collects geo-tagged tweets based on content analysis and region selection. Our method can detect tweets that are related to target locations even when the tweets do not include location names, and it can also filter out tweets posted from target locations whose content is not related to those locations. This allows us to map acquired tweets to web pages by matching location names extracted from the tweets and the web pages, and to detect events by using three types of machine learning algorithms that classify tweets, for different time frames of a day, into categories of small-scale facilities taken from the floor information of web pages. To sum up, TweeVist has two main features: 1) mapping tweets to web pages based on spatial and temporal information; 2) visualizing a summary of tweets in a tag cloud alongside web pages, which changes over different time periods.

The next section provides an overview of our system and reviews related work. Section III explains how to map tweets to web pages based on spatio-temporal analysis. Section IV describes the visualization of tweets. Experimental results and conclusions are given in Sections V and VI, respectively.

Fig. 1. An example of tweets of LUCUA Osaka.

II. System Overview and Related Work

A. TweeVist: A Geo-Tweet Visualization System for Web

The processing flow of TweeVist is shown in Fig. 2. To use this system, users simply install a toolbar (a browser plug-in). The system acquires geo-tagged tweets from a given region by using the Twitter Streaming API ¹. The region is specified by a northeast point and a southwest point, and we obtain the tweets inside the rectangle bounded by these two points. The system acquires the URLs of the web pages that users are browsing with the installed toolbar in a Web browser. Once a user browses a web page with the installed toolbar, the system records the information in a database, which is used for mapping tweets to the web page based on a location name detected from the tweets and the web page, and for classifying the tweets in different time frames of a day based on category names of small-scale facilities taken from the floor information of the web page. The system then returns a tag cloud of tf-idf based words from the tweets associated with the web page during different time periods. Moreover, web users can chat with both Twitter users who follow an account ² of our system and other web users who browse the same web page ^[10]. In our system, the anonymity of all messages (tweets) can be maintained through Twitter services and a WebSocket server ³.
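
As an illustration of the region selection described above, the following minimal sketch filters tweets by a rectangle given by a southwest and a northeast point. The class name, the dictionary keys `lat` and `lon`, and the example coordinates are hypothetical and are not taken from the system itself; the sketch also assumes the rectangle does not cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle defined by its southwest and northeast corners (degrees)."""
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        # Assumption: the box does not cross the 180-degree meridian.
        return (self.sw_lat <= lat <= self.ne_lat
                and self.sw_lon <= lon <= self.ne_lon)


def filter_geo_tweets(tweets, box):
    """Keep only tweets whose coordinates fall inside the box."""
    return [t for t in tweets if box.contains(t["lat"], t["lon"])]


# Hypothetical box around Osaka station; the coordinates are illustrative only.
osaka_box = BoundingBox(sw_lat=34.69, sw_lon=135.48, ne_lat=34.71, ne_lon=135.51)
sample = [
    {"text": "Yummy!", "lat": 34.7025, "lon": 135.4959},
    {"text": "Hello from Tokyo", "lat": 35.6812, "lon": 139.7671},
]
inside = filter_geo_tweets(sample, osaka_box)
```

In practice a streaming API can apply such a filter on the server side; the local check is shown only to make the geometry explicit.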

The flow of our system is as follows:

After a web user selects a web page to browse, the system returns a tag cloud of tweets, which is presented under the web page in the Web browser; the tags change according to the time period selected by the web user.
When the web user clicks a tag, the system presents a list of the tweets most related to that tag in the Web browser.
When the web user sends a message through the chat box of our system, the system presents it in the tweet list, and Twitter users or other web users can receive it.
When a Twitter user replies to the web user's message through the Twitter service, the system presents the reply, which relates to the web page, in the tweet list.

B. Related Work

Recently, event detection on Twitter has become a very popular research area with many applications, such as a trend detection system that identifies bursty words ^[4], the discovery of breaking events by using hashtags in Twitter ^[1], and an open-domain calendar of significant events extracted from Twitter ^[8]. Our work differs from these studies because our analysis aims to explore spatio-temporal events from tweets, which may help provide users with more complete and useful information.

Several studies have focused on problems such as the summarization and detection of topics in Twitter as well as the mass clustering of tweets. TweetMotif takes an unsupervised approach to tweet clustering ^[6]. Our goal is similar to that work in that we use a clustering method to categorize tweets, but we do so based on spatial and temporal information. For detecting the location information of tweets, Yamaguchi et al. ^[11] proposed an online location inference method that exploits the spatiotemporal correlation over social streams to predict users' locations without geo-tagged data. In our study, we use geo-tagged tweets and web pages to predict users' current locations, even when the tweets do not contain local words.

Numerous studies have shown the usefulness of tweet classification in many domains, ranging from detecting user communities to recommend to advertisers on Twitter ^[7] to extracting real-time and geospatial event features from tweets to enhance event awareness ^[3]. These works focused on topics in Twitter; in contrast, in order to classify tweets into small-scale facilities in different time periods, we focus on the spatial and temporal information of the tweets.

III. Mapping Function

A. Tweet Analysis

To analyze tweets, we first detect location names within a radius $r$ of the latitude and longitude of each acquired geo-tagged tweet by using Google Places API v3 ⁴. Our server database then manages {Twitter user ID, icon URL, latitude, longitude, location name, tweet, word set, acquisition time} (central part of Fig. 2). Next, we determine which tweets are related to the detected location names. To do so, it is necessary to analyze the content of the tweets through a morphological analysis of nouns and adjectives and to filter out the tweets that have a low relation to their locations. We therefore select tweets that contain many feature terms (high-frequency words) describing locations. In particular, we acquire a total of $n$ tweets for a given location and calculate the average frequency of the words $i$ that appear in each tweet $t$. Moreover, we use a standard sigmoid function $1/(1+e^{-x})$ to increase the weights of feature terms related to location names. $\begin{align*} &\frac{1}{m}\sum_{i=1}^{m}\left(\frac{\#\text{tweets with } i}{n}\times\frac{1}{1+e^{-x}}\right)\tag{1}\\ &x=\frac{\#\text{tweets with } i}{n}\tag{2} \end{align*}$

Fig. 2. System architecture.

Here, $m$ denotes the total number of words that appear in tweet $t$. If the value of Eq. (1) exceeds a threshold, $t$ is considered related to its location. The value $x$, the DF (document frequency) value of word $i$, is calculated by Eq. (2).
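
Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: tweets are assumed to be already tokenized into lists of nouns and adjectives, $m$ is taken to be the number of distinct words in the tweet (an assumption), and the function names and sample tweets are hypothetical. The threshold itself is not given in the text, so none is applied here.

```python
import math


def document_frequency(tweets):
    """Eq. (2): fraction of the n tweets that contain each word."""
    n = len(tweets)
    counts = {}
    for words in tweets:
        for w in set(words):  # count each word once per tweet
            counts[w] = counts.get(w, 0) + 1
    return {w: c / n for w, c in counts.items()}


def relevance_score(words, df):
    """Eq. (1): mean over the tweet's words of DF times sigmoid(DF)."""
    unique = set(words)  # assumption: m counts distinct words
    if not unique:
        return 0.0
    total = 0.0
    for w in unique:
        x = df.get(w, 0.0)
        total += x * (1.0 / (1.0 + math.exp(-x)))
    return total / len(unique)


# Hypothetical tokenized tweets acquired for one location.
tweets = [
    ["lucua", "lunch", "yummy"],
    ["lucua", "nails"],
    ["lucua", "lunch"],
    ["train", "delay"],
]
df = document_frequency(tweets)
scores = [relevance_score(t, df) for t in tweets]
```

With these sample tweets, the first tweet scores higher than the last one because its words are shared by more of the tweets acquired for the location.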

B. Web Page Analysis

To analyze web pages, we first extract high-frequency nouns from snippets of the acquired URLs by using Yahoo! Web API ⁵. Next, we detect feature terms such as location names among the extracted high-frequency proper nouns by using a morphological analyzer called JUMAN ⁶. All location names in each web page are then geocoded to latitude and longitude by using Google Places API v3. We also extract category names of small-scale facilities, which are labeled manually by referring to the floor guide information of composite buildings in web pages.

C. Tweet Classification

In this work, we classify tweets based on category names of composite buildings taken from web pages, for each of 8 time periods obtained by dividing a day into 3-hour intervals. We adopt three classifiers: $k$-NN ($k$-nearest neighbor algorithm), the naïve Bayes classifier, and SVM (support vector machine).
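
The division of a day into 8 three-hour periods can be sketched as follows. The sketch assumes the first period starts at 00:00, which is consistent with the 06:00–09:00 and 12:00–15:00 periods mentioned in the evaluation but is not stated explicitly; the function names are hypothetical.

```python
from datetime import datetime


def time_period(timestamp: datetime) -> int:
    """Return the index (0..7) of the 3-hour period containing the timestamp."""
    return timestamp.hour // 3


def period_label(index: int) -> str:
    """Human-readable label for a period index, e.g. 4 -> '12:00-15:00'."""
    start = index * 3
    return f"{start:02d}:00-{start + 3:02d}:00"


# Hypothetical acquisition time of one tweet.
acquired = datetime(2015, 12, 1, 13, 30)
index = time_period(acquired)
label = period_label(index)
```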

$k$-NN. It is a simple classification algorithm based on the similarity between target data and training data, measured by the Euclidean distance. We extract nouns and adjectives from tweets and compute a vector for each tweet in the target set from the DF value of each word in the tweet, using Eq. (2). We also assign a class (category name) to each tweet in the training set. To align the DF values of all words across tweets, if $q$ distinct words appear in all tweets, each tweet is represented as a vector in a $q$-dimensional space. The similarity between a target tweet $F$ and a training tweet $L$ is calculated from their vectors as follows, where a smaller value means a higher similarity: $\begin{equation*} sim(F, L)=\sqrt{\sum\limits_{i=1}^{q}(F_{i}-L_{i})^{2}} \tag{3} \end{equation*}$

Therefore, we can extract the classes of the training data with the highest similarity (i.e., the smallest distance). Each target tweet is then assigned to the most common class among its nearest training data by a majority vote; when several classes tie for the maximum, the tweet is rejected. $\begin{equation*} \text{class}=\begin{cases} j & \text{where } \{c_{j}\}=\max\{c_{1},\ \ldots,\ c_{k}\}\\ \text{reject} & \text{where } \{c_{i},\ \ldots,\ c_{j}\}=\max\{c_{1},\ \ldots,\ c_{k}\} \end{cases} \end{equation*}$

Here, $k=8$, and the classes of the $k$ nearest training data are obtained by using Eq. (3).
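
A minimal sketch of this procedure, using Eq. (3) as the distance and a majority vote that rejects ties, could look as follows. The function names, the two-dimensional toy vectors, and the use of `None` to signal rejection are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Eq. (3): Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(target, training, k=8):
    """Majority vote among the k nearest training vectors.

    training is a list of (vector, label) pairs. Returns the winning
    label, or None (reject) when the vote is tied or training is empty.
    """
    nearest = sorted(training, key=lambda item: euclidean(target, item[0]))[:k]
    ranked = Counter(label for _, label in nearest).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several classes share the maximum vote count
    return ranked[0][0]


# Toy training set with hypothetical DF-based vectors.
training = [
    ([0.9, 0.1], "Restaurants"),
    ([0.8, 0.2], "Restaurants"),
    ([0.1, 0.9], "Fashion"),
    ([0.2, 0.8], "Fashion"),
]
label_k3 = knn_classify([0.85, 0.15], training, k=3)
label_k4 = knn_classify([0.85, 0.15], training, k=4)
```

With $k=3$ the two "Restaurants" neighbors outvote the single "Fashion" neighbor, whereas with $k=4$ the vote is tied and the tweet is rejected.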

Naïve Bayes classifier. It is a probabilistic classifier based on supervised learning. From a training set of tweets and classes (category names), we calculate the probability of each class given a tweet: $\begin{equation*} P(C\vert W_{t})= \frac{P(C)P(W_{t}\vert C)}{P(W_{t})} \end{equation*}$

Here, $W_{t}$ denotes the bag of words extracted from each tweet $t$, and $C$ denotes a class in the set of classes. Each tweet is assigned to the class with the highest probability.
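
The following sketch shows one common way to realize such a classifier: a multinomial naïve Bayes model with add-one (Laplace) smoothing, evaluated in log space. The smoothing, the handling of unseen words, and all names and sample tweets are assumptions for illustration; the text does not specify these details. Since $P(W_{t})$ is the same for every class, it is omitted when comparing classes.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples is a list of (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, model):
    """Return the class maximizing log P(C) + sum of log P(w | C)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior P(C)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # assumption: words unseen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled tweets (already tokenized).
samples = [
    (["yummy", "lunch"], "Restaurants"),
    (["delicious", "pasta"], "Restaurants"),
    (["cute", "nails"], "Lifestyle Goods"),
    (["fit", "buy", "dress"], "Fashion"),
]
model = train_naive_bayes(samples)
predicted = classify(["yummy", "pasta"], model)
```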

SVM. It is a supervised learning model for classification and regression analysis. We adopt a linear kernel for tweet classification with SVM, and it classifies the tweets by using the same training set as $k$-NN.

Based on the above, when Twitter users post tweets and a user browses a web page, the system can present the tweets that are relevant to the page based on a location name, and classify those tweets based on category names from the web page in different time frames of a day. For this purpose, the database stores the obtained tweets, the obtained web pages, the detected location names, and the labeled category names (central part of Fig. 2).

IV. Visualization of Tweets

The TweeVist user interface has three parts in a Web browser: a web page browsing part at the top, a tag cloud with a time period selection bar and a chat box at the bottom left, and a tweet list at the bottom right (see Fig. 3). Users can easily grasp an overview or detailed information of events from tweets with respect to both time and space while they browse web pages, and web users and Twitter users can also communicate with each other in real time.

Fig. 3. TweeVist user interface.

A. Tag Cloud Generation

To generate a tag cloud, we apply a tf-idf style method to the classified tweets in the 8 time periods. We calculate the tf-icf value of each word $i$ that appears in tweets by using tf and icf as follows: $\begin{align*} tf \ &= \ \frac{\# i \ \text{in each time period}}{\text{total} \ \#\text{words in each time period}}\\ icf \ &= \ \frac{\text{total} \ \#\text{categories in all}\ (=8) \ \text{time periods}}{\#\text{categories with}\ i} \end{align*}$

Therefore, we can generate a tag cloud of tweets by adjusting the font sizes of feature words based on their tf-icf values. Although the position of words is important in a tag cloud, in this paper we provide an intuitive interface by changing only the font sizes across different time periods.
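
The tag cloud computation can be sketched as follows. Two points are assumptions rather than details given in the text: the tf-icf value is taken as the plain product of tf and icf (no logarithm), and font sizes are scaled linearly between a minimum and a maximum point size. The function names, point sizes, and sample data are hypothetical.

```python
from collections import Counter


def tf_icf_scores(period_words, category_count, total_categories):
    """tf-icf for every word observed in one time period.

    period_words: all words of the tweets in the period.
    category_count: word -> number of categories that contain the word.
    total_categories: total number of categories considered.
    """
    counts = Counter(period_words)
    total = len(period_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        icf = total_categories / category_count[word]
        scores[word] = tf * icf  # assumption: product of tf and icf
    return scores


def font_sizes(scores, min_pt=10, max_pt=36):
    """Map scores linearly onto font sizes (assumed scaling)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {w: float(min_pt) for w in scores}
    return {w: min_pt + (s - lo) / span * (max_pt - min_pt)
            for w, s in scores.items()}


# Hypothetical words from tweets in one morning time period.
period_words = ["pass", "pass", "fine", "mickey"]
category_count = {"pass": 1, "fine": 2, "mickey": 4}
scores = tf_icf_scores(period_words, category_count, total_categories=4)
sizes = font_sizes(scores)
```

In this toy example "pass" is frequent in the period and specific to one category, so it receives the largest font, while "mickey" appears in every category and receives the smallest.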

B. Visualizing Tweets Over Time and Space

Users can browse a web page and simultaneously obtain a tag cloud of tweets covering all time periods and a list of streaming tweets that are related to the web page. Furthermore, users can freely specify a time period and click a tag to view its related tweets. Fig. 3 depicts a user browsing an official website of Walt Disney World Resort in the Web browser. TweeVist presents a tag cloud of tweets covering all time periods (ALL TIME is checked by default) and a list of tweets, sorted by time (latest to earliest), that are related to the web page based on the location name “Walt Disney World Resort.” The user can immediately gain a quick overview of Christmas events, e.g., Star Wars, during the holiday season from the tag cloud.

Fig. 4. Users interact with TweeVist.

Fig. 4 shows user interactions with TweeVist. When the user checks the time period 7–11, TweeVist returns a tag cloud of tweets located at Walt Disney World Resort in the time period 7:00–11:00. For example, compared with the tag cloud of tweets in all time periods (Fig. 3), the font sizes of “mickey” and “minnie” decrease and the font sizes of “fine” and “pass” increase in the morning (7:00–11:00). When the user clicks the tag entrance, TweeVist shows a list of the tweets most related to “entrance,” sorted by time (latest to earliest), from which the user can easily obtain more detailed information about the entrances of all theme parks.

V. Evaluation

A. Dataset

The dataset was built by retrieving 31.6 million geo-tagged tweets posted across Japan between 2015/7/13 and 2015/12/17. In order to evaluate the accuracy of floor-based tweet classification when locations are composite facilities, we narrowed the test dataset down to tweets posted within a radius $r=200\mathrm{m}$ of the large shopping mall “LUCUA Osaka” near Osaka station (7,366 tweets in total during one month), as shown in Table I. Table II shows the floor-based category names extracted from the web page of LUCUA Osaka. Since some floors in a composite facility belong to the same genre, we grouped such floors into the same category; e.g., 1F to 7F can be grouped into “Fashion,” and 9F and 10F can be grouped into “Lifestyle Goods.”

To build a training set, 13 subjects judged whether the tweets were related to the categories. Subjects gave a score of 0 if the tweet was not related to its category, 1 if it was weakly related, 2 if it was related, and 3 if it was strongly related. Each tweet was assigned to the category for which its average score was the maximum.
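
The labeling rule above can be sketched as follows: the scores given by the subjects are averaged per category, and the category with the maximum average is selected. Mapping a tweet whose best average is 0 to “No Relation” is an assumption made here for illustration, as are the function name and the sample scores.

```python
from statistics import mean


def assign_category(scores_by_category):
    """Pick the category with the highest average subject score.

    scores_by_category maps a category name to the list of scores
    (0 to 3) that the subjects gave the tweet for that category.
    """
    averages = {c: mean(s) for c, s in scores_by_category.items()}
    best = max(averages, key=averages.get)
    if averages[best] == 0:
        return "No Relation"  # assumption: no subject saw any relation
    return best


# Hypothetical scores from three subjects for one tweet.
tweet_scores = {
    "Restaurants": [3, 3, 2],
    "Sweets": [1, 2, 1],
    "Fashion": [0, 0, 0],
}
category = assign_category(tweet_scores)
```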

Table I Experimental dataset of LUCUA Osaka

Table II Category names of floors in LUCUA Osaka

B. Accuracy of Tweet Classification Based on Floors

We compared the accuracy of tweet classification with $k$-NN, the naïve Bayes classifier, and SVM by calculating two measures for each category of LUCUA Osaka: the RMSE (root mean square error)⁷ between the number of categories and the average scores of tweets, and the precision, which measures the relevance of the tweets. The RMSE and precision of $k$-NN are 0.096 and 0.513, respectively. The RMSE and precision of the naïve Bayes classifier are 0.311 and 0.124, respectively. The RMSE of SVM is 0.088. $k$-NN and SVM yield good results; however, SVM classifies many tweets into “No Relation,” so we need to remove such tweets from the training data when using SVM. In particular, $k$-NN could identify adjectives, e.g., delicious, in the tweets. For example, the tweet “Very satisfied this amount in normal size at 650 yen! Yummy!!!” could be classified into “Restaurants.”
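
For reference, the two measures can be computed as in the sketch below. It shows only the standard definitions of RMSE and per-category precision; how the predicted and reference values were paired in the experiment is not specified in enough detail to reproduce, so the inputs and names here are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


def precision(predicted_labels, true_labels, category):
    """Share of tweets assigned to a category that truly belong to it."""
    assigned = [t for p, t in zip(predicted_labels, true_labels)
                if p == category]
    if not assigned:
        return 0.0
    return sum(1 for t in assigned if t == category) / len(assigned)


# Hypothetical classifier output versus subject-based labels.
predicted_labels = ["Restaurants", "Restaurants", "Fashion", "Restaurants"]
true_labels = ["Restaurants", "Fashion", "Fashion", "Restaurants"]
restaurant_precision = precision(predicted_labels, true_labels, "Restaurants")
error = rmse([0.0, 0.0], [3.0, 4.0])
```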

Across all results, tweets were often wrongly categorized when specific shop names or chain store names appeared in them. For instance, tweets of various categories contain the chain store name “Umeda Store,” but they are wrongly classified into the category “Sweets.” Another problem is orthographic variation, because Japanese can be written in both kanji and hiragana.

C. Verification of Feature Words Changed Over Time

We calculated a tf-icf ranking of the feature words of each category in each time period. In order to verify how feature words change over time, we compared the top-15 words with the highest tf-icf values of each category in different time periods, as shown in Table III. Here, underlined words denote feature words of a category that appear in several time periods, and bold words denote feature words that are related to the category. The results and findings are as follows:

In category “Restaurants” (10F), the correlations between every two adjacent time periods become high from just after noon (12:00–15:00).
In categories “Lifestyle Goods” (9F) and “Restaurants” (10F), the number of related feature words increases from just after noon (12:00–15:00). For example, no feature words about food appear in the morning (06:00–09:00) because there are no tweets related to restaurants before opening.
In all categories except “Restaurants” (10F), we could confirm that the feature words change greatly across different time periods.

Table III Top-15 high tf-icf words of each category in all time periods

In summary, the feature words of each category show low correlations between every two adjacent time periods. We thus confirmed that the topics of tweets in each category change over different time periods, since many feature words of each category differ from one time period to another.

VI. Conclusions

In this paper, we developed a novel geo-tweet visualization system (TweeVist) that helps users grasp events over time and space from tweets, using the contents of both tweets and web pages based on spatio-temporal analysis. TweeVist maps tweets to web pages by matching location names, and classifies the tweets in different time frames of a day based on category names taken from the floor information of web pages. Experimental results show that TweeVist can effectively map related tweets to web pages in different time periods, and that it can present a summary of tweets in a tag cloud to help users gain a quick overview of the current situation in each category (floor) of composite facilities.

For future work, we plan to enhance TweeVist based on the experimental results, and to carry out verification experiments for many types of composite facilities in different time periods with many more subjects. Furthermore, we will evaluate the usability of TweeVist for users viewing tweets and their summary information associated with web pages.

Footnotes

1 https://dev.twitter.com/streaming/overview

2 https://Twitter.com/@RtQAService

3 https://html.spec.whatwg.org/multipage/comms.html#network

4 https://developers.google.com/place

5 http://developer.yahoo.co.jp/

6 http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN

7 https://www.kaggle.com/wiki/RootMeanSquaredError

Acknowledgment

This work was partially supported by JSPS KAKENHI Grant Numbers 26280042, 15K00162, 16H01722.

References

[1]A. Cui, M. Zhang, Y. Liu, S. Ma, and K. Zhang. Discover breaking events with popular hashtags in twitter. In CIKM2012, pages 1794–1798.
[2]T. M. Ghanem, A. Magdy, M. Musleh, S. Ghani, and M. F. Mokbel. Viscat: Spatio-temporal visualization and aggregation of categorical attributes in twitter data. In SIGSPATIAL2014, pages 537–540.
[3]C.-H. Lee. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 39(10): 9623–9641, 2012.
[4]C. Li, A. Sun, and A. Datta. Twevent: Segment-based event detection from tweets. In CIKM2012, pages 155–164.
[5]A. Magdy, L. Alarabi, S. Al-Harthi, M. Musleh, T. M. Ghanem, S. Ghani, and M. F. Mokbel. Taghreed: A system for querying, analyzing, and visualizing geotagged microblogs. In SIGSPATIAL2014, pages 163–172.
[6]B. O'Connor, M. Krieger, and D. Ahn. Tweetmotif: Exploratory search and topic summarization for twitter. In ICWSM2010, pages 384–385.
[7]S. Poomagal, P. Visalakshi, and T. Hamsapriya. A novel method for clustering tweets in twitter. International Journal of Web Based Communities, 11(2): 170–187, 2015.
[8]A. Ritter, O. Etzioni, and S. Clark. Open domain event extraction from twitter. In SIGKDD2012, pages 1104–1112.
[9]T. Sakaki, M. Okazaki, and Y. Matsuo. Tweet analysis for realtime event detection and earthquake reporting system development. IEEE Transactions on Knowledge and Data Engineering, 25(4): 919–931, 2013.
[10]Y. Wang, G. Yasui, Y. Hosokawa, Y. Kawai, T. Akiyama, and K. Sumiya. Twinchat: A twitter and web user interactive chat system. In CIKM2014, pages 2045–2047.
[11]Y. Yamaguchi, T. Amagasa, H. Kitagawa, and Y. Ikawa. Online user location inference exploiting spatiotemporal correlations in social streams. In CIKM2014, pages 1139–1148.
[12]G. Yasui, Y. Wang, Y. Hosokawa, Y. Kawai, T. Akiyama, and K. Sumiya. A simultaneous user communication system between microblogs and web pages [in japanese]. DBSJ Japanese Journal, 13-J(2): 7–12, 2015.