When: May 10, 2013 @ 10:00 am - 12:00 pm

 Student Name: Changhyun Byun

Data Collection In A Social Network With Weighted Seed Selection And Data Analysis Based On Rule-Based Methods


In recent years, with the increasing popularity of diverse online social network sites, such as Facebook, Twitter, Blogger, YouTube, LinkedIn, and MySpace, a massive amount of data has become available. Analyzing sets of data in social media can lead to some understanding of individual and human behavior, detection of hot topics, identification of influential people, or discovery of a group or community. However, it is difficult to discover useful information from social data without automated information processing because of three main characteristics of social media data sets: the data is large, noisy, and dynamic. In order to overcome these challenges of social media, data-mining techniques can be used by data seekers to discover a diversity of perspectives that would otherwise not be possible.

To apply data-mining techniques to social data, the target data set must first be collected from the social network. To this end, Twitter gives researchers and data analysts access to a variety of its data through an Application Programming Interface (API). However, data collection from Twitter is restricted: the number of calls to Twitter API methods is limited. Furthermore, without an automated data collector and filter, it is not feasible to collect enough data for analysis and to filter out unnecessary data, such as spam messages. To overcome these data access problems, we design and implement our own Twitter data-collecting tool, which includes data filtering and analysis capabilities. This allows us, as well as other researchers and data seekers, to build custom Twitter datasets.
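As a rough illustration of the two constraints described above (a limited number of API calls and the need to filter noise), the following Python sketch shows a call budget and a simple rule-based spam check. It is not the tool's actual implementation: the class `CallBudget`, the function `looks_like_spam`, the limits, and the rules are all hypothetical and chosen only for the example.

```python
import time


class CallBudget:
    """Allow at most `max_calls` calls per window of `window_seconds`.

    The limits are placeholders, not Twitter's real quotas.
    """

    def __init__(self, max_calls, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.used = 0

    def acquire(self):
        """Block until one more call is allowed, then count it."""
        now = self.clock()
        # Start a fresh window if the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        # Budget exhausted: wait for the rest of the window.
        if self.used >= self.max_calls:
            self.sleep(self.window_seconds - (now - self.window_start))
            self.window_start = self.clock()
            self.used = 0
        self.used += 1


def looks_like_spam(text, max_links=2, max_hashtags=5):
    """Illustrative rules only: too many links or hashtags marks spam."""
    words = text.split()
    links = sum(1 for w in words if w.startswith(("http://", "https://")))
    hashtags = sum(1 for w in words if w.startswith("#"))
    return links > max_links or hashtags > max_hashtags
```

A collector built this way would call `budget.acquire()` before each API request and drop any message for which `looks_like_spam` returns `True` before storing it.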

First, we introduce the design specifications and explain the implementation details of the Twitter Data Collecting Tool we developed, using Unified Modeling Language (UML) diagrams.

We next propose a new algorithm that, given limited resources and time, selects the best seed nodes for efficiently collecting data related to a specific topic and keyword. The algorithm also evaluates various user influence and activity factors, and updates the seed nodes dynamically during the gathering process. After the gathering process, we compare two results: one obtained with this algorithm and one obtained by a specialist.
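The idea of weighted seed selection can be sketched as follows. This is a minimal illustration, not the algorithm proposed in the dissertation: the factors (follower count as an influence proxy, tweets per day as an activity proxy, and the count of keyword-related tweets seen so far), the weights, and the names `Candidate`, `score`, and `select_seeds` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    followers: int           # assumed influence proxy
    tweets_per_day: float    # assumed activity proxy
    keyword_tweets: int = 0  # keyword-related tweets seen so far


def _norm(value, maximum):
    """Scale a value to [0, 1]; return 0 when there is no data yet."""
    return value / maximum if maximum > 0 else 0.0


def score(c, max_followers, max_rate, max_hits, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of normalized influence, activity, and relevance."""
    w_inf, w_act, w_rel = weights
    return (w_inf * _norm(c.followers, max_followers)
            + w_act * _norm(c.tweets_per_day, max_rate)
            + w_rel * _norm(c.keyword_tweets, max_hits))


def select_seeds(candidates, k, weights=(0.4, 0.3, 0.3)):
    """Return the k highest-scoring candidates as the current seed set."""
    max_followers = max((c.followers for c in candidates), default=0)
    max_rate = max((c.tweets_per_day for c in candidates), default=0)
    max_hits = max((c.keyword_tweets for c in candidates), default=0)
    ranked = sorted(
        candidates,
        key=lambda c: score(c, max_followers, max_rate, max_hits, weights),
        reverse=True,
    )
    return ranked[:k]
```

To mimic the dynamic update, a collector would increment `keyword_tweets` as matching tweets arrive and call `select_seeds` again after each gathering round, so that users who turn out to post about the keyword can displace less relevant seeds.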

In the final chapter, we provide an analysis of Twitter data gathered by the Twitter Data Collecting Tool in a case study about the Super Bowl 2012 and Super Bowl 2013. The case study aims to address the question of how people use Twitter and to assess the power of Twitter in creating consumer interest in brands and commercials. The main objective of this study is to find the relationship between Twitter and Super Bowl advertisements by analyzing data on Twitter.

This research shows that the Twitter Data Collecting Tool allows researchers to gather users’ information, follow relationships, and tweets from Twitter. Furthermore, the data collection results show that the seed selection algorithm collects keyword-related data more efficiently than the existing approach. In addition, data-mining techniques and rule-based data analysis are applied to the gathered data. These results demonstrate that the Twitter Data Collecting Tool can gather a large amount of data from Twitter and filter it for use in research. This work will be valuable to those who want to build their own Twitter dataset, apply customized filtering options to remove unnecessary, noisy data, and analyze social data to discover new knowledge.

  •  Committee Chair: Dr. Yanggon Kim
  •  Committee Members: Dr. Josh Dehlinger, Dr. Sungchul Hong, and Dr. Siddharth Kaza
  •  Date: 10 a.m. on Friday, May 10, 2013
  •  Location: YR-459, 4th floor COSC Conference Room