FREQUENTLY ASKED QUESTIONS

Q: What is PopulusLog? 

PopulusLog is a web data mining research project developed by two members of the Database Lab at Case. This project is not officially affiliated with Case Western Reserve University, and is conducted solely for research purposes.

Q: Does PopulusLog violate privacy of personal information?  

No. PopulusLog only presents information that is collected from publicly available web pages (see "How is the data collected?" below). The information (email, department, etc.) obtained from the Case phone book database is available only to people who have a Case network id, just as it is in the Case Directory. The other information, such as affiliated organizations and locations, is collected from publicly available web pages that are indexed by search engines like Google. Hence, PopulusLog does not contain any information about individuals that is more private than what is already available on publicly accessible web pages. Furthermore, people can remove their records from the PopulusLog database any time they want (see the next entry).

Q: Can I remove my record from PopulusLog Case Edition? 

Yes, certainly. You can opt out of the PopulusLog database by first locating your record, then clicking the link "Remove your record from Populuslog" towards the end of the page. To remove your record, you need a Case network id that is authenticated through the Case Single Sign-on service implemented by the university.

Q: How is the data collected? 

Names and other contact information are collected from the Case phone book database. The other information is extracted from publicly accessible web pages in an automated manner through the application of several data mining techniques. By no means do we guarantee the correctness or the completeness of the collected data. This project and the associated data are maintained for experimental purposes.

Web Search for User Pages

We implemented a search engine crawler that queries a search engine (Yahoo, MSN, Google, A9, or MetaCrawler) and stores the top k results of each query in our database. The crawler creates, reuses, and concurrently handles search engine threads, and is capable of retrieving and parsing hundreds of search engine results per second. However, most search engines detect and block this kind of bulk querying to force people to use their web service toolkits.

The Google web service toolkit is limited to 1000 queries per day per license. Even though getting a license from Google is free, we would need a license for each running instance of our crawler, and 1000 queries is too few to query all people in our database. The Yahoo toolkit provides better facilities: it lets a user query the search engine 5000 times a day per machine IP address, and it lets us use the same license for all instances of our crawler. However, 5000 is still too small a number to process all names in our database. Therefore, we had to use the web interfaces of the search engines.

Among the search engines that we analyzed, the Google servers give the fastest response to our queries. However, Google is also the toughest in terms of preventing bulk queries. Google returns an HTTP 999 error code after 100 queries come from the same IP in the same hour, and we then need to go to the Google website and unlock the machine IP by entering a special code into a form. Yahoo and MSN apply the same blocking, but they unlock the blocked machine IP automatically after 10 minutes. The only search engine that allows bulk queries is MetaCrawler, which queries several search engines and combines their results. We finished querying the whole name list in less than an hour by using the MetaCrawler search engine. Using our search engine crawler, we collected the links to web pages that contain a person's name. We found 28927 links for 21636 people (a web page may contain the names of multiple people) in the Case domain. There was no search engine result for the rest of 42852 people.

We also implemented a multithreaded crawler to download the web pages referenced by the links that the search engine crawler collected. We downloaded 24211 web pages in less than 20 minutes (all links are in the Case domain, and we used a machine in the Case domain) and found that 4716 of the links were dead.
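The download step can be illustrated with a minimal sketch. This is not the PopulusLog crawler itself; the function names (fetch_url, download_all), the thread count, and the rule that any failed request counts as a dead link are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple
import urllib.request


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher (hypothetical helper): download a page body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(
    links: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
    max_workers: int = 32,
) -> Tuple[Dict[str, bytes], List[str]]:
    """Download every distinct link concurrently.

    Returns (pages, dead_links). Assumption: any exception raised by the
    fetcher (HTTP error, timeout, DNS failure) marks the link as dead.
    """
    pages: Dict[str, bytes] = {}
    dead: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each pending download back to the URL it belongs to.
        futures = {pool.submit(fetch, url): url for url in set(links)}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception:
                dead.append(url)
    return pages, sorted(dead)
```

Passing the fetcher as a parameter keeps the thread handling separate from the network code, so the same function can be exercised with a stub instead of live requests.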

Data Extraction

The previous section described how we collect web pages related to people using a search engine. At this stage, we have a set of documents that are ranked (according to the search engine results) and associated with people. In other words, the documents are the search engine results of person queries. We extract the entities associated with people from the content of these web pages using two different named entity taggers. Named entity taggers are software tools that automatically detect names of entities such as people, locations, and organizations within a text document like a web page.

Our initial aim was to use sentence splitters and co-reference finders to associate people with the entities in the text. However, we observed that essential information about people usually has no paragraph structure and is distributed over the page in the form of item lists (e.g., information in CVs and resumes). Thus, we used statistical information instead of language processing to find such associations.

Our statistical analysis consists of two parts: (i) distinguishing useful tags and (ii) finding relationships between entities.

Home Page Location

PopulusLog relies primarily on the redundancy of information on the web. Hence, it does not employ advanced information extraction techniques, but follows simple heuristics for information compilation. Although it simplifies data collection, information redundancy also makes it harder to decide which person the information belongs to when a web page mentions multiple people. Hence, it is crucial to extract the data about a person from the web page(s) that focus the most on that person. Despite some exceptions, personal home pages usually provide the most focused and complete information about an individual. The current PopulusLog implementation relies on the premise that, when a person's name is searched on a search engine, the first resulting page is his/her personal home page. However, our experience with the data currently stored in PopulusLog indicates that there are many cases in which this premise does not hold. Moreover, even when a person has no personal home page, it is possible to locate information about this person on other pages. In such cases, the most informative pages should be used as information sources. To this end, our approach exploits supervised learning methods, namely, classification and rule-based techniques.

We developed three different classifiers: an SVM-based classifier, a Frequent Item Set based classifier, and a Rule-Based Classifier. Each classifier is applied to the existing data, and then an overall confidence score is computed to decide how informative a web page is. We compute the informativeness of a web page as the probability that the web page is the home page of a person. Hence, a higher home page probability is interpreted as a higher degree of informativeness for a given web page.
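As a rough illustration of how per-classifier outputs might be merged into one informativeness score, the sketch below simply averages the three home-page probabilities. The averaging rule, the function names, and the score keys are assumptions; the actual combination used by PopulusLog is not specified here.

```python
from statistics import mean
from typing import Dict


def informativeness(scores: Dict[str, float]) -> float:
    """Combine per-classifier home-page probabilities into one score.

    Assumption: an unweighted average of the classifier outputs.
    """
    for name, p in scores.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} score {p} is not a probability")
    return mean(list(scores.values()))


def most_informative(pages: Dict[str, Dict[str, float]]) -> str:
    """Return the URL whose combined home-page probability is highest."""
    return max(pages, key=lambda url: informativeness(pages[url]))
```

With this rule, a page that all three classifiers consider a likely home page outranks a page that only one classifier favors, which also covers people who have no personal home page: the best remaining page is still selected.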

Locating Similar People

People may be similar in terms of the entities or the relationships that they are found to have. PopulusLog Case Edition employs three different similarity schemes and combines the outcome of each similarity computation into an overall similarity score between two people. The similarity is computed based on the number of shared locations, shared affiliations, and shared web documents on which the two given people appear. We use the Jaccard measure to compute each similarity, and then take the average of the three results. For instance, location similarity is the ratio of the number of shared locations to the number of all locations associated with either of the two people.
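The similarity computation described above can be sketched as follows. The record layout (a dictionary with "locations", "affiliations", and "documents" sets) and the choice to score two empty sets as 0 are assumptions made for this example.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard measure: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as not similar
    return len(a & b) / len(union)


def person_similarity(p: Dict[str, Set[str]], q: Dict[str, Set[str]]) -> float:
    """Average of the location, affiliation, and document similarities."""
    facets = ("locations", "affiliations", "documents")
    return sum(jaccard(p[f], q[f]) for f in facets) / len(facets)


alice = {
    "locations": {"Cleveland", "Boston"},
    "affiliations": {"Database Lab"},
    "documents": {"page1", "page2"},
}
bob = {
    "locations": {"Cleveland"},
    "affiliations": {"Database Lab"},
    "documents": {"page2", "page3"},
}
score = person_similarity(alice, bob)
```

Here the location similarity is 1/2, the affiliation similarity is 1, and the document similarity is 1/3, so the overall score is their average, about 0.61.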

Virtual Social Networks

We compute an impact factor for each person from the extracted person-person relations. The relationships between people are directed and weighted, where the weight shows the strength of the relationship between two people. As before, we used the PageRank algorithm to compute the impact factors. In this project, our contribution over the existing version is the assignment of weights to the relationships and the consideration of these weights during impact factor computation.
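One common way to take edge weights into account in PageRank is to split a person's rank among outgoing relations in proportion to their weights. The sketch below follows that idea; the damping factor, the iteration count, the handling of people with no outgoing relations, and all names are assumptions, not the exact PopulusLog computation.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]  # source -> {target: relationship weight}


def weighted_pagerank(graph: Graph, damping: float = 0.85,
                      iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank where rank flows in proportion to weight."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_weight = {node: sum(graph.get(node, {}).values()) for node in nodes}

    for _ in range(iterations):
        # Rank held by people with no outgoing relations is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for source, targets in graph.items():
            if out_weight[source] == 0:
                continue
            for target, weight in targets.items():
                share = weight / out_weight[source]
                new_rank[target] += damping * rank[source] * share
        rank = new_rank
    return rank


relations = {
    "ann": {"bob": 3.0, "cem": 1.0},
    "bob": {"cem": 1.0},
    "cem": {"ann": 1.0},
}
impact = weighted_pagerank(relations)
```

In this toy graph, raising the weight of the ann-to-bob relation from 1.0 to 3.0 increases bob's impact factor, which is the effect the weighting is meant to capture.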

Furthermore, we provide a section in the person detail page that gives an overview of the social network of a person. This section has three main parts: people who know the person well (known by), people well known by the displayed person (knows), and friends of the person. The first two relationships, “knows” and “known by”, are obtained directly from the person-person relationships extracted from web pages by our crawler. The “friends” relationship, however, is inferred from these existing relationships. In real life, if two people are friends, they know each other very well, or at least better than others do. Furthermore, the relationship is reciprocal, meaning that both people know each other at similar strengths. In this sense, friendship differs from “being a fan of somebody”: the “fan of” relation is mostly unidirectional, where a popular person is very well known by many others but is usually not aware of most of the people who know him/her very well. Based on this intuition, we select as the friends of a person the top-k people who know him/her very well, whom he/she knows well, and for whom the strength of the relation is similar in both directions, that is, there is no huge difference between the two directions of the relationship. To keep the presentation concise and to eliminate false positives, we display only the top-k people for each social network relationship.
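The friend selection intuition can be sketched in a few lines. The balance threshold (min_balance), the ranking by the weaker of the two directions, and the data layout are assumptions chosen for illustration; the description above only fixes the requirements of reciprocity, similar strength, and a top-k cut-off.

```python
from typing import Dict, List

Strengths = Dict[str, Dict[str, float]]  # knows[a][b] = how strongly a knows b


def friends(knows: Strengths, person: str, k: int = 5,
            min_balance: float = 0.5) -> List[str]:
    """Pick the top-k people in a strong, balanced, two-way relationship."""
    candidates = []
    for other, forward in knows.get(person, {}).items():
        backward = knows.get(other, {}).get(person, 0.0)
        if forward <= 0 or backward <= 0:
            continue  # not reciprocal, so not a friendship
        # Ratio of weaker to stronger direction: 1.0 means perfectly balanced.
        balance = min(forward, backward) / max(forward, backward)
        if balance >= min_balance:
            candidates.append((min(forward, backward), other))
    # Strongest mutual relationships first; names break ties deterministically.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in candidates[:k]]


knows = {
    "ann": {"bob": 0.9, "cem": 0.8, "star": 0.9, "dan": 0.4},
    "bob": {"ann": 0.8},
    "cem": {"ann": 0.7},
    "star": {"ann": 0.1},
    "dan": {},
}
ann_friends = friends(knows, "ann", k=2)
```

In this example ann knows star very well but star barely knows ann, so star is treated as a “fan of” case rather than a friend, and dan is excluded because the relationship is not reciprocal.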

 

     



Copyright©2006-2009 PopulusLog People Information Database
Case Western Reserve University Special Edition

Disclaimer: This project is not officially affiliated with Case Western Reserve University, and is conducted solely for research purposes.  The data is collected through completely automated means, and we assume no responsibility on the correctness or the completeness of the presented data. For more information, please see the "Frequently Asked Questions" page.