Web Search for User Pages
We implemented a search engine crawler that queries a search engine (Yahoo,
MSN, Google, A9, or MetaCrawler) and stores the top k results of each query in
our database. The crawler efficiently creates, reuses, and concurrently manages
search engine threads, and it can retrieve and parse hundreds of search engine
results per second. However, most search engines detect and block this kind of
bulk querying to force people to use their web service toolkits.
The Google web service toolkit is limited to 1000 queries per day per license.
Although a Google license is free, we would need a separate license for each
running instance of our crawler, and 1000 queries are too few to cover all the
people in our database. The Yahoo toolkit is more generous: it allows 5000
queries a day per machine IP address, and the same license can be used for all
instances of our crawler. Still, 5000 is too small a number to process all the
names in our database. Therefore, we had to use the web interfaces of the
search engines.
Among the search engines that we analyzed, Google's servers gave the fastest
responses to our queries. However, Google is also the strictest in preventing
bulk queries. Google returns an HTTP 999 error code after 100 queries from the
same IP address within the same hour, and we then have to go to the Google
website and unlock the machine's IP address by entering a special code into a
form. Yahoo and MSN block in the same way, but they automatically unblock the
machine's IP address after 10 minutes. The only search engine that allows bulk
queries is MetaCrawler, which queries several search engines and combines their
results. Using MetaCrawler, we finished querying the whole name list in less
than an hour. With our search engine crawler, we collected the links to web
pages that contain a person's name. We found 28927 links for 21636 people in
the CASE domain (a web page may contain the names of multiple people). There
were no search engine results for the remaining 42852 people.
We also implemented a multithreaded crawler to download the web pages
referenced by the links that the search engine crawler collected. We downloaded
24211 web pages in less than 20 minutes (all links are in the CASE domain, and
we used a machine in the same domain) and found that 4716 of the links were
dead.
Data Extraction
The previous section described how we collect web pages related to people using
a search engine. At this stage, we have a set of documents that are ranked
(according to the search engine results) and associated with people. In other
words, the documents are the search engine results of person queries. We
extract the entities associated with people from the content of these web pages
using two different named entity taggers. Named entity taggers are software
tools that automatically detect the names of entities such as people,
locations, and organizations within a text document such as a web page.
Our original aim was to use sentence splitters and co-reference finders to
associate people with the entities in the text. However, we observed that
essential information about people usually has no paragraph structure and is
distributed over the page as item lists (e.g., the information in CVs and
resumes). Thus, we used statistical information instead of language processing
to find such associations.
Our statistical analysis consists of two parts: (i) distinguishing useful tags
and (ii) finding relationships between entities.
Home Page Location
PopulusLog relies primarily on the redundancy of information on the web. It
therefore does not employ advanced information extraction techniques but
follows simple heuristics for compiling information. Although redundancy
simplifies data collection, it also makes it harder to determine which person
the information on a web page belongs to when the page mentions multiple
people. Hence, it is crucial to extract the data about a person from the web
page(s) that focus most on that person. Despite some exceptions, personal home
pages usually provide the most focused and complete information about an
individual. The current PopulusLog implementation relies on the premise that,
when a person's name is submitted to a search engine, the first result is
his/her personal home page. However, our experience with the data currently
stored in PopulusLog indicates that this premise fails in many cases. Moreover,
even when a person has no personal home page, it is possible to locate
information about this person on other pages. In such cases, the most
informative pages should be used as information sources. To this end, our
approach exploits supervised learning methods, namely classification and
rule-based techniques.
We developed three different classifiers: an SVM-based classifier, a frequent
item set based classifier, and a rule-based classifier. Each classifier is
applied to the existing data, and then an overall confidence score is computed
to decide how informative a web page is. We compute the informativeness of a
web page as the probability that the web page is the home page of a person.
Hence, a higher home page probability is interpreted as a higher degree of
informativeness for a given web page.
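As a rough illustration of how per-classifier outputs could be merged into one score, the sketch below takes an unweighted mean of the home page probabilities. The text does not state how the three scores are actually combined, so the averaging rule, the function names, and the representation of a classifier as a plain function are all assumptions.

```python
def informativeness(page, classifiers):
    """Combine classifier outputs into one score for a page.

    Each classifier is assumed to be a callable that returns the probability
    (0.0 to 1.0) that the page is a personal home page. The unweighted mean
    used here is an assumption, not the combination rule used in PopulusLog.
    """
    scores = [classify(page) for classify in classifiers]
    return sum(scores) / len(scores)


def most_informative(pages, classifiers):
    """Return the candidate page with the highest combined score."""
    return max(pages, key=lambda page: informativeness(page, classifiers))
```

With such a score, the page used as the information source for a person is simply the highest-scoring candidate among that person's search results.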
Locating Similar People
People may be similar in terms of the entities or the relationships that they
are found to have. The PopulusLog Case edition employs three different
similarity schemes and combines their outcomes into an overall similarity score
between two people. The similarity is based on the shared locations, the shared
affiliations, and the web documents on which both people appear. We use the
Jaccard measure to compute each similarity and then take the average of the
three results. For instance, location similarity is the ratio of the number of
shared locations to the number of all locations associated with either of the
two people.
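The similarity computation described above can be sketched as follows. The dictionary layout of a person record and the field names are hypothetical, and returning 0.0 when both sets are empty is an assumed convention that the text does not specify.

```python
def jaccard(a, b):
    """Jaccard measure: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither person has any items
    return len(a & b) / len(union)


def person_similarity(p, q, features=("locations", "affiliations", "documents")):
    """Average the Jaccard similarity over the three feature sets."""
    scores = [jaccard(p[feature], q[feature]) for feature in features]
    return sum(scores) / len(scores)


# Hypothetical records for two people.
alice = {
    "locations": {"Cleveland", "Boston"},
    "affiliations": {"CWRU"},
    "documents": {"d1", "d2"},
}
bob = {
    "locations": {"Cleveland"},
    "affiliations": {"CWRU"},
    "documents": {"d2", "d3"},
}
similarity = person_similarity(alice, bob)  # mean of 1/2, 1 and 1/3
```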
Virtual Social Networks
Based on the extracted person-person relations, we compute an impact factor for
each person. The relationships between people are directed and weighted, where
the weight indicates the strength of the relationship between two people. As
before, we used the PageRank algorithm to compute the impact factors. In this
project, our contribution over the existing version is the assignment of
weights to the relationships and the consideration of these weights during
impact factor computation.
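A minimal sketch of a weighted PageRank computation is given below to show how relationship weights can enter the impact factor. It is not the PopulusLog implementation: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from people with no outgoing relationships, and the edge direction (the person who knows passes impact to the person known) are all assumptions.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over weighted, directed relationships.

    edges maps (source, target) to a positive weight. Each person distributes
    rank to the people they point to in proportion to the edge weights.
    """
    nodes = sorted({person for edge in edges for person in edge})
    if not nodes:
        return {}
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), weight in edges.items():
        out_weight[src] += weight

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by people without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank
```

In the unweighted algorithm every outgoing relationship receives an equal share of a person's rank; here the share is proportional to the relationship weight, so a strong relationship transfers more impact than a weak one.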
Furthermore, we provide a section in the person detail page that gives an
overview of the person's social network. This section includes three main
parts: people who know the person well (known by), people well known by the
displayed person (knows), and friends of the person. The first two
relationships, “knows” and “known by”, are obtained directly from the
person-person relationships extracted from web pages by our crawler. The
“friends” relationship, however, is inferred from the existing relationships.
In real life, if two people are friends, they know each other very well, or at
least better than others do. Furthermore, the relationship is reciprocal,
meaning that both people know each other with similar strength. In this sense,
friendship differs from “being a fan of somebody”: the “fan of” relation is
mostly unidirectional, since a popular person is very well known by many others
but is usually not aware of most of the people who know him/her very well.
Based on this intuition, we select as the friends of a person the top-k people
who know him/her very well, whom he/she knows well, and for whom the strength
of the relationship is similar in both directions, that is, there is no large
difference between the two directions of the relationship. To keep the
presentation concise and to eliminate false positives, we display only the
top-k people for each social network relationship.
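The friend selection intuition can be sketched as below. The text gives no formula, so the details are assumptions: the strength of a friendship is taken as the weaker of the two directed weights, and "similar strength" is modeled by a hypothetical min_balance threshold on the ratio of the weaker to the stronger direction.

```python
def top_k_friends(person, weights, k=5, min_balance=0.5):
    """Pick up to k reciprocal, balanced relationships as friends.

    weights maps (source, target) to the strength with which source knows
    target. min_balance is a hypothetical cutoff for "similar strength".
    """
    candidates = []
    for (src, dst), w_out in weights.items():
        if src != person:
            continue
        w_in = weights.get((dst, person), 0.0)
        if w_in <= 0.0 or w_out <= 0.0:
            continue  # not reciprocal, so not a friendship
        balance = min(w_in, w_out) / max(w_in, w_out)
        if balance >= min_balance:
            candidates.append((min(w_in, w_out), dst))
    # Strongest mutual relationships first; ties broken by name.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in candidates[:k]]
```

Under these assumptions, a "fan of" relationship is filtered out because one direction is much weaker than the other, while a one-way relationship is dropped for not being reciprocal.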