to VizLinc’s graph navigation and clustering features, we can
see that clusters belong to different categories. Politicians,
artists and sports personalities all have their own clusters.
The word cloud shows the 50 most salient terms in the data
set. Not surprisingly, the locations New York, United States,
New York City, Iraq, Manhattan, and Washington are heavily
mentioned. Organizations such as Congress, Senate, Yankees
(New York Yankees), and Google are also among the most
mentioned entities.
We will rely on a hypothetical use case and a potential action
path to illustrate VizLinc's search capabilities on the New
York Times data set. Let us say that we are interested in
elections around the world. A first approach would be to
search for the string “elections”. The working set decreases
from nearly 40,000 to 834 documents that contain
that string. The list of documents can now be sorted by the
total number of mentions and we could browse the contents
of the top hits. Instead, we will take a look at the map and
location list to see what locations co-occur the most with the
term “elections”. The reader should keep in mind that, after
executing a query, all views are updated to show different
visualizations of the content of the matching document set
only, i.e., the new working set. The entity list in the Search
view shows that “Iraq”, “United States”, and “Israel” are the
most mentioned locations in conjunction with “elections”.
Examining the map shows activity in many other parts of
the world including the major countries in South America.
Let us say that Venezuela piques our interest, so we add
Location:Venezuela to the query from the map view.
The working set now contains 12 documents that could be
browsed within minutes if so desired. The resulting graph
shows the people mentioned in these documents and it is
much more suitable for visual analysis than the original one.
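
The narrowing steps above can be sketched as successive filters over a working set. This is a minimal illustration, not VizLinc's implementation: the `Document` class, the function names, and the three-document toy corpus are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical ingested document: raw text plus extracted entity mentions."""
    doc_id: str
    text: str
    # Mentions grouped by entity type, e.g. {"Location": ["Iraq", "Iraq"]}
    mentions: dict = field(default_factory=dict)


def filter_by_string(docs, term):
    """Keep documents whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return [d for d in docs if needle in d.text.lower()]


def filter_by_entity(docs, entity_type, name):
    """Keep documents that mention the given entity at least once."""
    return [d for d in docs if name in d.mentions.get(entity_type, [])]


def top_entities(docs, entity_type, n=3):
    """Rank entities of one type by total mentions across the working set."""
    counts = Counter()
    for d in docs:
        counts.update(d.mentions.get(entity_type, []))
    return counts.most_common(n)


# Toy corpus (invented for illustration only).
corpus = [
    Document("a", "Elections in Iraq were held.", {"Location": ["Iraq", "Iraq"]}),
    Document("b", "Venezuela prepares for elections.", {"Location": ["Venezuela"]}),
    Document("c", "The Yankees won again.", {"Location": ["New York"]}),
]

# Step 1: string search shrinks the working set.
working_set = filter_by_string(corpus, "elections")
# Step 2: see which locations co-occur most with the search term.
ranked_locations = top_entities(working_set, "Location")
# Step 3: add an entity constraint, analogous to Location:Venezuela.
narrowed = filter_by_entity(working_set, "Location", "Venezuela")
```

Each filter takes a working set and returns a smaller one, so every view can be recomputed from whatever set the latest query produced.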
To get a sense of the importance of each individual in
the working set as described by the co-occurrence relation
defined in previous sections, we re-size the nodes in the graph
according to their centrality score. Also, we can cluster this
new sub-graph to reveal any community structures present.
Figure 10 shows part of the resulting graph.
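
As a rough illustration of sizing nodes by centrality, the sketch below builds a person co-occurrence graph from a working set and scores it. It assumes a PageRank-style score computed by power iteration and a linear mapping from score to node size; the centrality measure VizLinc actually uses is the one defined in the earlier sections, and all function names and data here are hypothetical. Community detection is left out.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(people_per_doc):
    """Undirected weighted graph; edge weight = number of shared documents.

    People who never co-occur with anyone do not appear in the graph.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for people in people_per_doc:
        for a, b in combinations(sorted(set(people)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (assumed measure); scores sum to 1."""
    nodes = list(graph)
    if not nodes:
        return {}
    score = {n: 1.0 / len(nodes) for n in nodes}
    # Total edge weight per node; always positive for nodes in this graph.
    strength = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(score[m] * w / strength[m] for m, w in graph[n].items())
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        score = updated
    return score


def node_sizes(scores, min_size=10.0, max_size=40.0):
    """Map scores linearly onto a display size range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {n: min_size for n in scores}
    return {n: min_size + (s - lo) / span * (max_size - min_size)
            for n, s in scores.items()}


# Toy working set: each inner list holds the people mentioned in one document.
people_per_doc = [
    ["Person A", "Person B"],
    ["Person A", "Person C"],
    ["Person A", "Person D"],
]
graph = build_cooccurrence_graph(people_per_doc)
scores = pagerank(graph)
sizes = node_sizes(scores)
most_central = max(scores, key=scores.get)
```

In this toy graph the person who co-occurs with everyone else receives the highest score and therefore the largest node.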
The graph suggests that one of the most central people is
Hugo Chavez. Hugo Chavez was the president of Venezuela
in 2007 and had been re-elected the previous year. This
is not new information but it demonstrates VizLinc’s abil-
ity to find central people with respect to some user-defined
context. Clustering resulted in three major communities;
a subset is shown in Figure 10. On examination, the three
clusters shown group three different types of actors: U.S.
political and media figures (top-left), South American
political figures (top-right), and artists (bottom-left).
From this point on, if we were interested in the sentiment
and opinions of U.S. politicians towards the government of
Venezuela, we could add members of that community to the
query. If what is relevant to us is stories about South
American leaders and the government of Venezuela, we would
add members of the second community to the query and
examine the resulting documents. Authors
and entertainers appear in the graph due to spurious co-
occurrences in articles that contain lists spanning a variety
of unrelated topics.
8. CONCLUSIONS AND FUTURE WORK
In this paper, we have introduced VizLinc and described
how it combines information extraction, graph analysis, and
geo-location for visualization and exploration of text cor-
pora. We have also presented a case study, centered on a
compilation of articles from the New York Times, to demon-
strate VizLinc’s features.
Now that we have achieved our principal goal of creating
a complete framework for data ingestion, visualization, and
exploration, our future work will focus on making each
component more generic and robust. Modules such as the ones
that generate the graph and perform coreference resolution
yielded reasonable results on the data for which VizLinc was
initially intended. However, these modules turned out to be
rather simplistic for most of the text genres we have tested
so far. Multiple-term searches could also be improved by
restricting the maximum distance between the terms within
a document. This would filter out documents in which the
terms co-occur but are in fact unrelated. Expanding queries
to support “and” and “or” operations is also a subject for
future work. With a platform in place, we can now take a
task-centric approach and assess whether the techniques and
user interactions VizLinc enables are appropriate to the
successful completion of a particular task. Finally, we
understand that
user-defined algorithms and entity types will be required to
analyze certain data sets efficiently. Therefore, we would like
to include a mechanism that would allow users to add these
custom components with ease.
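
The proposed search improvements could look roughly like the sketch below. It is an assumption-laden illustration rather than a planned design: distance is measured in tokens, terms are single words, and the tokenizer and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, term):
    """Indices at which a single-word term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within_distance(text, term_a, term_b, max_distance):
    """True if some occurrence of term_a is at most max_distance tokens from term_b."""
    tokens = tokenize(text)
    pos_a = positions(tokens, term_a.lower())
    pos_b = positions(tokens, term_b.lower())
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def matches_all(text, terms):
    """Boolean 'and': every term occurs somewhere in the document."""
    tokens = set(tokenize(text))
    return all(t.lower() in tokens for t in terms)


def matches_any(text, terms):
    """Boolean 'or': at least one term occurs in the document."""
    tokens = set(tokenize(text))
    return any(t.lower() in tokens for t in terms)


# Toy documents (invented): the terms are adjacent in one, far apart in the other.
near_doc = "Elections in Venezuela drew record turnout."
far_doc = ("Elections were held in Iraq. " + "filler " * 20
           + "A museum opened in Venezuela.")

near_hit = within_distance(near_doc, "elections", "venezuela", max_distance=5)
far_hit = within_distance(far_doc, "elections", "venezuela", max_distance=5)
far_and = matches_all(far_doc, ["elections", "venezuela"])
near_or = matches_any(near_doc, ["iraq", "venezuela"])
```

The second document satisfies a plain "and" query but fails the distance constraint, which is the kind of unrelated co-occurrence the restriction is meant to exclude.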
9. REFERENCES
[1] M. Bastian, S. Heymann, M. Jacomy, et al. Gephi: an
open source software for exploring and manipulating
networks. ICWSM, 8:361–362, 2009.
[2] T. Boudreau, J. Tulach, and R. Unger. Decoupled
design: building applications on the netbeans
platform. In Companion to the 21st ACM SIGPLAN
symposium on Object-oriented programming systems,
languages, and applications, pages 631–631. ACM,
2006.
[3] J. R. Finkel, T. Grenager, and C. Manning.
Incorporating non-local information into information
extraction systems by gibbs sampling. In Proceedings
of the 43rd Annual Meeting on Association for
Computational Linguistics, pages 363–370. Association
for Computational Linguistics, 2005.
[4] J. R. Harger and P. J. Crossno. Comparison of
open-source visual analytics toolkits. In IS&T/SPIE
Electronic Imaging, pages 82940E–82940E.
International Society for Optics and Photonics, 2012.
[5] A. Özgür, B. Cetin, and H. Bingol. Co-occurrence
network of reuters news. International Journal of
Modern Physics C, 19(05):689–702, 2008.
[6] L. Page, S. Brin, R. Motwani, and T. Winograd. The
pagerank citation ranking: Bringing order to the web.
1999.
[7] M. A. Rodriguez and P. Neubauer. The graph
traversal pattern. arXiv preprint arXiv:1004.1001,
2010.
[8] M. Rosvall and C. T. Bergstrom. Maps of random
walks on complex networks reveal community
structure. Proceedings of the National Academy of
Sciences, 105(4):1118–1123, 2008.
[9] E. Sandhaus. The new york times annotated corpus
ldc2008t19. Linguistic Data Consortium, 2008.
[10] B. Wright, J. Payne, M. Steckman, and S. Stevson.