What can we learn from the last 200 million things that happened in the
world?
Posted by Joshua Keating, Wednesday, April 10, 2013 - 1:43 PM
If you've been following the political science geek Twitter/blogosphere for
the last few days, you've probably come across the mysterious acronym GDELT.
The excitement over Global Data on Events, Location, and Tone -- to give its
full name -- is understandable. The singularly ambitious project could have
a transformative effect on how we use data to understand and anticipate
political events.
Essentially, GDELT is a massive list of important political events that have
happened -- more than 200 million and counting -- identified by who did what
to whom, when and where, drawn from news accounts and assembled entirely by
software. Everything from a riot over food prices in Khartoum, to a suicide
bombing in Sri Lanka, to a speech by the president of Paraguay goes into the
system.
Similar event databases have been built for particular regions, and DARPA
has been working along similar lines for the Pentagon with a project known
as ICEWS, but as a publicly accessible project (you can download it here,
though you'll need some programming skills to use it) GDELT is unprecedented
in its geographic and historical scale. The database updates with new events
every night following the day's news, and while it currently goes back to
1979, its developers are working on adding events going back as far as 1800,
according to lead author Kalev Leetaru, a fellow at the University of
Illinois Graduate School of Library and Information Science. (I've
previously written about his work here.)
"It's the sheer size," says Leetaru, when asked what makes the project
unique. " And the resolution. It's not just saying an event took place in
Syria. It's saying who did what to whom. It will tell us that it was the
military who attacked Christian civilians in this city on this day. If the
article says it was worshippers who were attacked in their church, that will
all be captured."
Events are classified into four broad types: material conflict, material
cooperation, verbal conflict, and verbal cooperation. Within those
categories, events are coded using a 300-category taxonomy called CAMEO,
developed by Penn State's Philip A. Schrodt, to provide detail on the actors
involved and the action that occurred.
For instance, an event like "Students and police fought in the Egyptian
capital" will be coded as "EDU fought COP," along with the location and time
at which the event took place.
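To make the coding scheme concrete, here is a minimal Python sketch built on
invented toy records. The field names (Actor1Code, QuadClass, and so on)
follow the publicly documented GDELT/CAMEO codebooks, but the records, file
handling, and exact values are assumptions for illustration, not anything
taken from the article:

import pandas as pd

# Toy records mimicking GDELT's coded event fields. A real analysis would
# load one of the daily tab-delimited exports instead of this hand-made list.
events = pd.DataFrame([
    # "Students and police fought in the Egyptian capital"
    {"SQLDATE": 20130410, "Actor1Code": "EGYEDU", "Actor2Code": "EGYCOP",
     "EventRootCode": "19", "QuadClass": 4, "ActionGeo_CountryCode": "EG"},
    # A verbal-cooperation event (a presidential speech) for contrast
    {"SQLDATE": 20130410, "Actor1Code": "PRYGOV", "Actor2Code": "PRY",
     "EventRootCode": "01", "QuadClass": 1, "ActionGeo_CountryCode": "PA"},
])

# QuadClass 4 = material conflict; CAMEO root code "19" = "Fight".
material_conflict = events[events["QuadClass"] == 4]

# Student (EDU) vs. police (COP) clashes, matched on the actor-code suffix.
student_police_clashes = material_conflict[
    material_conflict["Actor1Code"].str.endswith("EDU")
    & material_conflict["Actor2Code"].str.endswith("COP")
]
print(student_police_clashes)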
"We're already hard at work on a new version that will expand this
dramatically, adding everything from disease to different classes of
political transitions, and things like cyberwarfare," says Leetaru, noting
that new types of events have increased in importance in the decade since
the CAMEO system was developed. Geopolitically important financial events
may also soon be included.
So what can we do with all this? Well, for one thing, it could be an
extremely powerful tool for researchers looking to track political events
over time, and even to predict them. One early paper by Penn State Ph.D.
candidate James Yonamine uses GDELT data to track patterns of violence in
different districts of Afghanistan.
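For a flavor of what that kind of analysis looks like, here is a minimal
Python sketch. It is not Yonamine's actual method, just an assumed workflow
that reuses the codebook-style field names from the toy example above to
count material-conflict events per month in a single country:

import pandas as pd

def monthly_conflict_counts(events: pd.DataFrame, fips_country: str) -> pd.Series:
    """Count material-conflict (QuadClass 4) events per month in one country."""
    conflict = events[(events["QuadClass"] == 4)
                      & (events["ActionGeo_CountryCode"] == fips_country)].copy()
    # SQLDATE is a YYYYMMDD integer in the codebook; bucket it by month.
    conflict["month"] = pd.to_datetime(conflict["SQLDATE"].astype(str),
                                       format="%Y%m%d").dt.to_period("M")
    return conflict.groupby("month").size()

# e.g. monthly_conflict_counts(events, "AF") for Afghanistan ("AF" as the
# FIPS country code is an assumption worth checking against the codebook).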
GDELT could also be used to study political rhetoric, for instance the kind
of statements that politicians make in the run-up to war in order to prime
their citizens for conflict.
Leetaru sees even broader applications for researchers using a branch of
mathematics known as complexity theory (closely related to chaos theory) to
identify global patterns in seemingly random human events. "Most datasets
that measure human society, when you plot them out, don't follow these nice
beautiful curves," he says. They're very noisy because they reflect reality.
So mathematical techniques now let us peer through that to say, what are the
underlying patterns we see in all this."
Of course, for all the high-tech software behind its creation and its
potentially far-out applications, GDELT is, at its core, a way of
summarizing news coverage, and old-fashioned legacy-media news coverage at
that. The sources used to identify events include world news coverage from
Agence France-Presse, the AP, the BBC, the Christian Science Monitor, the
New York Times, UPI, and the Washington Post, as well as a few more
specialized outlets and
Google News. Leetaru notes in his recent paper introducing the project that
the increasing availability of news on the web has led to a "dramatic
increase [of recorded events] since the beginning of the 21st century."
This points to another potential problem: the frequency of recorded events
in a given region may track how closely the international media covers that
region more than how often such events actually occur.
A politically motivated shooting in Syria or the West Bank this month will
probably be recorded. In a rural region of the Congo or Central African
Republic? It's harder to say.
Leetaru says he's looking into supplementing some of the journalistic data
with information from social media-driven projects like Ushahidi. "As
quality journalism is under attack from all sectors, whether that's
government stepping up efforts to squelch it or the collapsing economics of
it, we're starting to look at all the citizen journalism that's out there,"
he says. "One of the reasons we're focusing on mainstream journalism is that
social media is a relatively new phenomenon."
Leetaru notes that he's been cautious about integrating social media data
because of difficulties with quality control and verification. "Journalism's
not perfect either, but at least there's that professional code of ethics,"
he says.
So whether or not data kills theory in the social sciences, someone still
needs to get the information in the first place.