the assumption that most modification operators touch only
a small amount of the total changes, we construct two sub-
queries that return the removed and added changes. By
simply linking the original change table by EXCEPT or UNION
with the two subqueries, we obtain the modified cube.
This modified cube may contain changes that are either
(i) inconsistent or (ii) redundant. Two changes are incon-
sistent, if they both set the value for the same property of
the same entity at the same time and do not agree on that
value. So they violate the uniqueness-constraint mentioned
in Section 2. For now, we rely again on the database to de-
tect those inconsistencies through grouping the changes by
time, id, and property and counting the number of distinct
values. If such inconsistencies are detected, we require the
user to interactively resolve them by one of several options.
Either the distinct values are concatenated, or one of the
values is preferred, i.e., either the newly added changes have
a higher priority or vice versa.
For redundant changes, no user input is required. Two
changes are redundant, if they set the value of the same
property of the same entity at two distinct points in time
with no change of that entity/property-combination in be-
tween. If this is detected, the tool simply removes the second
change from the change-cube. For detection, DBChEx uses
a combination of SQL and backend code. The SQL query
generates a list of possibly redundant change candidates and
the backend code then checks whether they are really redun-
dant.
Performance optimizations. We employ a number of
optimizations to make this approach more efficient: first
the operator sequences are normalized through a number of
rules. For example, if an operator sequence contains multi-
ple consecutive filter operators, the order of those filters does
not matter and they can be arranged in a fixed order. This
normalization is important for the prioritized cache that the
tool uses to avoid reoccurring calculations. Through this
normalization more operator sequences are known to deliver
the same output and can therefore rely on the cache. In
the example above, the result of the subquery that retrieves
the ten most changed entities can easily be cached. For our
test datasets this results in a satisfactory (subsecond) perfor-
mance. Still, for larger datasets a more specialized solution
with custom index structures could become necessary.
Analytical services. The analytical services receive a
SQL query as input, which corresponds to the current oper-
ator sequence. By executing this query on the database, the
analytical service can retrieve those change records, which
form the basis of the analysis. Our analytical services are
currently implemented in Scala and Spark, but of course
other languages and frameworks can be used as well. Once
the analysis is complete, the service writes the results back
to the database and notifies the backend via an HTTP call.
The latter can then inform the user, who can then proceed
with inspecting the results. In addition to clustering, other
potential analytical services could include outlier detection
or an evaluation of individual changes in terms of quality or
trustworthiness.
4. DEMONSTRATION
This section describes two interactive exploration scenar-
ios using DBChEx on two different datasets: IMDB and
Wikipedia infoboxes. For the Wikipedia dataset, the user
is able to explore vandalism and edit wars. The dataset
also contains a large number of genuine data changes and
also schema changes that are to be discovered, such as dis-
tinct infobox templates that are merged or attributes that
are renamed. For IMDB, the user can also observe schema-
changes, but much less frequently. IMDB is an example that
has attributes of highly different volatility. For instance, the
number of votes on a movie (numVotes) have a high volatil-
ity in contrast to its primaryTitle. In the future we plan
to make DBChEx an open-source tool and also provide our
datasets in an accessible way.
4.1 Exploring Wikipedia infobox changes
The DBChEx project homepage provides a short overview
as well as two demonstration videos on the Wikipedia data-
set of changes to settlement infoboxes
4
. In the first video,
Alice focuses through clicking on the highly-volatile entity
Chicago and thereby slicing. By inspecting the value do-
main, she detects that many changes contain the former
mayor of Chicago Richard M. Daley in the value domain.
As she had only expected one such change after his election
in 2011, but instead there are 176 such changes. Through
clicking on his name, she applies a filter to see only those
changes and realizes that only the property leader name of
Chicago was changed to that value. She continues to inspect
all changes to that property and finds – besides vandalism – a
high disagreement among users on whether the leader name
should be updated after the mayoral election or after the
inauguration.
Alice shares the URL of her findings to Bob, so he di-
rectly sees all the change-cubes Alice saw in her last step.
He gets curious and wants to find out what other changes re-
late to Chicago. In the second video, Bob follows the traces
of Chicago again, but unlike Alice he focuses on changes
that contain Chicago in the value dimension. Here he finds
a lot of changes on the same day that update the sub-
division name3 of various locations in Chicago. At this
point the following feature of DBChEx can help Bob: Once
the user has found an interesting change, the tool provides
a dataset-specific link back to the original source of the
change. For Wikipedia, the tool provides context informa-
tion (user, comment) and a link to the diff-page of the re-
vision, while for other datasets the tool could for example
show the relevant SQL INSERT/UPDATE statement. This fea-
ture greatly helps to understand the intentions of a certain
change. In this case, some further investigation reveals that
on that day two infobox templates (community area and
settlements) were merged.
4.2 Exploring IMDB changes
Figure 5 gives a short overview of a small exploration sce-
nario on IMDB, for which we have gathered 47 daily snap-
shots. Assume that Alice first performs a split by property,
which results in one change-cube per property as shown in
Figure 5a. For each of the change-cubes, the tool displays a
number of statistics, for example the distribution of changes
over time. A large spike of changes for the properties episo-
deNumber and seasonNumber on 2018-02-02 catches Alice’s
eye. For further inspection, she prunes all other change-
cubes and keeps only those two change-cubes for both prop-
erties. Furthermore, by clicking on that timestamp, she
4
https://hpi.de/naumann/projects/
data-profiling-and-analytics/dbchex.html