Analysing and strengthening OpenWPM’s reliability
Benjamin Krumnow
TH Köln,
Open University Netherlands
Hugo Jonker
Open University Netherlands,
Radboud University
Stefan Karsch
TH Köln
Abstract
Automated browsers are widely used to study the web at scale.
Their premise is that they measure what regular browsers
would encounter on the web. In practice, deviations due to
detection of automation have been found. To what extent
automated browsers can be improved to reduce such devia-
tions has so far not been investigated in detail. In this paper,
we investigate this for a specific web automation framework:
OpenWPM, a popular research framework specifically de-
signed to study web privacy. We analyse (1) detectability of
OpenWPM, (2) prevalence of OpenWPM detection, and (3)
integrity of OpenWPM’s data recording.
Our analysis reveals OpenWPM is easily detectable. We
measure to what extent fingerprint-based detection is already
leveraged against OpenWPM clients on 100,000 sites and
observe that it is commonly detected (14% of front pages).
Moreover, we discover integrated routines in scripts to specif-
ically detect OpenWPM clients. Our investigation of Open-
WPM’s data recording integrity identifies novel evasion tech-
niques and previously unknown attacks against OpenWPM’s
instrumentation. We investigate and develop mitigations to
address the identified issues. In conclusion, we find that re-
liability of automation frameworks should not be taken for
granted. Identifiability of such frameworks should be studied,
and mitigations deployed, to improve reliability.
1 Introduction
Web studies rely on browser automation frameworks to ac-
crue data over thousands of sites. The goal of such stud-
ies is to provide a view on what regular visitors would en-
counter on the web. This relies on an (often unstated) as-
sumption that the data as collected is representative of what
a regular, human-controlled browser would encounter. Previ-
ous works [5, 14, 39,41] have shown that this is not always
the case: websites have been found to omit content (adver-
tisements, video, JavaScript execution, login forms, etc.) or
require completion of a CAPTCHA for automated clients.
While detectability of automated browsers has been discussed
online in various blogs [3, 64, 79] and discussion forums, to
the best of our knowledge, so far, no in-depth academic study
on the reliability of measurement frameworks built upon such
components has been performed.
In this work, we study the OpenWPM framework [27] for
measuring web privacy. To date, OpenWPM has been used
in at least 76 studies, 60 of which resulted in peer-reviewed
publications. As such, its fidelity, that is, the extent to which
the web OpenWPM encounters is the web as seen by other
web clients, is essential. This point is not lost on its users:
several studies explicitly mention bot detection as a possible
threat to the validity of their measurements [12, 46, 73].
Recent studies investigated proliferation of generic bot de-
tection [39,40] and found over 10% of websites employing
such techniques. However, it is not clear how these findings
translate to OpenWPM. On the one hand, OpenWPM uses a
normal browser for collecting data, making it harder to dis-
tinguish from other visitors. On the other, OpenWPM targets
security and privacy research, an area where malicious actors
are to be expected [22,71]. It is not clear on how many sites
OpenWPM could be detected, nor is it clear what effect such
detection may have.
In this paper, we address this point. That is, we investigate
to what extent OpenWPM provides a reliable record of how
websites behave towards any web client. First and foremost,
this necessitates understanding how OpenWPM can be distin-
guished from other clients. Building on that knowledge, we
provide a first estimate of how many websites are able to dis-
tinguish OpenWPM from human visitors, finding that there
are even several websites that can distinguish OpenWPM
from other web bots. This allows these sites to use cloaking:
responding differently to different clients. Secondly, we inves-
tigate whether a website can actively attack OpenWPM’s data
collecting functionality. We find several new ways in which
a malicious website can attack OpenWPM’s data recording.
While both cloaking and data recording attacks are possible,
arXiv:2205.08890v1 [cs.CR] 18 May 2022
Figure 1: Components of the OpenWPM framework
the question remains whether OpenWPM-detecting sites em-
ploy such tactics. Recent work by Cassel et al. [14] finds
that Selenium-based bots receive far less third-party traffic,
which indicates cloaking happens in practice. We develop a
stealth extension for OpenWPM to prevent cloaking and test
its performance on sites where OpenWPM-detectors were
found.
Contributions. Our main contributions are:
(Sec. 4) We provide the first analysis of OpenWPM’s
detectability based on both conventional fingerprinting [39]
and template attack [63] techniques. We find previously
unreported identifiable properties for every mode of
running OpenWPM (headless, Xvfb, etc.), which even allow
distinguishing between these modes.
(Sec. 5) We look for bot detectors in the Tranco Top
100K sites that probe these properties via both static and
dynamic analysis. We find a drastic increase of Selenium-
based bot detection. In addition, we find detectors in the
wild specifically targeting OpenWPM clients.
(Sec. 6) We explore how sites can attack OpenWPM’s
data collection. We find various attack vectors target-
ing OpenWPM’s most commonly used instruments and
implement proof-of-concept attacks for these.
(Sec. 7) We harden OpenWPM against poisoning attacks
and detection. Our hardening hides all identifiable
properties when run in native mode and addresses the
identified attacks against OpenWPM’s instrumentation.
We evaluate its performance against vanilla OpenWPM.
The number of cookies received is severely impacted.
Conversely, ads/tracker traffic is hardly impacted.
2 Background
OpenWPM.
OpenWPM is a valuable tool for web re-
searchers, as it offers increased stability, fidelity and easy
access to measurement functionality on top of a browser au-
tomation framework (Selenium + WebDriver). The frame-
work can be run under either Ubuntu or macOS. It consists of
four parts (cf. Figure 1): a web client, automation components,
instrumentation for measurements, and a framework. As a
web client, OpenWPM uses an unbranded Firefox browser.
In contrast to a regular Firefox browser, this allows running
unsigned browser extensions. The various measurement in-
struments are implemented as browser extensions. They fa-
cilitate recording various website aspects, such as JavaScript
calls or HTTP traffic. The last part is the framework, which
acts as the conductor. Its purpose is to control browsers and
data collection. It also adds much-needed functionality, such
as monitoring for browser crashes and liveliness, restoring
after failures, loading input data, etc.
Use of OpenWPM in previous studies.
To understand
how OpenWPM is being used, we review the different stud-
ies performed to date with OpenWPM. In December 2021,
76 works, of which 57 peer-reviewed, were listed (footnote 1) as using
OpenWPM. We further add two recent studies that had not yet
been listed. For each study, we check the following: what is
measured, whether subpages are visited, whether interaction
is used, and what run mode is used. Table 1 summarises our
findings. Appendix D breaks down our findings for each study
individually.
The measures category tallies how many studies used
OpenWPM’s various measurement instruments: HTTP traffic,
cookies, and JavaScript. Each of these measures may be im-
pacted individually due to bot detection. Interestingly, while
most studies use OpenWPM to record HTTP traffic, a few
(e.g. [18, 25, 48, 68]) have used it for automation instead of as a
measurement tool. These are tallied under ‘other’ in Table 1.
The other categories pertain to aspects that may impact de-
tectability. In each case, it is currently not known whether
these play a role in bot detection. With respect to the interac-
tion category, we note that no study mentioned implementing
interaction mechanisms. Therefore, we assume all studies
used OpenWPM’s default interaction functionality.
With respect to the run mode category, note that not all
studies provide information about this. Nevertheless, the used
run mode may impact detectability (e.g. [35]) and thus should
be considered. We therefore consider all currently supported
modes:
a. unspecified: study does not specify the mode,
b. regular: study uses a full Firefox browser,
c. headless: study uses Firefox without a GUI,
d. Xvfb: as regular, with visual output redirected to a buffer,
e. Docker: study runs OpenWPM within a Docker container,
f. virtualisation: study uses virtual machines, possibly in
   cloud infrastructure.
Lastly, we track whether the studies considered bot detec-
tion at all and, if so, whether they used OpenWPM’s built-in
anti-detection features. Aside from studies investigating bot
Footnote 1: https://webtap.princeton.edu/software/
detection directly, only very few consider fingerprinting [73]
or cloaking [12,46] as a potential risk for valid results.
Table 1: Measurement characteristics in 60 peer-reviewed
studies that are built upon OpenWPM

Category              Studies
Measures
– HTTP traffic        49
– cookies             30
– JavaScript          17
– other                5
Run mode
– unspecified         49
– virtualisation      14
– headless             5
– regular mode         3
– Docker               2
– Xvfb                 2
Interaction
– no interaction      48
– clicking             9
– scrolling            7
– typing               4
Subpages
– not visited         45
– visited             15
Bot detection
– ignored             46
– discussed           14
  uses mitigation      7
3 Related Work
Determining the fingerprint surface of web bots.
Browser fingerprinting [24] has been studied extensively in
the context of user tracking, as recently summarised in [44].
The idea of using fingerprinting to identify certain client com-
ponents (such as automation frameworks) has gained more
attention recently. Vastel [79] and Shekyan [64] conducted
manual investigations of headless browsers to pin down identifiable
properties of these frameworks. Jonker et al. [39] automated
the search for identifiable properties by using a browser
fingerprinting library. They compared properties of regular
browsers against properties of bots that belong to the same
engine class. In contrast, Schwarz et al. [63] applied a new
form of fingerprinting (JavaScript template attacks) to perform
client-side vulnerability scanning. To create a template,
they traverse the object hierarchy and store characteristics
of each object. Later on, templates can be compared to determine
the differences. Finally, Vastel et al. [80] inspected bot
detectors in the wild to collect known identifiable properties.
They used these to systematically test the responses by bot
detectors to their changes.
Our work combines both automated approaches, by
Jonker et al. and Schwarz et al., to explore the fingerprint
surface of OpenWPM. We apply these systematically to the
various run modes of OpenWPM clients, uncovering distin-
guishers for each mode.
Measuring bot detection in the wild.
Two studies exist
that carried out a large-scale investigation of the existence of
unknown fingerprint-based bot detectors. Jonker et al. [39]
scanned 1M websites gathering statically included scripts
and analysing these using static code analysis. Shortly after,
Jueckstock and Kapravelos [40] presented a similar experi-
ment using dynamic script collection and dynamic analysis.
Their presented tool relies on a modified V8 engine to instru-
ment browser functions.
Reliability of scraping results.
Recently, multiple studies
have been conducted that explore differences between various
automated clients, and also between automated clients and
human-driven clients in website responses. Ahmad et al. [5]
investigate response differences between three classes of bots
(HTTP engine tools, headless browsers, and automated native
browsers). They found that while HTTP engine tools miss
many important resources, they more often pass bot detection
than the other two classes. Jueckstock et al. [41] studied dif-
ferences between headless Chrome and regular Chrome. For
regular Chrome, they used a puppeteer-plugin which hides dis-
tinguishable properties in Chrome to focus on bot detection.
Their results reinforce previous recommendations [27, 79]
to not use headless browsers. Zeber et al. [82] contrast data
from human users with OpenWPM clients. In their study,
OpenWPM clients encountered three times more tracking
domains and had more interaction with third-party domains
than human-controlled browsers. Cassel et al. [14] investigate
the reliability of emulated browsers. To avoid bot detection,
they created their own tooling to remotely control a browser.
Interestingly, their observations show the opposite of Zeber et
al.’s findings. They observed 84% less third-party traffic for a
Selenium-driven vs. a non-Selenium-driven Firefox browser.
This contradiction shows that there is yet no consistent picture
for the influence of bot detection on measurements. Further
investigation to resolve this conundrum is needed. Any such
investigation necessitates tooling that can evade bot detection.
We aim to develop such a tool for OpenWPM.
4 Fingerprint surface of OpenWPM
We begin by addressing the research question: how can OpenWPM
be distinguished from human-controlled web clients?
In general, a web site operator looking to identify OpenWPM
clients can either probe for identifiable properties (i.e., fin-
gerprinting), or attempt to recognise OpenWPM’s interaction.
The latter is due to Selenium, whose interaction was studied
in detail by Goßen et al. [34]. Those results fully carry over to
OpenWPM. This leaves uncertainty about how OpenWPM’s
fingerprint distinguishes it from other clients and other bots.
In line with previous works, we call the part of a browser
fingerprint that distinguishes a certain type of client from other
types the fingerprint surface [72]. Determining the fingerprint
surface of an OpenWPM client requires a way to find its prop-
erties that deviate from properties and values in other clients.
Jonker et al. [39] showed that it suffices to consider differences
Table 2: Summary of deviating properties of each OpenWPM
setup contrasted with OpenWPM’s Firefox version

                              macOS           Ubuntu                  Docker
Property                      RM      HM      RM      HM      Xvfb    RM
navigator.webdriver is true   X       X       X       X       X       X
screen dimension prop.        X       X       X       X       X       X
screen position prop.         X       X       X       X       X       X
font enumeration                                                      X
timezone is 0                                                         X
navigator.languages prop.             43              43
deviating WebGL prop.                 2037            2061    18      27
With instrumentation:
- through tampering           +253    +253    +252    +252    +252    +252
- added custom functions      +1      +1      +1      +1      +1      +1

RM: Regular mode; HM: Headless mode; Xvfb: X virtual frame buffer mode.
within the client’s ‘browser family’, that is, fingerprint differ-
ences with those clients who use the same rendering engine
and JavaScript engine. By comparing the results for multiple
clients of the same browser family, differences unique to each
client are brought to light. In previous works, two approaches
for browser fingerprinting were used: probing a specific list
of properties [39], or using an automated approach for DOM
traversal [63]. While there is overlap between the results of
these methods, neither result set is a superset of the other.
We combine the results of both approaches to determine the
fingerprint surface.
4.1 RQ1: How recognisable is OpenWPM?
We determine OpenWPM’s fingerprint surface by compar-
ing its client to a standalone version of the same Firefox
browser. Any differences must originate in the hosting envi-
ronment, the framework itself, the base implementation, the
added automation, or measurement components. To account
for possible effects of the various run modes of OpenWPM
on the fingerprint surface, we determine variations for each
setup on Ubuntu and macOS. Table 2 summarises identify-
ing properties found in the current version for each mode. In
addition to ways to recognise OpenWPM’s instrumentation,
we also identify ways to recognise Selenium and WebDriver,
display-less scraping (headless or Xvfb mode), and use from
within a virtual environment. Thus, every mode of running
OpenWPM is identifiable as a web bot.
Recognising automation components.
We found three
identification measures that may be applied against all modes
of OpenWPM. First, the navigator.webdriver property indicates
a browser controlled via the WebDriver interface (footnote 2).
Second, screen properties use standard values and cannot be
changed from OpenWPM (see Table 3). For example, the
Footnote 2: https://www.w3.org/TR/webdriver2/#example-1
browser window position for macOS is 4px from the top and
23px from the left. On macOS, all browser instances will use
the same absolute coordinates, while on Ubuntu, each window
is shifted by the same offset, when using regular mode.
Table 3: Screen properties for various configurations

OS      Mode      Resolution    Window      X   Y   Offset (x, y)
macOS   Regular   2560 x 1440   1366 x 683  23  4   0, 0
        Headless  1366 x 768    1366 x 683  4   4   0, 0
Ubuntu  Regular   2560 x 1440   1366 x 683  80  35  8, 8
        Headless  1366 x 768    1366 x 683  0   0   0, 0
        Xvfb      1366 x 768    1366 x 683  0   0   0, 0
        Docker    2560 x 1440   1366 x 683  0   0   0, 0
Identification of missing displays.
Suppressing output to
display (by using Xvfb, headless, or Docker) adds a signifi-
cant number of differences. In headless mode, the lack of a
WebGL implementation leads to thousands of missing prop-
erties. We also observe that this mode adds 43 new properties
to the navigator.language object. Xvfb mode uses a regular
Firefox browser, which contains WebGL functionality.
Nevertheless, Xvfb mode causes 5 changed and 13 missing
properties. Interestingly, both headless and Xvfb mode allow
the detection of missing user elements by accessing the prop-
erty screen.availTop. This describes the first y-coordinate that
does not belong to the user interface (footnote 3). In display-less modes,
this is always zero, while regular browsers have larger values.
Traces of virtualisation.
Using OpenWPM’s Docker container causes the WebGL vendor
property to contain the term VMware, Inc. (cf. Table 4), which is
clear evidence for the use of virtualisation. In addition, the
Docker environment reduces the number of available JavaScript
fonts to one (Bitstream Vera Sans Mono) and does not provide
information about the time zone.
Table 4: Selected deviations in display-less modes on Ubuntu

Mode    WebGL vendors                               avail{Top|Left}
RM      AMD AMD TAHITI                              27, 72
HM      Null                                        0, 0
Xvfb    Mesa/X.org llvmpipe (LLVM 12.0.0, ...)      0, 0
Docker  VMware, Inc. llvmpipe (LLVM 10.0.0, ...)    27, 72
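The three kinds of signals described above (the WebDriver flag, a missing display, and traces of virtualisation) could be probed along the lines of the following minimal sketch. The function name collectAutomationHints and the shape of its env argument are assumptions made for illustration. In a browser, env would be built from navigator, screen and a WebGL vendor string that the caller has read separately (not shown). The checks only mirror the values reported in Tables 3 and 4 and do not form a complete detector.

```javascript
// Hypothetical helper: collects hints that a client may be an automated,
// display-less or virtualised browser. `env` mirrors the browser globals so
// that the sketch also runs outside a browser.
function collectAutomationHints(env) {
  const hints = [];

  // WebDriver-controlled browsers expose navigator.webdriver === true.
  if (env.navigator && env.navigator.webdriver === true) {
    hints.push("webdriver");
  }

  // In headless and Xvfb mode, screen.availTop was observed to be 0.
  // Assumption: a regular desktop browser reports a larger value.
  if (env.screen && env.screen.availTop === 0) {
    hints.push("no-display");
  }

  // The Docker setup reported a WebGL vendor containing "VMware, Inc.".
  if (typeof env.webglVendor === "string" && env.webglVendor.includes("VMware")) {
    hints.push("virtualised");
  }

  // Headless mode lacks a WebGL implementation (vendor reported as null).
  if (env.webglVendor === null) {
    hints.push("no-webgl");
  }

  return hints;
}
```

In a page context this might be called as collectAutomationHints({ navigator, screen, webglVendor }); an empty result means that none of these particular hints fired, not that the client is human-controlled.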
Detecting instrumentation.
We checked if using any of
OpenWPM’s various instruments has any effect on its fin-
gerprint surface. The only differences occur when using the
JavaScript instrument. First, this instrument overwrites cer-
tain of the browser’s standard JavaScript objects, which can
be detected by using the toString function of a function
Footnote 3: https://developer.mozilla.org/en-US/docs/Web/API/Screen/availTop
Figure 2: Properties in (A) an original object and (B) an object
polluted by the instrumentation.
or object (see Listing 1). Another identifying aspect of this
instrument is the presence of a function in the window object
(window.getInstrumentJS), which is not present in any
common desktop browser (Firefox, Safari, Chrome, Edge,
Opera). Third, OpenWPM’s wrapper functions can be found
in stack traces. For that, a script needs to provoke an error
in any overwritten function and catch the stack trace to suc-
cessfully identify a modification by OpenWPM. Lastly, the
instrument ‘pollutes’ prototypes along the prototype chain of
an object. Instrumenting is done by changing the prototype of
an object, as well as all its ancestor prototypes. However, the
properties of later ancestor prototypes are all added to the first
ancestor prototype (cf., Fig. 2). This distinguishes a visitor
with instrumentation from one without.
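Three of the instrumentation checks described above (the toString output, the added global function, and wrapper names in stack traces) can be illustrated with the following minimal sketch. The helper names looksNative, hasInstrumentGlobal and stackMentions are hypothetical, and the wrapper name passed to stackMentions is a placeholder: the actual names of OpenWPM's wrapper functions are not stated here.

```javascript
// Returns true if the source text of `fn` looks like that of a built-in.
function looksNative(fn) {
  const source = Function.prototype.toString.call(fn);
  return /\{\s*\[native code\]\s*\}\s*$/.test(source);
}

// Checks for the global function that the JavaScript instrument adds.
// `win` stands in for the window object so the sketch runs anywhere.
function hasInstrumentGlobal(win) {
  return typeof win.getInstrumentJS === "function";
}

// Invokes `call`, which is expected to throw, and reports whether the
// resulting stack trace mentions `wrapperName`. Error.stack is non-standard
// but available in the common engines.
function stackMentions(call, wrapperName) {
  try {
    call();
  } catch (err) {
    return typeof err.stack === "string" && err.stack.includes(wrapperName);
  }
  return false;
}
```

A page script could, for instance, apply looksNative to a function such as the getContext method from Listing 1 and treat a false result as a sign of tampering.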
Evaluation.
We validate whether the identified fingerprint
surface works in practice to identify OpenWPM. For that,
we implemented an OpenWPM detector that uses four tests
to identify OpenWPM amongst web clients: (1) test for the
presence of a DOM property, (2) test for a missing DOM
property, (3) test if a native function was overwritten, (4)
compare a DOM property with an expected value.
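The four kinds of tests listed above lend themselves to a rule-driven design. The following is a minimal sketch under stated assumptions, not the detector used in the evaluation: the rule format, the function names, and the concrete rules in EXAMPLE_RULES are hypothetical (in particular, window.WebGLRenderingContext is only a placeholder for "some property that is missing in a given mode").

```javascript
// Resolves a dotted path such as "navigator.webdriver" against `root`.
function resolvePath(root, path) {
  return path
    .split(".")
    .reduce((obj, key) => (obj == null ? undefined : obj[key]), root);
}

// Evaluates one rule; returns true if the rule indicates OpenWPM.
function runRule(root, rule) {
  const value = resolvePath(root, rule.path);
  switch (rule.kind) {
    case "present": // (1) a property exists that should not
      return value !== undefined;
    case "missing": // (2) a property is absent that should exist
      return value === undefined;
    case "overwritten": // (3) a native function was replaced
      return (
        typeof value === "function" &&
        !/\[native code\]/.test(Function.prototype.toString.call(value))
      );
    case "equals": // (4) a property has a tell-tale value
      return value === rule.expected;
    default:
      throw new Error("unknown rule kind: " + rule.kind);
  }
}

// Returns the paths of all rules that matched.
function matchedRules(root, rules) {
  return rules.filter((rule) => runRule(root, rule)).map((rule) => rule.path);
}

// Illustrative rules only; paths are relative to a root object that holds
// window, navigator, screen and a canvas element.
const EXAMPLE_RULES = [
  { kind: "present", path: "window.getInstrumentJS" },
  { kind: "missing", path: "window.WebGLRenderingContext" },
  { kind: "overwritten", path: "canvas.getContext" },
  { kind: "equals", path: "navigator.webdriver", expected: true },
  { kind: "equals", path: "screen.availTop", expected: 0 },
];
```

Keeping the rules as data makes it straightforward to test each distinguishing property from Table 2 separately.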
We tested the detector by setting up four machines, 2 Mac-
intoshes and 2 PCs with Ubuntu. On each machine, we used
OpenWPM and common browsers (Chrome, Safari, Opera
and Firefox). We tested each distinguishing property from
Table 2. Our detector site was able to correctly identify OpenWPM
every single time. Almost all properties uniquely identify
OpenWPM, except for a few WebGL- and screen-related
properties. For a few WebGL properties (roughly 200 of
4K), we found that these also occur on some non-OpenWPM
clients. Ignoring all such properties still leaves a large number
of identifying properties.

window.canvas.getContext.toString();
// output of .toString when not instrumented
"function getContext() {
    [native code]
}"
// output of .toString when instrumented
"function () {
    const callContext =
        getOriginatingScriptContext(!!logSettings.logCallStack);
    logCall(objectName + "." + methodName, arguments, callContext, logSettings);
    return func.apply(this, arguments);
}"
Listing 1: Detectability of OpenWPM’s JavaScript
instrumentation
4.2 RQ2: How stable is the fingerprint sur-
face?
We explored how stable our determined fingerprint surface
is, as new Firefox and OpenWPM versions may appear fre-
quently. To that end, we repeated our experiments for older
versions of OpenWPM (0.11.0 and 0.10.0). In general, we
found that the fingerprint surfaces largely overlap. For example,
on macOS, the number of WebGL deviations in headless
mode increases from 2022 in OpenWPM 0.11.0 to 2037 in
OpenWPM 0.17.0. In the oldest OpenWPM version (0.10.0),
we find that the JavaScript instrument adds two properties
instead of one to the window object (jsInstruments and
instrumentFingerprintingApis). In addition, we also
investigated whether using an unbranded browser (as OpenWPM
does) impacts OpenWPM’s fingerprint. We found no
differences between branded and unbranded Firefox versions.
Using outdated browsers, however, does impact the finger-
print. For example, Google’s reCAPTCHA service assigns a
higher risk to older browser variants [65]. In the past, Open-
WPM’s integrated Firefox version has been behind the official
release of Firefox several times (cf. Table 11 in Appendix B).
We found that OpenWPM used an outdated Firefox browser
71% of the last 20 months. In short, this distinction vector
should be expected when using OpenWPM.
5 Incidence of OpenWPM detection
To assess the extent of OpenWPM detection in the wild, we
conduct a large-scale measurement for client-side bot detec-
tion. In detail, we focus on scripts with capabilities to detect
OpenWPM, i.e. scripts with routines to access properties
unique for Selenium-based bots and/or OpenWPM. We find
both general Selenium detectors and OpenWPM-specific de-
tectors.
5.1 Data acquisition and classification
Methodology.
Previous automated approaches [39,40] to
identify bot detectors have either relied on static or dynamic
analysis. The idea behind static analysis is to identify code
patterns in source code that link to known bot detectors or
that use specific bot-related properties. A limitation is that
scripts may create code dynamically, which will be missed
out by static analysis. Moreover, minification and obfusca-
tion further increase the false negative rate of static analysis.
The alternative approach, dynamic analysis, is to monitor
JavaScript calls that identify a script as bot detector based
on access to bot-related properties. Dynamic analysis does
cover dynamically-generated scripts. Moreover, it does not
monitor the code itself, but only executed calls. An upside
of this is that neither minification nor obfuscation affects the
analysis. On the other hand, code that happens not to be exe-
cuted during the run, is not analysed. Both static and dynamic
analysis have been able to identify some bot detectors in the
wild. It is not clear whether and to what extent the results of
the methods differ in practice for finding web bot detectors.
We combine both methods to increase coverage.
Setup.
In order to assess the extent of client-side bot detec-
tion, we scan the top 100K websites of the Tranco list [45] (footnote 4).
We set up an instance of OpenWPM running Firefox in regu-
lar mode. During a site visit, our OpenWPM client stores a
copy of any transmitted JavaScript file and records JavaScript
calls. We add an initial waiting time of 45 seconds after a
completed page load to give websites enough time to perform
JavaScript operations. In addition, we instruct our client to
measure the presence of bot detection on subpages by opening a
maximum of three URLs extracted from a site’s landing page.
For selecting subpages, we consider only URLs linking to
the same domain. Within this and the following sections, we
use the eTLD+1 scheme to identify a domain. To account for
websites that use same-origin requests to redirect clients
to foreign domains, our client checks if a foreign domain was
entered after following all redirects.
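The subpage selection described above can be sketched as follows. This is a minimal illustration under loud assumptions: how the three URLs are chosen is not specified in the text, so the sketch simply takes the first three distinct same-domain links in document order, and naiveSite is only a stand-in for a proper eTLD+1 computation (it keeps the last two host labels and is therefore wrong for suffixes such as co.uk; a real crawler would consult the Public Suffix List). Both function names are hypothetical.

```javascript
// Naive stand-in for eTLD+1: keeps the last two labels of the host name.
// Assumption for illustration only; incorrect for multi-label suffixes.
function naiveSite(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

// Picks up to `limit` distinct same-site http(s) URLs from `hrefs`,
// resolving relative links against the landing page URL.
function pickSubpages(landingUrl, hrefs, limit = 3) {
  const landing = new URL(landingUrl);
  const site = naiveSite(landing.hostname);
  const seen = new Set([landing.href]);
  const picked = [];

  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, landingUrl);
    } catch {
      continue; // skip links that cannot be parsed
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // fragments do not lead to a different page
    if (naiveSite(url.hostname) !== site || seen.has(url.href)) continue;
    seen.add(url.href);
    picked.push(url.href);
    if (picked.length === limit) break;
  }
  return picked;
}
```

The redirect check mentioned above would happen after loading each picked URL, by comparing the site of the final URL with that of the landing page.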
Scripts should be classified as bot detectors if they access
the fingerprint surface of OpenWPM. However, certain
scripts may access these attributes for other purposes, such as
checking supported WebGL functionality. To reduce such
false positives, we only classify a script as bot-detecting
when it accesses properties that pertain to browser automation
or are unique to OpenWPM (cf. Sec. 4.1). This leaves only
the following: navigator.webdriver, which is specific to
WebDriver-controlled bots; and the new identifying properties
introduced by OpenWPM’s JavaScript instrumentation:
getInstrumentJS, instrumentFingerprintingApis, and
jsInstruments. Table 5 shows the results of the data
collection and classification.
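As an illustration of this classification rule, the following minimal sketch labels scripts based on recorded property accesses. The record format ({ script, property }), the "window." prefix on the OpenWPM-specific names, and the function name labelScripts are assumptions; OpenWPM's actual log schema is not reproduced here.

```javascript
// Properties that indicate a WebDriver-controlled (Selenium-based) bot.
const SELENIUM_PROPS = new Set(["navigator.webdriver"]);

// Properties added by OpenWPM's JavaScript instrumentation.
const OPENWPM_PROPS = new Set([
  "window.getInstrumentJS",
  "window.instrumentFingerprintingApis",
  "window.jsInstruments",
]);

// Maps each script URL to the kinds of detector-related access it showed.
// Scripts that touch none of the listed properties are left out.
function labelScripts(accessRecords) {
  const labels = new Map();
  for (const { script, property } of accessRecords) {
    const label = labels.get(script) || { selenium: false, openwpm: false };
    if (SELENIUM_PROPS.has(property)) label.selenium = true;
    if (OPENWPM_PROPS.has(property)) label.openwpm = true;
    if (label.selenium || label.openwpm) labels.set(script, label);
  }
  return labels;
}
```

Accesses to other parts of the fingerprint surface, such as WebGL properties, are deliberately ignored, which matches the aim of reducing false positives.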
Limitations.
Inherent in the above approach are several as-
sumptions that can impact the results. First, our approach
relies on the fingerprint surface we established. Detectors
based on other methods (e.g., mouse tracking [16]) will be
missed. Second, we do not account for cross-site tracking. A
third-party tracker could classify our client as a bot on one
Footnote 4: https://tranco-list.eu/list/WV79
site and would need only to re-identify the client on another
site, e.g., using IP filtering or regular browser fingerprinting.
This amounts to a form of website cloaking – serving different
content to specific clients. To what extent third party tracking
in general employs cloaking is a different study and left to
future work. Both these limitations may cause underestima-
tion of the number of detectors (false negatives). As such,
our approach approximates a lower bound on the number of
detectors in the wild.
Preprocessing for static analysis.
Within the static analy-
sis, we pre-process scripts to undo straightforward obfusca-
tion. We derive the respective encoding, transform hex literals
to ASCII characters, and remove code comments. We apply
our static analysis to scripts that we collected during our scan
of the Tranco Top 100K, which resulted in 1,535,306 unique
scripts. To identify Selenium-detector scripts, we then use
patterns to look for access to navigator.webdriver (more
details can be found in Appendix C).
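The preprocessing and pattern matching could look roughly like the following minimal sketch. The regular expressions are assumptions chosen for illustration and are not the patterns from Appendix C; the comment stripping is deliberately naive (it does not track string or regular-expression literals), and the encoding detection step is omitted.

```javascript
// Undoes straightforward obfuscation: strips comments and decodes \xNN
// escapes that stand for printable ASCII characters.
function preprocess(source) {
  return source
    // Remove block comments.
    .replace(/\/\*[\s\S]*?\*\//g, "")
    // Remove line comments, but keep "//" that follows ":" (as in URLs).
    .replace(/(^|[^:\\])\/\/.*$/gm, "$1")
    // Decode hex escapes such as \x77 into their ASCII characters.
    .replace(/\\x([0-9a-fA-F]{2})/g, (match, hex) => {
      const code = parseInt(hex, 16);
      return code >= 0x20 && code <= 0x7e ? String.fromCharCode(code) : match;
    });
}

// Hypothetical patterns for dot and bracket access to navigator.webdriver.
const WEBDRIVER_PATTERNS = [
  /navigator\s*\.\s*webdriver\b/,
  /navigator\s*\[\s*["']webdriver["']\s*\]/,
];

// Returns true if the preprocessed source matches any pattern.
function accessesWebdriver(source) {
  const clean = preprocess(source);
  return WEBDRIVER_PATTERNS.some((pattern) => pattern.test(clean));
}
```

Scripts that assemble the property name at run time (for example by string concatenation) would still be missed, which is one reason to complement this with dynamic analysis.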
Using honey properties to catch iterators.
For the dy-
namic analysis, every recorded access to the fingerprint sur-
face identifies a script with the potential to detect OpenWPM
as a bot. This will also be triggered by scripts that iterate
over all properties, e.g., for regular browser fingerprinting
(re-identification). Determining the purpose of such iteration
requires per-script manual inspection and goes beyond dy-
namic analysis.
To determine whether property iteration takes place, we
extend our client’s navigator and window objects with ‘honey’
properties. These honey properties are added on the fly and
use random strings as names. Hence, only a script using
property iteration would access all honey properties. We assign
scripts that use property iteration to one of two categories, based
on access to the navigator.webdriver property: definitely
detecting bots, or inconclusive. Iterator scripts are classified
as inconclusive if they do not access navigator.webdriver,
as all accesses to the fingerprint surface could be due to
property iteration. Scripts that iterate the navigator object will
naturally access the webdriver property. To check whether
this access is only by iteration or intentional, we distinguish
between scripts that trigger our static analysis and those that
do not. Only scripts that do not surface in the static analysis
are classified as inconclusive.
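The honey-property idea can be sketched as follows. This is a simplified illustration, not the instrumentation used in the measurement: accesses are recorded in a single log rather than per script, the category labels are made up, and the names addHoneyProperties and classifyScript are hypothetical.

```javascript
// Adds `count` enumerable getter properties with random names to `target`.
// Every read is recorded in `accessLog` (a Set); returns the chosen names.
function addHoneyProperties(target, count, accessLog) {
  const names = [];
  for (let i = 0; i < count; i++) {
    const name = "honey_" + Math.random().toString(36).slice(2, 10) + "_" + i;
    Object.defineProperty(target, name, {
      enumerable: true,
      configurable: true,
      get() {
        accessLog.add(name);
        return undefined;
      },
    });
    names.push(name);
  }
  return names;
}

// Classifies one script from what was observed about it.
function classifyScript({ accessLog, honeyNames, accessedWebdriver, flaggedByStaticAnalysis }) {
  // A script counts as an iterator only if it read all honey properties.
  const iterates =
    honeyNames.length > 0 && honeyNames.every((name) => accessLog.has(name));
  if (!iterates) {
    return accessedWebdriver ? "detector" : "none";
  }
  // Iterators count as detectors only if static analysis also flags them.
  return accessedWebdriver && flaggedByStaticAnalysis ? "detector" : "inconclusive";
}
```

Because the names are random and unknown to page scripts, reading every honey property is a strong sign of enumeration rather than targeted probing.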
5.2 RQ3: How often is OpenWPM detected?
Our results show that, when checking both front- and sub-
pages, at least 16.7% of websites in the Tranco Top 100K
execute scripts that accessed properties specific to Selenium
and, thereby, OpenWPM. Moreover, we also find scripts ac-
cessing OpenWPM-specific properties.
Table 5: Number of websites with Selenium detectors

# sites                                   static   dynamic   union
identified                                32,694   19,139    38,264
without false positives / ‘inconclusive’  15,838   16,762    18,714
Table 6: Number of sites ordered by script domains accessing
OpenWPM-specific properties

                               cz    gs   google.com   ad1t
total                          331   14   9            2
jsInstruments                  331   5    2            2
instrumentFingerprintingApis   0     6    4            0
getInstrumentJS                0     3    3            0

cz: cheqzone.com, gs: googlesyndication.com, ad1t: adzouk1tag.com
OpenWPM-specific properties are accessed in the wild.
Most scripts we found recognise OpenWPM by targeting Se-
lenium. A small number of detectors also include specific routines
to detect OpenWPM itself. Overall, 356 sites executed
scripts that accessed OpenWPM-specific properties. These
scripts were all included via third-party domains, belonging
to four distinct providers. Table 6 summarises these detectors
and their detection method. Detectors on cheqzone.com were
found by both static and dynamic analysis; detectors on the
other three domains used some form of minification, obfus-
cation, and/or dynamic loading, and were only found by dy-
namic analysis. We investigated the four hosting domains by
consulting whois records, EasyList (footnote 5), and the WhoTracksMe
database [17]. All domains are related to the advertising industry.
The domain cheqzone.com belongs to CHEQ, a company
fighting ad fraud. The scripts hosted by Google domains are
included through Google’s reCAPTCHA service. While we
could not clarify the origin of adzouk1tag.com, we found
this domain listed in the EasyList for ad domains.
14% of sites have bot detection on the front page.
Figure 4 depicts the distribution for detectors active on the
front page of websites for static and dynamic analysis. Dy-
namic analysis without considering property iteration iden-
tifies 12,208 sites with detectors on the front page. Static
analysis measures the number of sites where bot detection
could be triggered (11,897), including those where detection
is present but not (yet) executed, e.g., where detection is only
triggered after hovering over certain elements. While both
static and dynamic analysis identify a similar number of de-
tectors for each bucket, they do not fully overlap. Combining
both provides a slight increase in the presence of detectors
(1.7K sites).
Footnote 5: https://easylist.to/easylist/easylist.txt
Figure 3: Number of sites with bot detectors on front- and
subpages (depicted per 1K sites)
Figure 4: Detectors found on front pages
Deep scanning increases the rate of detection by 5 per
cent points.
As discussed in Section 2, 15% of studies con-
ducted with OpenWPM (also) investigated subpages. This
raises the question whether such studies are more often sub-
ject to bot detection, that is: does bot detection occur more
frequently on subpages? Figure 3 depicts the occurrence of
bot detectors on front pages and subpages. In general, studies
examining subpages are at greater risk of being detected: the
number of sites with active detectors increases by at least
37%. Hence, the average detection rate within the Top 100K
sites will increase. That is: the study will be exposed to more
detectors. Combining the results of both measurements, we
see an increase of 5 percentage points (from 14% to 19%).
5.3 RQ4: By whom is OpenWPM detected?
To explore this question, we separated detectors into first and
third parties. We find that the majority of sites includes detec-
tors from third-party domains. We count how often scripts on
these third-party domains are included on scanned sites, tal-
lying each third-party domain once per including site. Some
sites include more than one detector, hence the total num-
ber of inclusions exceeds the number of sites with detectors.
Overall, we count 3,867 first-party detector scripts and 21,325
third-party detector scripts.
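The counting rule described above (each third-party domain is tallied once per including site) can be sketched as follows. This is a minimal illustration and not the pipeline used for the measurements; the record shape `{ site, scriptDomain }` and the function name `tallyInclusions` are assumptions made for the example, and the first-party check is a plain string comparison rather than a proper registrable-domain comparison.

```javascript
// Hypothetical input: one record per detector script observed during a crawl.
// `site` is the visited site, `scriptDomain` the domain hosting the script.
function tallyInclusions(records) {
  const seen = new Set();      // "site|domain" pairs that were already counted
  const perDomain = new Map(); // third-party domain -> number of including sites
  for (const { site, scriptDomain } of records) {
    if (scriptDomain === site) continue; // first party (simplified check), not tallied here
    const key = `${site}|${scriptDomain}`;
    if (seen.has(key)) continue;         // count each domain once per site
    seen.add(key);
    perDomain.set(scriptDomain, (perDomain.get(scriptDomain) || 0) + 1);
  }
  return perDomain;
}

const example = tallyInclusions([
  { site: "news.example", scriptDomain: "ads.example" },
  { site: "news.example", scriptDomain: "ads.example" },  // second script, same domain
  { site: "news.example", scriptDomain: "chat.example" }, // second detector on the same site
  { site: "shop.example", scriptDomain: "ads.example" },
  { site: "shop.example", scriptDomain: "shop.example" }, // first-party script
]);
```

In this toy input, two sites yield three inclusions, which mirrors why the total number of inclusions can exceed the number of sites with detectors.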
Figure 5: Common categories of sites including detectors
First and third-party bot detection are used differently
among the industry.
We further explore what sites include
detectors, as this may provide a better view on what bot de-
tection is used for. For that, we collect categories for the
identified 16K websites with detectors based on Symantec’s
site review service (https://sitereview.norton.com/).
Sites may be assigned multiple categories; for such sites, we
tally each listed category. Figure 5 depicts the 16 most often
tallied categories for both first-party detectors (4,198 times)
and third-party detectors (16,323 times). We find that news
sites are responsible for 18.4% of all third-party inclusions,
followed by Technology (9%) and Business (7%). Interest-
ingly, the ranks for Shopping (16.4%) and News (5%) switch
for first-party detector inclusions. Moreover, sites in the cate-
gories Finance (8% vs 3%) and Travel (7% vs 2%) make up
for a larger portion in the set of first-party inclusions than for
third parties.
We believe that this uneven distribution of inclusions is
explainable. While every site owner will want to protect their
site from nefarious bots (and thus has reason to include first-party
detection), advertising has become a popular business model
for websites. For such sites, third parties have a vested interest
in detecting bots: to detect ad fraud. Thus, on such sites, one
would expect more third-party bot detection.
Third-party bot detection typically serves the advertisement
industry.
Following up on the previous point, we investigated the origins
of third-party detectors. Table 7 breaks down the most commonly
included domains. The top 10 domains account for over two thirds
of inclusions. The site WhoTracks.me [17] categorises trackers
according to purpose. Using this, we find that the bot-detecting
scripts on the most commonly included domains can serve a variety
of purposes. For example, yandex.ru offers scripts used for
advertising, content delivery network, site analytics, social
media, and others. Other uses include web analytics (crazyegg.com),
CDN (jsdelivr.net) and live chat (intercomcdn.com). However, bot
detection is most commonly deployed by advertisers (e.g.,
domains 2, 3, 4, 7, 9, and 10 in Table 7).
Table 7: Domains hosting 3rd-party detector scripts
     hosting domain          # inclusions (1/site)        %
     all                                    21,325     100%
 1   yandex.ru                               3,848   18.04%
 2   adsafeprotected.com                     2,309   10.83%
 3   moatads.com                             2,165   10.15%
 4   webgains.io                             2,091    9.81%
 5   crazyegg.com                            1,552    7.28%
 6   intercomcdn.com                         1,061    4.98%
 7   teads.tv                                  854    4.00%
 8   jsdelivr.net                              423    1.98%
 9   mxcdn.net                                 416    1.95%
10   mgid.com                                  402    1.89%
11+  remaining 704 domains                   6,204    29.1%
The vast majority of first-party detectors are embedded
third parties.
To determine the origins of first-party bot
detection scripts, we look for similarities between their inclu-
sions of detectors. To do so, we hash the scripts and check
for structural similarities in script URLs (for more details see
Appendix A). We found various similarities amongst unrelated
sites. Scripts originating from Akamai occur most frequently
(1,004 sites). Second is Incapsula (998 sites), third is
an unknown bot detector (659 sites), and fourth is Cloudflare
(486 sites). Together, these top four originators account for
3,147 out of 3,867 sites (81%) where we found first-party
detectors. In contrast to the purpose of third-party detectors,
first-party detectors are not supplied by advertisement compa-
nies. Moreover, Akamai, Incapsula and Cloudflare all offer
commercial bot detection services. With that in mind, one
should expect sites with first-party detectors to likely tailor
their responses for detected bots (e.g., throttling, blocking,
withholding resources, and serving CAPTCHAs).
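The hashing step mentioned above can be sketched as follows: identical script bodies served from the first-party domains of unrelated sites point to a common supplier. This is a simplified illustration under stated assumptions; the input shape `{ site, body }` and the function name `groupByContentHash` are hypothetical, and the structural URL comparison described in Appendix A is not reproduced here.

```javascript
const { createHash } = require("crypto");

// Hypothetical input: first-party detector scripts as { site, body } records.
// Returns groups of sites that serve byte-identical scripts, largest group first.
function groupByContentHash(scripts) {
  const groups = new Map(); // sha256 digest -> Set of sites serving that script
  for (const { site, body } of scripts) {
    const digest = createHash("sha256").update(body).digest("hex");
    if (!groups.has(digest)) groups.set(digest, new Set());
    groups.get(digest).add(site);
  }
  return [...groups.entries()]
    .filter(([, sites]) => sites.size > 1)  // keep scripts shared across sites
    .sort((a, b) => b[1].size - a[1].size);
}

const shared = groupByContentHash([
  { site: "a.example", body: "detectBot();" },
  { site: "b.example", body: "detectBot();" },
  { site: "c.example", body: "somethingElse();" },
]);
```

Exact hashing only catches verbatim copies; scripts that embed per-site tokens would need a looser comparison, which is presumably why URL structure is checked as well.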
6 Attacking JavaScript recording
We investigate whether a malicious website or third party
could corrupt OpenWPM’s data collection process. In partic-
ular, we consider an attacker that can deliver arbitrary content
(HTML, cookies, JavaScript), but cannot break the browser’s
security model. We focus on attacks against
the integrity or completeness of measurements. More specifically,
we aim to attack the resilience of OpenWPM’s most
commonly used instruments: HTTP traffic, cookie record-
ing, and JavaScript call recording. Both HTTP and cookie
instruments are simple wrappers around browser functionality.
Breaking them thus requires breaking the browser, which is
outside the attacker model. The JavaScript instrument, on the
other hand, needs to supply all its monitoring functionality
itself. It is therefore clearly in scope of our attacker model.
Since the instruments focus on data recording, we investi-
gate attacks on data recording. More specifically, we consider:
1. whether data recording may be prevented;
2. whether fake data can be injected into the data recorder;
3. whether already recorded data can be deleted or altered;
4. finally, whether the data recording is complete.
Instruments in OpenWPM are implemented as a browser
extension. Extensions are isolated to protect higher privilege
APIs from access by untrusted code. Website scripts thus
cannot directly interact with extensions. However, both ex-
tensions and website scripts can read and change the DOM,
opening the door for injection attacks against extensions that
read the DOM. We conducted source code analysis for each
instrument under investigation to identify vulnerabilities to
such attacks. Below we discuss the found vulnerabilities.
6.1 RQ5: How to prevent data recording?
We found a vulnerability that, when successfully exploited,
allows a website to break OpenWPM’s data recording hooks.
The vulnerability can be leveraged to turn off recording of
JavaScript calls in the JavaScript instrument.
More specifically: the JavaScript instrument overwrites
several API functions by hooking into the DOM’s event dis-
patcher (to record access to them). The event dispatcher then
sends messages to be recorded back to the JavaScript instru-
ment’s back end. To prevent an attacker from silently undoing
these hooks, OpenWPM also hooks into (and thus: records ac-
cess to) setters and getters to these API functions themselves.
However, the event dispatcher itself is not protected. We thus
can alter the event dispatcher to inject our own messages and
manipulate messages sent to OpenWPM (cf. Listing 2). To
carry out this attack, the attacker overrides the event dispatcher
to block all messages (all events from instrumented objects).
This would already block OpenWPM recording, by breaking
any JavaScript API calls. However, this also would break a
website’s own JavaScript. To block only OpenWPM messages,
the block needs to be tailored. Conveniently, OpenWPM tags
messages with an ID to identify any monitored objects. Though
this ID is randomly generated, it can easily be determined:
simply trigger an API call to a monitored object, acquire the
random ID from the observed message, and update the event
dispatcher to only block messages containing this ID.
6.2 RQ6: Can fake data be injected?
The previous attack, altering the event dispatcher, not only
allows an attacker to block data recording, it also allows an
attacker to learn the ID OpenWPM uses to record data. This
is sufficient to inject almost arbitrary messages to be recorded.
The attacker simply creates a custom event following the format
used by OpenWPM’s JavaScript extension and includes OpenWPM’s
assigned event ID. This enables an attacker to define most of the
content of the resulting entry in OpenWPM’s recording, such as
the executing script URL or which function was called. Crucially,
though, the website that originated the call is set outside of the
browser by OpenWPM. The data sent by the event dispatcher is
properly sanitized by the back-end, which prevents spoofing this.
We can thus only inject fake data for the currently visited
website. Note that a third party included on the site can also
execute this attack.

// Keep a bound reference to the original dispatcher
const dispatch_fn = document.dispatchEvent.bind(document);
// Step I: Retrieve OpenWPM's random ID
function grabID() {
  return new Promise((resolve, reject) => {
    document.dispatchEvent = function (event) {
      const id = event.type;
      document.dispatchEvent = dispatch_fn; // restore the dispatcher
      if (id !== undefined) {
        resolve(id);
      } else {
        reject(new Error("Could not determine the event ID"));
      }
    };
    // Perform an action to grab the ID
    navigator.userAgent;
  });
}
// Step II: Overwrite event dispatcher to block events
async function attackExtension() {
  const id = await grabID();
  document.dispatchEvent = (event) => {
    if (event.type !== id) {
      return dispatch_fn(event); // Dispatch event
    }
    console.log("Event swallowed: " + event);
  };
}
Listing 2: Turn off the script recorder
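The injection idea can be illustrated outside a browser with a small simulation. The sketch below uses a plain `EventTarget` as a stand-in for `document`, and the payload fields (`symbol`, `scriptUrl`, `site`) are hypothetical; OpenWPM's real message format is not reproduced here. It only shows the principle: once the event ID has leaked, a forged event is recorded like a genuine one, while the visited site is still assigned by the recorder.

```javascript
// Simulation only: `doc` stands in for the page's `document`.
const VISITED_SITE = "site.example"; // set by the recorder, not by page content
const doc = new EventTarget();

// Stand-in for the instrument: listens on a random event ID and records payloads.
const secretId = "id" + Math.random().toString(36).slice(2);
const recorded = [];
doc.addEventListener(secretId, (e) => recorded.push({ ...e.detail, site: VISITED_SITE }));

// An instrumented API call reports itself through the unprotected dispatcher.
function instrumentedCall(symbol) {
  const ev = new Event(secretId);
  ev.detail = { symbol, scriptUrl: "https://site.example/real.js" };
  doc.dispatchEvent(ev);
}

// Attacker, step 1: learn the ID by temporarily replacing the dispatcher.
const originalDispatch = doc.dispatchEvent.bind(doc);
let leakedId;
doc.dispatchEvent = (ev) => {
  leakedId = ev.type;
  return originalDispatch(ev);
};
instrumentedCall("navigator.userAgent"); // any monitored call leaks the ID
doc.dispatchEvent = originalDispatch;

// Attacker, step 2: forge a record for a call that never happened.
const forged = new Event(leakedId);
forged.detail = {
  symbol: "window.neverCalled",
  scriptUrl: "https://innocent.example/lib.js",
  site: "other.example", // attempt to spoof the site; overwritten by the recorder
};
doc.dispatchEvent(forged);
```

After running this, `recorded` holds the genuine entry and the forged one, and both carry the recorder's own `site` value.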
6.3 RQ7: Can records be deleted or altered?
Whereas the previous attacks exploited a vulnerability in
the DOM-parsing front-end of the respective instruments,
deleting already recorded data requires manipulating an in-
strument’s back-end, for OpenWPM: SQLite. Attacking a
database back-end requires an SQL injection vulnerability.
As already mentioned, OpenWPM’s data recording back-end
properly sanitizes its inputs. This means that there is no pos-
sibility for an SQL injection via JavaScript recording. There-
fore, we conclude that it is not feasible to delete or alter al-
ready recorded data from OpenWPM’s SQLite database.
6.4 RQ8: Is data recording complete?
We investigated whether data recording is complete. We found
two different attacks against completeness: existence of unob-
served channels, and silent delivery of JavaScript code.
Existence of unobserved channels:
During our evalua-
tion, we found a way to bypass OpenWPM’s recording of
JavaScript function calls. This attack again exploits Open-
WPM’s hooks to record function calls. In particular, the hooks
must be attached to every object that is to be observed. For ev-
ery new window or iframe, this must be done afresh. However,
there is a long-standing bug in Chrome and Firefox (cf., [67]),
where both browsers under some circumstances fail to inject
scripts into iframes. We tested whether OpenWPM’s implementation
is affected by this and found that this is indeed the case.
Our evaluation of this attack involves two different ways to
access an iframe’s DOM (footnote 6) and different ways to
create/execute iframes and their code: static vs. dynamic
creation and immediate vs. delayed execution. Of these, immediate
code execution (at creation time) is required to successfully
exploit this bug. None of the other parameters we tested
influenced the result. Listing 3 shows a proof-of-concept of this
type of attack.
Silent delivery of JavaScript code:
Note that the aforementioned attacks based on JavaScript would
appear in OpenWPM recordings if the HTTP instrument is used,
since that instrument collects response bodies. That is, unless
this instrument’s recording can also be bypassed. We indeed
managed to extend our previous attacks to be silently transferred
to OpenWPM. For that, we looked at the two options that OpenWPM
offers to collect response bodies. OpenWPM either stores all
response bodies (full coverage), or it can be set to store
JavaScript files only. The latter option significantly reduces
stored content. For this mode, we found that an attacker can
silently deliver JavaScript code by sending it as text and
processing it client-side, e.g., by including a line like
<link src="server.com/payload" content-type="text/plain">
in the HTML source. To successfully bypass OpenWPM’s traffic
recording of JS files, three aspects must be accounted for:
i. The content-type attribute must be set to something other
than text/javascript;
ii. The src attribute must not contain a “.js” extension;
iii. The delivered file is not automatically executed; this must
be handled by a different client-side script (e.g., using
eval()).
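The three conditions can be made concrete with a small sketch. The predicate `looksLikeJavaScript` below is a hypothetical stand-in for a "JavaScript files only" filter and is not OpenWPM's actual code; it merely encodes conditions (i) and (ii) as stated above. The response objects and URLs are invented for illustration.

```javascript
// Hypothetical filter encoding conditions (i) and (ii); NOT OpenWPM's implementation.
function looksLikeJavaScript(url, contentType) {
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  const path = new URL(url).pathname.toLowerCase();
  return type === "text/javascript" || path.endsWith(".js");
}

// Invented example responses: a regular script and a payload disguised as text.
const responses = [
  { url: "https://site.example/app.js", contentType: "text/javascript", body: "/* regular script */" },
  { url: "https://server.example/payload", contentType: "text/plain", body: "globalThis.payloadRan = true;" },
];

// Only responses that pass the filter would end up in the recording.
const stored = responses.filter((r) => looksLikeJavaScript(r.url, r.contentType));

// Condition (iii): the payload is not executed automatically, so a separate
// client-side script has to run it, here via an indirect eval().
const payload = responses[1];
(0, eval)(payload.body);
```

The payload executes although it never appears among the stored bodies, which is why storing all response bodies is the safer setting when an active adversary is assumed.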
// Operation will not appear in the recordings.
setTimeout(() => {
let element = document.querySelector("#unobserved");
let iframe = document.createElement('iframe');
// HTML code for instantiating an iFrame
iframe.src = "unobserved-iframe.html";
element.appendChild(iframe);
iframe.contentWindow.navigator.userAgent;
}, 500);
Listing 3: Example of an unobserved channel
7 Improving OpenWPM’s reliability
This section focuses on OpenWPM’s reliability as an instrument
measuring the web as encountered by regular visitors. We explore
how and to what extent reliability can be improved. To do so, we
design an approach to hardening OpenWPM’s instrumentation and to
hiding its distinctive fingerprint (from here on referred to as
WPM_hide). Our proof-of-concept successfully hides the telltale
signs of OpenWPM from its fingerprint and makes OpenWPM robust in
the face of the discussed attacks in a lab setting. To evaluate
its effectiveness in an open world setting, we run WPM_hide
against detectors in the wild and contrast its measurements with
those of a regular OpenWPM client.
6 window.frames[0], and frame.contentWindow
7.1 RQ9: How to hide the fingerprint surface?
OpenWPM’s characteristic fingerprint varies with the vari-
ous modes of running OpenWPM. For example, in headless
Firefox mode, the fingerprint surface is difficult to hide due
to headless mode’s lack of functionality when compared to
regular browsers. Hence, we focus on run modes where Open-
WPM runs the browsers natively (Regular Mode). For such
modes, we achieve stealth by overriding properties without
leaving traces. These techniques can also be applied in other
run modes (e.g., virtualisation).
The identifying properties for Regular Mode (see Table 2) relate
to the webdriver property, window position, and dimension.
Of OpenWPM’s various instruments, only
the JavaScript instrument causes further identifiable proper-
ties. Hiding these properties can be achieved by a customized
browser, or by including additional code inside a page’s scope.
Implementing the former requires significant work, but it can
hide the fingerprint near-perfectly. The latter approach is far
simpler to implement but risks leaving residual traces. For
our proof-of-concept, we choose the second option, as it can
be seamlessly integrated within the current OpenWPM frame-
work without significant effort.
Our proof-of-concept must address two aspects: hiding the
automation components and preventing detection of instrumentation.
To prevent detection of instrumentation, four issues need fixing
(Sec. 4.1): (1) calling the toString operation of overwritten
functions must return the regular output string for browser
functions; (2) no additional property may appear in the DOM;
(3) stack traces must not show any signs of the instrumentation;
(4) prototype pollution must be avoided. Lastly, hiding the
automation components requires hiding their detectable aspects,
similar to how toString must appear unchanged.
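To illustrate issue (1), the following sketch shows the kind of check a page script could run: a function that was replaced by a plain JavaScript wrapper reveals its source text through toString, whereas a built-in function reports "[native code]". The helper name `looksNative` and the wrapper are illustrative assumptions and are not taken from any particular detector.

```javascript
// Returns true if the function's source text looks like a built-in function.
function looksNative(fn) {
  const source = Function.prototype.toString.call(fn);
  return /\{\s*\[native code\]\s*\}$/.test(source);
}

// A naive instrumentation wrapper around a built-in function.
const original = Math.max;
function naiveWrapper(...args) {
  // (recording of the call would happen here)
  return original.apply(this, args);
}

const builtinLooksNative = looksNative(Math.max);     // built-in: native code string
const wrapperLooksNative = looksNative(naiveWrapper); // wrapper: source is visible
```

The wrapper still computes the same results as the original, so only its toString output gives it away.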
Preserve toString output.
For the first issue, we found that CanvasBlocker (footnote 7)
addresses this well. Its implementation successfully fools all
our fingerprinting tests (Sec. 4). CanvasBlocker creates a getter
function with an identical signature to the function that must be
overwritten and attaches it to the DOM based on a specific
Firefox feature called exportFunction. The newly exported
function is then used to redefine the getter of an object’s
prototype for a specific property. As a result, the overwritten
function returns the native code string like a default browser
property (cf. Listing 1).
7 https://github.com/kkapsner/CanvasBlocker
Normally, accessing the getter of an object’s prototype leads
to an error. If this getter is replaced with a custom getter, that
error is never thrown. This makes tampering with properties
via an object’s prototype detectable [34]. Calling the original
getter from the customised getter results in the original error
being thrown, addressing this aspect of the fingerprint surface.
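The getter behaviour described above can be reproduced outside a browser. The sketch below uses `Map.prototype.size` as a stand-in for a DOM property such as `navigator.webdriver`, because its native getter also throws a TypeError when accessed directly on the prototype. The choice of `Map`, the `accessLog` array, and the helper name are assumptions made for illustration only.

```javascript
const desc = Object.getOwnPropertyDescriptor(Map.prototype, "size");
const originalGetter = desc.get;
const accessLog = [];

// Detection check: does reading the property on the prototype itself still throw?
function throwsOnPrototype() {
  try {
    Map.prototype.size;
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

let untouchedThrows, naiveThrows, forwardingThrows, sizeViaForwarding;
try {
  untouchedThrows = throwsOnPrototype(); // native getter: error is thrown

  // Naive replacement: never throws, so tampering is detectable.
  Object.defineProperty(Map.prototype, "size", {
    ...desc,
    get() { accessLog.push("size"); return 0; },
  });
  naiveThrows = throwsOnPrototype();

  // Forwarding replacement: records the access, then calls the original getter
  // with the same receiver, so the original error surfaces again.
  Object.defineProperty(Map.prototype, "size", {
    ...desc,
    get() { accessLog.push("size"); return originalGetter.call(this); },
  });
  forwardingThrows = throwsOnPrototype();
  sizeViaForwarding = new Map([[1, 2]]).size;
} finally {
  Object.defineProperty(Map.prototype, "size", desc); // always restore the original
}
```

Forwarding to the original getter keeps regular accesses working while preserving the error on the prototype, which is the behaviour a detector would expect from an untouched browser.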
Preserve clean DOM.
The second issue arises during page
load, prior to the page’s JavaScript activation. The instru-
mentation injects its code as script from the content context
into the page context, overwrites the needed properties, and
removes its code from the page context again. However, in
practice, not all injected functions are deleted. We update
the instrument to overwrite all functionality directly from the
content context, thus keeping the page context clean.
Faking stack traces.
The third issue requires the stack
trace to show no signs of instrumented functions. A web
page can only access stack traces if errors occur. Normally,
if an error occurs, the stack trace would show that the called
function is called from inside the instrumentation. We address
this by catching each error and throwing a new error with
properly adjusted values for file name, column, message, and
line number.
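A simplified version of this idea is sketched below. It differs from the approach described above in one respect that should be stated plainly: instead of adjusting Firefox-specific error fields (file name, line, and column number), it only filters the wrapper's frames out of the textual stack, which is what is portable across engines. The names `instrument` and `instrumentedWrapper` are hypothetical.

```javascript
// Wraps `original` so that calls are recorded, and rethrows errors without
// stack frames that mention the wrapper itself.
function instrument(original, record) {
  return function instrumentedWrapper(...args) {
    record(original.name);
    try {
      return original.apply(this, args);
    } catch (err) {
      if (!(err instanceof Error)) throw err; // non-Error values carry no stack
      const clean = new err.constructor(err.message);
      clean.stack = String(err.stack)
        .split("\n")
        .filter((line) => !line.includes("instrumentedWrapper"))
        .join("\n");
      throw clean;
    }
  };
}

const calls = [];
function failing() {
  throw new TypeError("boom");
}
const wrapped = instrument(failing, (name) => calls.push(name));
```

A page that inspects `error.stack` after calling `wrapped()` sees the original error type and message, but no frame pointing into the instrumentation.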
Avoid prototype pollution.
The last issue relates to the pol-
lution of an object’s prototype, as OpenWPM’s instrument
modifies only the first prototype in the prototype chain. We
address this by overwriting properties per prototype. Unfortu-
nately, this approach has its own limitation, as it is not possible
to determine the caller of a function, when a prototype has
multiple children. Especially for prototypes located higher up
the chain, the number of children increases, raising the
potential to capture unwanted API calls on other child objects.
To test our implementation, we instrumented the same API
calls as used by OpenWPM. Luckily, most of these APIs
are provided by prototypes close to the bottom, which allows
us to cover a wide set of OpenWPM’s instrumented APIs.
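The per-prototype approach and its limitation can be sketched with ordinary classes. The helper names `findOwner` and `instrumentGetter` and the classes `Base`, `Child`, and `Sibling` are illustrative assumptions; the point is that the property is redefined on the prototype that actually owns it, so no new own property appears further down the chain.

```javascript
// Walk up the prototype chain until the object that owns `prop` is found.
function findOwner(obj, prop) {
  for (let p = obj; p !== null; p = Object.getPrototypeOf(p)) {
    if (Object.prototype.hasOwnProperty.call(p, prop)) return p;
  }
  return null;
}

// Redefine the getter on its owning prototype instead of on `obj` itself.
function instrumentGetter(obj, prop, record) {
  const owner = findOwner(obj, prop);
  if (owner === null) return false;
  const desc = Object.getOwnPropertyDescriptor(owner, prop);
  if (typeof desc.get !== "function") return false;
  Object.defineProperty(owner, prop, {
    ...desc,
    get() {
      record(prop);
      return desc.get.call(this);
    },
  });
  return true;
}

class Base { get label() { return "base"; } }
class Child extends Base {}
class Sibling extends Base {}

const accesses = [];
instrumentGetter(Child.prototype, "label", (p) => accesses.push(p));
const childLabel = new Child().label;     // intended target, recorded
const siblingLabel = new Sibling().label; // also recorded: the limitation noted above
```

Because `Sibling` shares the instrumented prototype, its accesses are captured too, which is the unwanted-capture effect that grows for prototypes higher up the chain.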
Preventing detection of automation components.
The automation components are detectable by window size, window
position and the webdriver attribute. For the latter, our hidden
version must set the navigator.webdriver property to false like
a regular Firefox browser. Since Firefox version 88, this flag is
not user-settable (footnote 8). We override the getter function
of the navigator.webdriver property in the same fashion as
described in the previous section. To change OpenWPM’s default
window settings, we introduce a settings file that makes the
window size and position settable in OpenWPM.
8 https://bugzilla.mozilla.org/show_bug.cgi?id=1632821
7.2 RQ10: How to mitigate recording attacks?
Securing messaging from page context to background
context (see Sec. 6.1, 6.2).
A key benefit from migrating to Firefox’s exportFunction, as
described in the previous section, is the ability to export
higher privileged browser functions into the page. Hence, we can
port functionality to the page context that is otherwise only
available for content or background scripts of a browser
extension. We use this to secure our instrumented functions, as
we now can use the browser.runtime API to pass messages from
the page to
the background context. It is crucial that such functionality
is exported to a private scope of an overwritten function to
prevent access by other scripts in the page context. This pre-
vents the ‘turn recording off’ and ‘inject fake data’ attacks, as
an attacker cannot manipulate message transmissions to the
background script.
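The role of the private scope can be illustrated with a plain closure. In the sketch below, `send` stands in for a privileged messaging function that would be exported into the page; here it simply appends to an array. The names `makeInstrumented`, `page`, and `recorded` are hypothetical, and no extension API is actually invoked.

```javascript
// `send` is captured in the closure only; page scripts never get a reference to it.
function makeInstrumented(original, name, send) {
  return function (...args) {
    send({ symbol: name, argCount: args.length });
    return original.apply(this, args);
  };
}

// Stand-ins for the page: a public dispatcher and one API to be monitored.
const recorded = [];
const page = {
  dispatchEvent() { return true; },
  getValue() { return 42; },
};
page.getValue = makeInstrumented(page.getValue, "page.getValue", (msg) => recorded.push(msg));

// The attacker tampers with the public dispatcher, as in the earlier attacks.
page.dispatchEvent = () => {
  throw new Error("blocked");
};

const value = page.getValue(); // still recorded: the message never passes the dispatcher
```

Since the message path no longer runs through an object the page can overwrite, both blocking and forging messages lose their entry point.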
Improving coverage of the JavaScript instrument (see
Sec. 6.4).
To address the tested variants of incomplete
recordings, we use CanvasBlocker’s frame protection. The
basic idea is to intercept APIs used by page scripts to mod-
ify the DOM or create a new, non-instrumented copy of the
DOM. This ensures that each modification or newly constructed
DOM contains the instrumentation. Our implementation covers five
cases: window constructors, the DOM modification API, window
mutations, DOM creation via the document.write API, and finally
the window.open API.
Filtering of the HTTP file recorder (see Sec. 6.4).
To the
best of our knowledge, there is no known way to distinguish
JavaScript code from text that is robust against a dedicated
obfuscator. Therefore, an active adversary should be assumed
to be capable of hiding JavaScript in a way that would ac-
cidentally be filtered out. Since this issue only arises in the
presence of active adversaries, we recommend in such a case
not to use any filtering.
7.3 Evaluation of PoC implementation
We developed a proof-of-concept implementation to hide
the tell-tale signs of automation and to mitigate the found
attacks. We evaluate the impact of our proof-of-concept
implementation (from here on, WPM_hide) on web measurements
when encountering bot detection in the wild. To that end, we
contrast its results with vanilla OpenWPM (from here on:
WPM) in HTTP traffic, cookies, JavaScript execution, and
delivered JavaScript files. We test on all sites with bot detec-
tors (as found by dynamic analysis) from the Tranco Top 5K
(see Sec. 5). This list contains 1,417 sites with either
first-party or third-party detectors. On these sites, we run WPM
and WPM_hide in parallel (OpenWPM v.0.18.0, regular mode,
HTTP, JavaScript and cookie instrument activated) and con-
figure each browser to idle 60 seconds on a page after loading
completed. We take steps to mitigate noise in measurements.
In particular, we avoid cross-client interferences by separating
both crawlers via two individual machines and IP addresses.
Each IP address belongs to a residential network and comes
from the same municipality and internet provider, which avoids
differences caused by cloud-based IP blocking [37] and
geolocation. Secondly, we re-synchronise the machines every 100
visits. This ensures that sites are loaded roughly simultaneously
on both machines (max. offset is below four minutes).
Sites that detect OpenWPM serve fewer media resources.
In our experiment, we found that WPM_hide encounters 3.45% more
HTTP requests. As our data set is not normally distributed, we
tested for significance using the Wilcoxon signed-rank test at a
95% confidence level. For that, we divided the traffic into
first- and third-party requests and find significant differences
for HTTP requests to both first and third parties
(p-value < 0.0001). In more detail, we found for WPM that 175
sites (12%) lead to more first-party and 472 sites (33%) to more
third-party requests. For WPM_hide, we count 400 sites (28%) with
more first-party and 654 sites (46%) with more third-party
requests. This indicates a stronger variability in third-party
traffic, leaning towards less detectable clients.
Table 8 shows requests for each machine per requested resource
type (footnote 9). The table shows that WPM_hide receives roughly
46% more audio and video files (type media). Moreover, the number
of requested images (image and imageset) is increased by about
3%, and executable code (script) by about 4%. In addition, WPM
incurs three times the number of CSP violations, though this may
also be due to embedding more JavaScript in the page context.
Finally, the difference in websocket requests is due to a single
outlier. Thus, we do not expect websocket requests to change
significantly between WPM and WPM_hide.
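The "Diff." column of Table 8 is the relative change from WPM to WPM_hide. The short sketch below recomputes it for a few rows, using the request counts printed in the table; the function name `relativeDiff` and the object layout are assumptions made for the example.

```javascript
// Relative change in percent, taking the vanilla client (WPM) as the baseline.
function relativeDiff(wpm, wpmHide) {
  return ((wpmHide - wpm) / wpm) * 100;
}

// [WPM, WPM_hide] request counts for selected rows of Table 8.
const table8 = {
  csp_report: [884, 298],
  media: [378, 552],
  script: [73527, 76430],
  total: [265770, 274942],
};

const diffs = Object.fromEntries(
  Object.entries(table8).map(([type, [wpm, hide]]) => [type, relativeDiff(wpm, hide).toFixed(2)])
);
```

Here `diffs` maps each resource type to its percentage change rounded to two decimals, for instance "46.03" for media and "3.45" for the total.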
Equivalent amount of ads/trackers traffic.
To assess the amount of trackers and advertisers in traffic, we
use the same approach as previous works [5, 14, 41]: use the
EasyList and EasyPrivacy blocklists (footnote 10) to identify
trackers. Our results show that WPM and WPM_hide encounter a
near equal rate of advertisers and trackers. For WPM, ads and
trackers account for 14.3% and 11.6% of total traffic. For
WPM_hide, this is 14.2% and 11.5%, respectively, which is almost
equivalent.
9 https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType
10 https://easylist.to/
Table 8: Comparison of HTTP request resource types
Resource type         WPM   WPM_hide     Diff.
csp_report            884        298   -66.29%
websocket             467        242   -48.18%
media                 378        552   +46.03%
beacon              3,804      4,453   +17.06%
imageset            4,888      5,432   +11.13%
xmlhttprequest     46,199     49,398    +6.92%
script             73,527     76,430    +3.95%
object                 53         55    +3.77%
other                  92         95    +3.26%
main_frame          3,883      3,757    -3.24%
image             101,256    103,801    +2.51%
sub_frame          11,119     10,885    -2.10%
stylesheet          9,663      9,840    +1.83%
font                9,557      9,704    +1.54%
Total             265,770    274,942    +3.45%
Large differences in served cookies.
For cookies, we contrasted the number of cookies between both
variants. We found that these differ significantly for both first
parties and third parties (p-value < 0.0001). Specifically, 305
sites serve WPM_hide more first-party cookies, while only 146
sites serve WPM more first-party cookies. Interestingly, the
opposite is true for third parties. Here we find 824 sites whose
third parties offer WPM more cookies than WPM_hide; the other way
around happens for the third parties of 227 sites. In total, the
number of cookies is 55,853 (WPM) vs. 46,736 (WPM_hide).
Using WPM_hide thus leads to a decrease of 16.32% of cookies.
We also looked at cookies as possible means to track users.
To determine whether a cookie can be used for web tracking,
we use the approach of Englehardt et al. [28], as refined by
Chen et al. [15]. According to this method, a cookie may be
used for tracking when: (1) it is not a session cookie, (2)
the length of the cookie is 8 or more characters (excluding
surrounding quotes), (3) the cookie is always set, and (4) the
values differ significantly based on the Ratcliff-Obershelp
algorithm [10]. While 5,307 cookies satisfy these criteria for
WPM, only 2,282 cookies match for WPM_hide; a decrease of
57%.
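The four criteria can be sketched as a small classifier. This is an illustration under stated assumptions and not the implementation of the cited works: the observation format (one entry per independent visit, `null` if the cookie was not set), the requirement of at least two observations, and the similarity threshold `MAX_SIMILARITY` are all hypothetical choices. The similarity function follows the Ratcliff/Obershelp idea of recursively matching the longest common substring.

```javascript
// Longest common substring of a and b: start offsets in both strings and length.
function longestCommonSubstring(a, b) {
  let best = { i: 0, j: 0, len: 0 };
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best.len) best = { i: i - cur[j], j: j - cur[j], len: cur[j] };
      }
    }
    prev = cur;
  }
  return best;
}

// Matching characters: longest common substring plus, recursively, the matches
// in the pieces to its left and to its right.
function matchingChars(a, b) {
  const { i, j, len } = longestCommonSubstring(a, b);
  if (len === 0) return 0;
  return len +
    matchingChars(a.slice(0, i), b.slice(0, j)) +
    matchingChars(a.slice(i + len), b.slice(j + len));
}

// Ratcliff/Obershelp similarity in [0, 1].
function similarity(a, b) {
  if (a.length + b.length === 0) return 1;
  return (2 * matchingChars(a, b)) / (a.length + b.length);
}

const MAX_SIMILARITY = 0.66; // illustrative threshold, an assumption of this sketch

// observations: [{ value, session }] per visit, or null where the cookie was missing.
function mayBeTrackingCookie(observations) {
  if (observations.length < 2) return false;
  if (observations.some((o) => o === null)) return false;  // (3) always set
  if (observations.some((o) => o.session)) return false;   // (1) not a session cookie
  const values = observations.map((o) => o.value.replace(/^"|"$/g, ""));
  if (values.some((v) => v.length < 8)) return false;      // (2) at least 8 characters
  for (let x = 0; x < values.length; x++) {
    for (let y = x + 1; y < values.length; y++) {
      if (similarity(values[x], values[y]) > MAX_SIMILARITY) return false; // (4) values differ
    }
  }
  return true;
}
```

A cookie that carries the same value on every visit fails criterion (4), since such a value cannot distinguish one client from another.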
8 Conclusions
Reliability of automated measurements on trial.
Our
work demonstrates that OpenWPM is susceptible to attacks
threatening its reliability. In particular: virtualisation makes
scaling web studies easy, but turned out to undermine Open-
WPM’s reliability as a measurement tool. It is an open ques-
tion whether other automation / measurement frameworks
suffer similarly from virtualisation.
Bot detection on the rise.
In comparison with previous studies, we see the number of sites
looking for the webdriver property has significantly increased in
the span of less than one year (Tbl. 9). This rapid change
clearly suggests that web sites are swiftly transitioning to
responding differently to automated clients than to regular
clients. Web studies should therefore no longer ignore the
potential impact of bot detection on their study.
Table 9: Studies measuring webdriver property access on front pages
             when      analysis   corpus        # sites        %
[40]         2019–10   dynamic    Alexa 50K       2,756    5.51%
This paper   2020–07   combined   Tranco 100K    13,989   13.99%
                       static                    11,957   11.96%
                       dynamic                   12,194   12.19%
Towards robust instrumentation.
Our findings highlight
the difficulties of deploying instruments via the page context.
To improve robustness, we advocate moving the instruments
outside of page scope. To achieve this, the debugger API
could be leveraged. However, OpenWPM uses Selenium v3,
which does not support this (planned for Selenium v4). Alter-
natively, instrumentation could be integrated in the browser’s
source code. This would give great flexibility in hiding dis-
tinctive aspects of the browser fingerprint. This would also
incur significant additional maintenance overhead slowing
adoption of new browser versions. However, OpenWPM’s
rate of adoption is already slow; the trade-off may thus be
worth it.
Advice for conducting a web measurement study.
While
the evaluation of our proof-of-concept is limited in scope,
we still find significant differences in a variety of attributes.
While studies that focus on the amount of traffic seem to be
(for now) in the clear, studies that focus on audio/video files or
web tracking via cookies must take bot detection into account
(Table 8). Similarly, studies that automatically crawl beyond
the front page will encounter more bot detectors (Table 1).
Ethics.
Our work aims to make OpenWPM a more reliable
measurement framework. We responsibly disclosed our findings
and shared fixes for the identified issues. This helps make
OpenWPM less detectable, and therefore its results more reli-
able. Of course, a less detectable scraper may itself be abused.
For attacking specific sites, our improvements do not greatly
impact the attack surface: a less detectable OpenWPM is a fine
tool for studying thousands of sites, but not for a targeted at-
tack on a specific site. For attacks that span thousands of sites
(e.g., clickfarming), our improvements do not help: disguising
as a regular browser is insufficient to overcome contemporary
defenses. For that, site-specific fingerprints are needed [72].
Thus, existing re-identification-based countermeasures (e.g.,
rate limiting) are not impacted.
Availability & responsible disclosure.
Our stealth extension is available via GitHub (footnote 11).
We disclosed our findings (both attacks and identifiable
properties) to the OpenWPM developers. We are working towards
having our fixes integrated into the framework.
References
[1]
Gunes Acar, Steven Englehardt, and Arvind Narayanan.
No boundaries: data exfiltration by third parties embed-
ded on web pages. Proc. Priv. Enhancing Technol.,
2020(4):220–238, 2020.
[2]
Gunes Acar, Christian Eubank, Steven Englehardt, Marc
Juárez, Arvind Narayanan, and Claudia Díaz. The web
never forgets: Persistent tracking mechanisms in the
wild. In Proc. 21st ACM SIGSAC Conference on Com-
puter and Communications Security (CCS’14), pages
674–689. ACM, 2014.
[3]
adtechmadness. Bot detection 101 #2 entering
browser fingerprinting.
https://adtechmadness.wordpress.com/2019/03/05/bot-detection-101-2-entering-browser-fingerprinting/,
2019. last access: May 19, 2022.
[4]
Pushkal Agarwal, Sagar Joglekar, Panagiotis Papadopou-
los, Nishanth Sastry, and Nicolas Kourtellis. Stop track-
ing me bro! differential tracking of user demographics
on hyper-partisan websites. In Proc. The Web Confer-
ence 2020, pages 1479–1490. ACM / IW3C2, 2020.
[5]
Syed Suleman Ahmad, Muhammad Daniyal Dar,
Muhammad Fareed Zaffar, Narseo Vallina-Rodriguez,
and Rishab Nithyanand. Apophanies or epiphanies?
How crawlers impact our understanding of the web.
In Proc. The Web Conference 2020, WWW ’20, page
271–280. ACM, 2020.
[6]
Suzan Ali, Tousif Osman, Mohammad Mannan, and
Amr M. Youssef. On privacy risks of public wifi cap-
tive portals. In DPM/CBT@ESORICS, volume 11737
of Lecture Notes in Computer Science, pages 80–98.
Springer, 2019.
[7]
Ibrahim Altaweel, Nathan Good, and Chris Jay Hoof-
nagle. Web privacy census. Technology Science,
2015121502, 2015.
11 https://github.com/bkrumnow/OpenWPM/tree/stealth_extension
[8]
Amelia Andersdotter and Anders Jensen-Urstad. Evaluating
websites and their adherence to data protection
principles: Tools and experiences - contributions to IFIP
summer school proceedings. In Privacy and Identity
Management, volume 498 of IFIP Advances in Information
and Communication Technology, pages 39–51, 2016.
[9]
Reuben Binns, Jun Zhao, Max Van Kleek, and Nigel
Shadbolt. Measuring third-party tracker power across
web and mobile. ACM Trans. Internet Techn.,
18(4):52:1–52:22, 2018.
[10]
Paul E. Black. Ratcliff/obershelp pattern recognition.
https://www.nist.gov/dads/HTML/ratcliffObershelp.html,
2021. last access: May 19, 2022.
[11]
Justin Brookman, Phoebe Rouge, Aaron Alva, and
Christina Yeung. Cross-device tracking: Measure-
ment and disclosures. Proc. Priv. Enhancing Technol.,
2017(2):133–148, 2017.
[12]
Stefano Calzavara, Tobias Urban, Dennis Tatang, Marius
Steffens, and Ben Stock. Reining in the web’s incon-
sistencies with site policy. In Proc. 28th Network and
Distributed Systems Security Symposium (NDSS’21),
NDSS 2021, pages 1–16. The Internet Society, February
2021.
[13]
Stefano Calzavara, Tobias Urban, Dennis Tatang, Marius
Steffens, and Ben Stock. Reining in the web’s incon-
sistencies with site policy. In Proc. 28th Network and
Distributed System Security Symposium (NDSS’21). The
Internet Society, 2021.
[14]
Darion Cassel, Su-Chin Lin, Alessio Buraggina, William
Wang, Andrew Zhang, Lujo Bauer, Hsu-Chun Hsiao,
Limin Jia, and Timothy Libert. Omnicrawl: Compre-
hensive measurement of web tracking with real desktop
and mobile browsers. Proc. 22nd Privacy Enhancing
Technologies Symposium (PETS’22), 2022(1):227–252,
2022.
[15]
Quan Chen, Panagiotis Ilia, Michalis Polychronakis, and
Alexandros Kapravelos. Cookie swap party: Abusing
first-party cookies for web tracking. In Proc. The Web
Conference 2021, pages 2117–2129. ACM / IW3C2,
2021.
[16]
Zi Chu, Steven Gianvecchio, Aaron Koehl, Haining
Wang, and Sushil Jajodia. Blog or block: Detecting
blog bots through behavioral biometrics. Computer
Networks, 57(3):634–646, 2013.
[17]
Cliqz GmbH. Whotracks.me - learn about tracking
technologies, market structure and data-sharing on the
web. https://whotracks.me/, 2021. last access: May
19, 2022.
[18]
John Cook, Rishab Nithyanand, and Zubair Shafiq. In-
ferring tracker-advertiser relationships in the online ad-
vertising ecosystem using header bidding. Proc. Priv.
Enhancing Technol., 2020(1):65–82, 2020.
[19]
Vittoria Cozza, Van Tien Hoang, Marinella Petrocchi,
and Rocco De Nicola. Transparency in keyword faceted
search: An investigation on google shopping. In IR-
CDL, volume 988 of Communications in Computer and
Information Science, pages 29–43. Springer, 2019.
[20]
Ha Dao and Kensuke Fukuda. Characterizing CNAME
cloaking-based tracking on the web. In TMA. IFIP,
2020.
[21]
Ha Dao and Kensuke Fukuda. A machine learning
approach for detecting CNAME cloaking-based tracking
on the web. In GLOBECOM, pages 1–6. IEEE, 2020.
[22]
Ha Dao, Johan Mazel, and Kensuke Fukuda. Under-
standing abusive web resources: characteristics and
counter-measures of malicious web resources and cryp-
tocurrency mining. In AINTEC, pages 54–61. ACM,
2018.
[23]
Anupam Das, Gunes Acar, Nikita Borisov, and Amogh
Pradeep. The web’s sixth sense: A study of scripts
accessing smartphone sensors. In ACM Conference on
Computer and Communications Security, pages 1515–
1532. ACM, 2018.
[24]
Peter Eckersley. How unique is your web browser? In
Proc. 10th Privacy Enhancing Technologies Symposium
(PETS’10), volume 6205 of LNCS, pages 1–18. Springer,
2010.
[25]
Rob van Eijk, Hadi Asghari, Philipp Winter, and Arvind
Narayanan. The impact of user location on cookie no-
tices (inside and outside of the european union). In Work-
shop on Technology and Consumer Protection (Con-
Pro’19), 2019.
[26]
Steven Englehardt, Jeffrey Han, and Arvind Narayanan.
I never signed up for this! privacy implications of email
tracking. PoPETs, 2018(1):109–126, 2018.
[27]
Steven Englehardt and Arvind Narayanan. Online track-
ing: A 1-million-site measurement and analysis. In
Proc. 23rd ACM SIGSAC Conference on Computer and
Communications Security (CCS’16), pages 1388–1401.
ACM, 2016.
[28]
Steven Englehardt, Dillon Reisman, Christian Eu-
bank, Peter Zimmerman, Jonathan R. Mayer, Arvind
Narayanan, and Edward W. Felten. Cookies that give
you away: The surveillance implications of web track-
ing. In Proc. 24th International Conference on World
Wide Web (WWW’15), pages 289–299. ACM, 2015.
[29]
Imane Fouad, Nataliia Bielova, Arnaud Legout, and
Natasa Sarafijanovic-Djukic. Missed by filter lists: De-
tecting unknown third-party trackers with invisible pix-
els. Proc. Priv. Enhancing Technol., 2020(2):499–518,
2020.
[30]
Imane Fouad, Cristiana Santos, Feras Al Kassar, Na-
taliia Bielova, and Stefano Calzavara. On compliance of
cookie purposes with the purpose specification principle.
In EuroS&P Workshops, pages 326–333. IEEE, 2020.
[31]
Nathaniel Fruchter, Hsin Miao, Scott Stevenson, and
Rebecca Balebako. Variations in tracking in relation to
geographic location. Proc. of the 9th Workshop on Web
2.0 Security and Privacy (W2SP) 2015, 2015.
[32]
Steven Goldfeder, Harry A. Kalodner, Dillon Reis-
man, and Arvind Narayanan. When the cookie meets
the blockchain: Privacy risks of web payments via
cryptocurrencies. Proc. Priv. Enhancing Technol.,
2018(4):179–199, 2018.
[33]
Hélder Gomes, André Zúquete, Gonçalo Paiva Dias, and
Fábio Marques. Usage of HTTPS by municipal websites
in portugal. In WorldCIST (2), volume 931 of Advances
in Intelligent Systems and Computing, pages 155–164.
Springer, 2019.
[34]
Daniel Goßen, Hugo Jonker, Stefan Karsch, Benjamin
Krumnow, and David Roefs. HLISA: towards a more
reliable measurement tool. In Proc. 21st ACM Inter-
net Measurement Conference (IMC’21), pages 380–389.
ACM, 2021.
[35]
Grant Ho, Dan Boneh, Lucas Ballard, and Niels Provos.
Tick tock: Building browser red pills from timing side
channels. In WOOT, pages 1–11. USENIX Association,
2014.
[36]
Xuehui Hu, Guillermo Suarez de Tangil, and Nishanth
Sastry. Multi-country study of third party trackers from
real browser histories. In EuroS&P, pages 70–86. IEEE,
2020.
[37]
Luca Invernizzi, Kurt Thomas, Alexandros Kaprave-
los, Oxana Comanescu, Jean Michel Picod, and Elie
Bursztein. Cloak of visibility: Detecting when machines
browse a different web. In Proc. 37th IEEE Symposium
on Security and Privacy, SP 2016, San Jose, CA, USA,
May 22-26, 2016, pages 743–758, 2016.
[38]
Umar Iqbal, Steven Englehardt, and Zubair Shafiq.
Fingerprinting the fingerprinters: Learning to detect
browser fingerprinting behaviors. In 2021 IEEE Sympo-
sium on Security and Privacy (SP), pages 1143–1161,
2021.
[39]
Hugo Jonker, Benjamin Krumnow, and Gabry Vlot. Fin-
gerprint surface-based detection of web bot detectors. In
Proceedings of 24th European Symposium on Research
in Computer Security (ESORICS’19), LNCS, pages 586–
605. Springer, 2019.
[40]
Jordan Jueckstock and Alexandros Kapravelos. Visi-
blev8: In-browser monitoring of javascript in the wild.
In Proc. 19th ACM Internet Measurement Conference,
pages 393–405. ACM, 2019.
[41]
Jordan Jueckstock, Shaown Sarker, Peter Snyder, Aidan
Beggs, Panagiotis Papadopoulos, Matteo Varvello, Ben
Livshits, and Alexandros Kapravelos. Towards realistic
and reproducible web crawl measurements. In Proc. The
Web Conference 2021 (WWW’21). ACM, 2021.
[42]
Martin Koop, Erik Tews, and Stefan Katzenbeisser. In-
depth evaluation of redirect tracking and link usage.
Proc. Priv. Enhancing Technol., 2020(4):394–413, 2020.
[43]
Michael Kranch and Joseph Bonneau. HTTPS in mid-
air: An empirical study of strict transport security and
key pinning. In Proc. 22nd Network and Distributed
System Security Symposium (NDSS’15). The Internet
Society, 2015.
[44]
Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and
Gildas Avoine. Browser fingerprinting: A survey. ACM
Transactions on the Web (TWEB), 14(2):1–33, 2020.
[45]
Victor Le Pochat, Tom Van Goethem, Samaneh
Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen.
Tranco: A research-oriented top sites ranking hardened
against manipulation. In Proc. 26th Network and Dis-
tributed System Security Symposium (NDSS’19), NDSS
2019, pages 1–15. The Internet Society, 2019.
[46]
Baojun Liu, Zhou Li, Peiyuan Zong, Chaoyi Lu, Hai-
Xin Duan, Ying Liu, Sumayah A. Alrwais, XiaoFeng
Wang, Shuang Hao, Yaoqi Jia, Yiming Zhang, Kai Chen,
and Zaifeng Zhang. Traffickstop: Detecting and measur-
ing illicit traffic monetization through large-scale DNS
analysis. In EuroS&P, pages 560–575. IEEE, 2019.
[47]
Baojun Liu, Zhou Li, Peiyuan Zong, Chaoyi Lu, Hai-
Xin Duan, Ying Liu, Sumayah A. Alrwais, XiaoFeng
Wang, Shuang Hao, Yaoqi Jia, Yiming Zhang, Kai Chen,
and Zaifeng Zhang. Traffickstop: Detecting and measur-
ing illicit traffic monetization through large-scale DNS
analysis. In IEEE European Symposium on Security
and Privacy, EuroS&P 2019, Stockholm, Sweden, June
17-19, 2019, pages 560–575. IEEE, 2019.
[48]
Fang Liu, Chun Wang, Andres Pico, Danfeng Yao, and
Gang Wang. Measuring the insecurity of mobile deep
links of android. In USENIX Security Symposium, pages
953–969. USENIX Association, 2017.
[49]
Max Maass, Stephan Schwär, and Matthias Hollick. To-
wards transparency in email tracking. In APF, volume
11498 of Lecture Notes in Computer Science, pages 18–
27. Springer, 2019.
[50]
Max Maaß, Pascal Wichmann, Henning Pridöhl, and
Dominik Herrmann. Privacyscore: Improving privacy
and security via crowd-sourced benchmarks of websites.
In APF, volume 10518 of Lecture Notes in Computer
Science, pages 178–191. Springer, 2017.
[51]
Arunesh Mathur, Gunes Acar, Michael Friedman, Elena
Lucherini, Jonathan R. Mayer, Marshini Chetty, and
Arvind Narayanan. Dark patterns at scale: Findings
from a crawl of 11k shopping websites. Proc. ACM
Hum. Comput. Interact., 3(CSCW):81:1–81:32, 2019.
[52]
Johan Mazel, Richard Garnier, and Kensuke Fukuda. A
comparison of web privacy protection techniques. Com-
put. Commun., 144:162–174, 2019.
[53]
Najmeh Miramirkhani, Oleksii Starov, and Nick Niki-
forakis. Dial one for scam: A large-scale analysis of
technical support scams. In Proc. 24th Network and Dis-
tributed System Security Symposium (NDSS’17). The
Internet Society, 2017.
[54]
Lukasz Olejnik, Steven Englehardt, and Arvind
Narayanan. Battery status not included: Assessing
privacy in web standards. In IWPE@SP, volume
1873 of CEUR Workshop Proceedings, pages 17–24.
CEUR-WS.org, 2017.
[55]
Shahrooz Pouryousef, Muhammad Daniyal Dar, Sule-
man Ahmad, Phillipa Gill, and Rishab Nithyanand. Ex-
tortion or expansion? an investigation into the costs and
consequences of ICANN’s gTLD experiments. In PAM,
volume 12048 of Lecture Notes in Computer Science,
pages 141–157. Springer, 2020.
[56]
Tarun Ramadorai, Antoine Uettwiller, and Ansgar
Walther. The market for data privacy. Technical re-
port, 2019.
[57]
Andrew Reed and Michael J. Kranch. Identifying https-
protected netflix videos in real-time. In CODASPY,
pages 361–368. ACM, 2017.
[58]
Valentino Rizzo, Stefano Traverso, and Marco Mellia.
Unveiling web fingerprinting in the wild via code mining
and machine learning. Proc. Priv. Enhancing Technol.,
2021(1):43–63, 2021.
[59]
Nicky Robinson and Joseph Bonneau. Cognitive discon-
nect: understanding facebook connect login permissions.
In COSN, pages 247–258. ACM, 2014.
[60]
Takahito Sakamoto and Masahiro Matsunaga. After
gdpr, still tracking or not? understanding opt-out states
for online behavioral advertising. In IEEE Symposium
on Security and Privacy Workshops, pages 92–99. IEEE,
2019.
[61]
Nayanamana Samarasinghe and Mohammad Mannan.
Towards a global perspective on web tracking. Comput.
Secur., 87, 2019.
[62]
Steven Schmeiser. Online advertising networks and
consumer perceptions of privacy. Applied Economics
Letters, 25(11):776–780, 2017.
[63]
Michael Schwarz, Florian Lackner, and Daniel Gruss.
Javascript template attacks: Automatically inferring host
information for targeted exploits. In Proc. 26th Annual
Network and Distributed System Security Symposium
(NDSS’19). The Internet Society, 2019.
[64]
Sergey Shekyan. Detecting PhantomJS based visitors.
https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/,
2015. last access: May 19, 2022.
[65]
Suphannee Sivakorn, Jason Polakis, and Angelos D
Keromytis. I’m not a human: Breaking the google re-
captcha. Black Hat, pages 1–12, 2016.
[66]
Ido Sivan-Sevilla, Wenyi Chu, Xiaoyu Liang, and Helen
Nissenbaum. Unaccounted privacy violation: A compar-
ative analysis of persistent identification of users across
social contexts. 2021.
[67]
Peter Snyder, Cynthia Bagier Taylor, and Chris Kanich.
Most websites don’t need to vibrate: A cost-benefit ap-
proach to improving browser security. In Proc. 24th
ACM SIGSAC Conference on Computer and Communi-
cations Security (CCS’17), pages 179–194. ACM, 2017.
[68]
Konstantinos Solomos, Panagiotis Ilia, Sotiris Ioannidis,
and Nicolas Kourtellis. TALON: an automated frame-
work for cross-device tracking detection. In RAID, pages
227–241. USENIX Association, 2019.
[69]
Konstantinos Solomos, Panagiotis Ilia, and Nicolas
Kourtellis. Clash of the trackers: Measuring the evolu-
tion of the online tracking ecosystem. In TMA. IFIP,
2020.
[70]
Jannick Kirk Sørensen and Sokol Kosta. Before and
after GDPR: the changes in third party presence at pub-
lic and private european websites. In Proc. The Web
Conference 2019, pages 1590–1600. ACM, 2019.
[71]
Oleksii Starov, Johannes Dahse, Syed Sharique Ahmad,
Thorsten Holz, and Nick Nikiforakis. No honor among
thieves: A large-scale analysis of malicious web shells.
In Proc. 25th International Conference on World Wide
Web (WWW’16), pages 1021–1032. ACM, 2016.
[72]
Christof Ferreira Torres, Hugo L. Jonker, and Sjouke
Mauw. Fp-block: Usable web privacy by controlling
browser fingerprinting. In Proc. 20th European Sympo-
sium on Research in Computer Security (ESORICS’15),
Proceedings, Part II, volume 9327 of LNCS, pages 3–19.
Springer, 2015.
[73]
Tobias Urban, Martin Degeling, Thorsten Holz, and Norbert
Pohlmann. Beyond the front page: measuring third
party dynamics in the field. In Proc. The Web Confer-
ence 2020, WWW ’20, page 1275–1286. Association
for Computing Machinery, 2020.
[74]
Tobias Urban, Dennis Tatang, Martin Degeling,
Thorsten Holz, and Norbert Pohlmann. A study on
subject data access in online advertising after the GDPR.
In DPM/CBT@ESORICS, volume 11737 of Lecture
Notes in Computer Science, pages 61–79. Springer,
2019.
[75]
Tobias Urban, Dennis Tatang, Martin Degeling,
Thorsten Holz, and Norbert Pohlmann. Measuring the
impact of the GDPR on data sharing in ad networks. In
AsiaCCS, pages 222–235. ACM, 2020.
[76]
Pelayo Vallina, Álvaro Feal, Julien Gamba, Narseo
Vallina-Rodriguez, and Antonio Fernández Anta. Tales
from the porn: A comprehensive privacy analysis of
the web porn ecosystem. In Proc. 19th ACM Internet
Measurement Conference, pages 245–258. ACM, 2019.
[77]
Steven Van Acker, Daniel Hausknecht, and Andrei
Sabelfeld. Raising the bar: Evaluating origin-wide secu-
rity manifests. In ACSAC, pages 342–354. ACM, 2018.
[78]
Rob Van Eijk, Hadi Asghari, Philipp Winter, and Arvind
Narayanan. The impact of user location on cookie notices
(inside and outside of the european union). In Workshop
on Technology and Consumer Protection (ConPro’19).
IEEE, 2019.
[79]
Antoine Vastel. Detecting Chrome headless, new
techniques.
https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html,
2018. last access: May 19, 2022.
[80]
Antoine Vastel, Walter Rudametkin, Romain Rouvoy,
and Xavier Blanc. FP-Crawlers: Studying the Resilience
of Browser Fingerprinting to Block Crawlers. In Proc.
2nd NDSS Workshop on Measurements, Attacks, and
Defenses for the Web (MADWEB’20), pages 2–14, 2020.
[81]
Zhiju Yang and Chuan Yue. A comparative measure-
ment study of web tracking on mobile and desktop envi-
ronments. Proc. Priv. Enhancing Technol., 2020(2):24–
44, 2020.
[82]
David Zeber, Sarah Bird, Camila Oliveira, Walter
Rudametkin, Ilana Segall, Fredrik Wollsén, and Mar-
tin Lopatka. The representativeness of automated web
crawls as a surrogate for human browsing. In Proc.
The Web Conference 2020, WWW ’20, page 167–178.
Association for Computing Machinery, 2020.
A Patterns for common first-party detectors
Table 10 shows the patterns we found in our first-party script
analysis from Section 5.3. Scripts provided by Akamai, Incapsula,
Cloudflare, and PerimeterX each follow a fixed URL path pattern,
so they can be easily recognised. For the unknown script, we
found common path patterns between larger clusters of script
hashes. A manual validation showed that scripts found under the
listed paths are largely similar.
Table 10: Similarities in first-party detectors
Origin      URL path similarities                           # sites
Akamai      domain/akam/11/...                              1,004
Incapsula   domain/_Incapsula_Resource?...                  998
Unknown     domain/asssets/{hash of 31-32 bits length}      659
            domain/resources/{hash of 32-33 bits length}
            domain/public/{hash of 32-33 bits length}
            domain/static/{hash of 34 bits length}
Cloudflare  domain/.../cdn-cgi/bm/cv/2172558837/api.js      486
PerimeterX  domain/.../{8 character string}/init.js         134
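The path similarities in Table 10 could be turned into a simple URL classifier. The following is a minimal sketch, not the tooling used in the paper: the regular expressions are our own reading of the table (elided parts are treated as arbitrary prefixes or suffixes), the function name `classify` and the example URLs are hypothetical, and the "Unknown" group is left out because its hash format is not specified precisely enough.

```python
import re
from urllib.parse import urlsplit

# Assumed regexes derived from the path similarities in Table 10.
# Order matters only if a path could match more than one pattern.
DETECTOR_PATTERNS = [
    ("Akamai", re.compile(r"^/akam/11/")),
    ("Incapsula", re.compile(r"^/_Incapsula_Resource$")),
    ("Cloudflare", re.compile(r"/cdn-cgi/bm/cv/2172558837/api\.js$")),
    ("PerimeterX", re.compile(r"/[^/]{8}/init\.js$")),
]


def classify(url):
    """Return the detector origin whose path pattern matches, or None."""
    path = urlsplit(url).path  # the query string is ignored
    for origin, pattern in DETECTOR_PATTERNS:
        if pattern.search(path):
            return origin
    return None


if __name__ == "__main__":
    # Made-up example URLs, for illustration only.
    for url in (
        "https://example.com/akam/11/abc123",
        "https://example.com/_Incapsula_Resource?x=1",
        "https://example.com/a/AbCdEfGh/init.js",
        "https://example.com/index.html",
    ):
        print(classify(url), url)
```

Matching on the path alone keeps the sketch independent of the first-party domain, which is what makes these scripts recognisable across sites.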
B Adoption of new Firefox versions by OpenWPM
Releases of OpenWPM do not appear synchronously with
Firefox releases. As a result, there are time frames in which
the OpenWPM client uses an older Firefox version than regular
users. Table 11 summarises the migration of Firefox versions
in the OpenWPM framework since version 0.10.0. The period
between the release of Firefox 77 (June 2020) and the release
of OpenWPM v0.18.0 amounts to 561 days. Within this period,
OpenWPM shipped with an outdated Firefox version on 398 days
(71%).
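The length of the observation period can be checked against the dates in Table 11. The sketch below assumes the table uses the mm/dd/yy date format; the helper name `parse_table_date` is our own. The figure of 398 outdated days is taken from the text as given and is not recomputed here.

```python
from datetime import datetime


def parse_table_date(text):
    """Parse a date written as mm/dd/yy (assumed format of Table 11)."""
    return datetime.strptime(text, "%m/%d/%y").date()


firefox_77_release = parse_table_date("06/03/20")    # Firefox 77.0
openwpm_0_18_release = parse_table_date("12/16/21")  # OpenWPM 0.18.0

period_days = (openwpm_0_18_release - firefox_77_release).days
outdated_days = 398  # reported in the text, not derived in this sketch
share = outdated_days / period_days

print(period_days)     # 561
print(f"{share:.0%}")  # 71%
```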
Table 11: Integration of Firefox releases into OpenWPM
Firefox  release date  OpenWPM  integration date  Outdated
95.0     12/07/21      0.18.0   12/16/21          69 days
94.0     11/02/21
93.0     10/05/21
92.0     09/07/21
91.0     08/10/21
90.0     07/13/21      0.17.0   07/24/21          11 days
89.0     06/01/21      0.16.0   06/10/21          9 days
88.0     04/19/21      0.15.0   05/10/21          48 days
87.0     03/23/21
86.0.1   03/11/21      0.14.0   03/12/21          87 days
85.0     01/26/21
84.0     12/15/20
83.0     11/18/20      0.13.0   11/19/20          58 days
82.0     10/20/20
81.0     09/22/20
80.0     08/25/20      0.12.0   08/26/20          29 days
79.0     07/28/20
78.0.1   07/01/20      0.11.0   07/09/20          8 days
77.0     06/03/20      0.10.0   06/23/20          20 days
C Patterns used in static analysis
We iterated on the pattern design to reduce false positives. Our
very first run used patterns that match strings literally. However,
in the specific case of matching the term webdriver, we found
that this selects scripts that use this word in a context other
than checking for Selenium-driven Firefox browsers (cf. [39,
40] for conflicting bot detection properties with this term).
In the next iteration, we used patterns that take the context
of the access to a property into account. For example, the
pattern navigator\[["']webdriver["']\] only matches if the
webdriver property is accessed via the navigator object.
Table 12 lists the patterns we explored. Finally, we manually
checked a random subset of matches to assess pattern performance.
Only one pattern still introduced false positives; all its matches
were manually validated and false positives eliminated.
Table 12: Patterns evaluated in static analysis
Pattern                           false positives found
webdriver                         X
instrumentFingerprintingApis      -
getInstrumentJS                   -
jsInstruments                     -
(?<!_|-)webdriver(?!_|-)          X
navigator.webdriver               -
navigator\[["']webdriver["']\]    -
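To illustrate how the webdriver patterns from Table 12 differ in selectivity, the following sketch applies them to a few script snippets using Python's `re` module. The patterns are copied from the table; the dictionary keys, the function name `matching_patterns`, and the sample snippets are made up for illustration and do not stem from the crawled data.

```python
import re

# Patterns copied from Table 12; the dictionary keys are our own labels.
PATTERNS = {
    "literal": r"webdriver",
    "delimited": r"(?<!_|-)webdriver(?!_|-)",
    "dot_access": r"navigator.webdriver",
    "bracket_access": r"""navigator\[["']webdriver["']\]""",
}


def matching_patterns(script_source):
    """Return the labels of all patterns that occur in the script text."""
    return [
        label
        for label, pattern in PATTERNS.items()
        if re.search(pattern, script_source)
    ]


if __name__ == "__main__":
    # Hypothetical script snippets.
    samples = [
        'if (navigator["webdriver"]) { flag(); }',
        "if (navigator.webdriver) { flag(); }",
        'var lib = "selenium-webdriver";',
    ]
    for sample in samples:
        print(matching_patterns(sample), sample)
```

In this sketch, the last snippet is selected only by the literal pattern: the look-around pattern rejects it because the term is preceded by a hyphen, and the two navigator patterns require an access via the navigator object.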
D Previous studies relying on OpenWPM
Table 13 provides a detailed view of our analysis of previous
peer-reviewed studies based on OpenWPM. Each category
that applies to a study is marked with an “X”. Studies that
measure certain aspects but rely on out-of-band mechanisms
(e.g., by deploying a proxy) instead of OpenWPM’s
instrumentation are marked with a “ ”. Running
modes are abbreviated in the table as follows: unspecified (u),
native (n), headless (h), xvfb (x), docker (d), virtual machine
(v). Papers that are not included in the seed list, but were
added by us, are highlighted with a “?”.
Table 13: Overview of previous studies using OpenWPM for web studies
deployed as measures uses visits uses mentions
Year Ref. 1st Author Mode VM Cookies HTTP JS Scrolling Clicking Typing Sub-pages Anti-BD BD
2014 [2] Acar u X X
[59] Robinson u X X
2015 [28] Englehardt u X X X
[43] Kranch u X
[7] Altaweel h X X X X
[31] Fruchter u X X X
2016 [8] Andersdotter u X X X
[27] Englehardt x X X X X X
[71] Starov u X X
2017 [53] Miramirkhani u X X X
[11] Brookman u X X X
[57] Reed u X
[54] Olejnik u X
[50] Maass u X X
[48] Liu h
[62] Schmeiser u X
2018 [32] Goldfeder u X
[26] Englehardt u X X X X X
[9] Binns h X X
[23] Das u X X X
[77] van Acker u X
[22] Dao u X
2019 [19] Cozza u X X X X
[33] Gomes u X
[78] van Eijk d
[70] Sørensen u X X X
[47] Liu u X X
[51] Mathur u X X X
[56] Ramadorai u X X
[52] Mazel u X
[6] Ali u X
[61] Samarasinghe u X X X
[49] Maass u X
[68] Solomos u X X
[76] Vallina u X X X
[39] Jonker h X X
[74] Urban u X X X
[60] Sakamoto u X
2020 [29] Fouad u X X X
[18] Cook u X X X
[81] Yang u X X X X
[1] Acar u X X X X X X
[42] Koop d X X X X X
[82] Zeber n/x X X X X X
[5] Ahmad u X X X
[4] Agarwal h X X X X
[73] Urban u X X X X X X X
[75] Urban u X X X X X X X
[55] Pouryousef u X
[30] Fouad u X X
[66] Sivan-Sevilla u X X X X X X
[36] Hu u X X
[20] Dao u X
[69] Solomos n X X
[21] Dao u X
2021 [13] Calzavara u X X X X
[58] Rizzo u X X X X
[38] Iqbal u X X
[34] Goßen ? n X X X X X
2022 [14] Cassel ? u X