Analysing and strengthening OpenWPM’s reliability
Benjamin Krumnow
TH Köln,
Open University Netherlands
Hugo Jonker
Open University Netherlands,
Radboud University
Stefan Karsch
TH Köln
Automated browsers are widely used to study the web at scale.
Their premise is that they measure what regular browsers
would encounter on the web. In practice, deviations due to
detection of automation have been found. To what extent
automated browsers can be improved to reduce such devia-
tions has so far not been investigated in detail. In this paper,
we investigate this for a specific web automation framework:
OpenWPM, a popular research framework specifically de-
signed to study web privacy. We analyse (1) detectability of
OpenWPM, (2) prevalence of OpenWPM detection, and (3)
integrity of OpenWPM’s data recording.
Our analysis reveals OpenWPM is easily detectable. We
measure to what extent fingerprint-based detection is already
leveraged against OpenWPM clients on 100,000 sites and
observe that it is commonly detected (14% of front pages).
Moreover, we discover integrated routines in scripts to specif-
ically detect OpenWPM clients. Our investigation of Open-
WPM’s data recording integrity identifies novel evasion tech-
niques and previously unknown attacks against OpenWPM’s
instrumentation. We investigate and develop mitigations to
address the identified issues. In conclusion, we find that re-
liability of automation frameworks should not be taken for
granted. Identifiability of such frameworks should be studied,
and mitigations deployed, to improve reliability.
1 Introduction
Web studies rely on browser automation frameworks to ac-
crue data over thousands of sites. The goal of such stud-
ies is to provide a view on what regular visitors would en-
counter on the web. This relies on an (often unstated) as-
sumption that the data as collected is representative of what
a regular, human-controlled browser would encounter. Previ-
ous works [5, 14, 39,41] have shown that this is not always
the case: websites have been found to omit content (adver-
tisements, video, JavaScript execution, login forms, etc.) or
require completion of a CAPTCHA for automated clients.
While detectability of automated browsers has been discussed
online in various blogs [3, 64, 79] and discussion forums, to
the best of our knowledge, so far, no in-depth academic study
on the reliability of measurement frameworks built upon such
components has been performed.
In this work, we study the OpenWPM framework [27] for
measuring web privacy. To date, OpenWPM has been used
in at least 76 studies, 60 of which resulted in peer-reviewed
publications. As such, its fidelity, that is, the extent to which
the web OpenWPM encounters is the web as seen by other
web clients, is essential. This point is not lost on its users:
several studies explicitly mention bot detection as a possible
threat to the validity of their measurements [12,46,73].
Recent studies investigated proliferation of generic bot de-
tection [39,40] and found over 10% of websites employing
such techniques. However, it is not clear how these findings
translate to OpenWPM. On the one hand, OpenWPM uses a
normal browser for collecting data, making it harder to dis-
tinguish from other visitors. On the other, OpenWPM targets
security and privacy research, an area where malicious actors
are to be expected [22,71]. It is not clear on how many sites
OpenWPM could be detected, nor is it clear what effect such
detection may have.
In this paper, we address this point. That is, we investigate
to what extent OpenWPM provides a reliable record of how
websites behave towards any web client. First and foremost,
this necessitates understanding how OpenWPM can be distin-
guished from other clients. Building on that knowledge, we
provide a first estimate of how many websites are able to dis-
tinguish OpenWPM from human visitors, finding that there
are even several websites that can distinguish OpenWPM
from other web bots. This allows these sites to use cloaking:
responding differently to different clients. Secondly, we inves-
tigate whether a website can actively attack OpenWPM’s data
collecting functionality. We find several new ways in which
a malicious website can attack OpenWPM’s data recording.
arXiv:2205.08890v1 [cs.CR] 18 May 2022

Figure 1: Components of the OpenWPM framework

While both cloaking and data recording attacks are possible,
the question remains whether OpenWPM-detecting sites employ
such tactics. Recent work by Cassel et al. [14] finds
that Selenium-based bots receive far less third-party traffic,
which indicates cloaking happens in practice. We develop a
stealth extension for OpenWPM to prevent cloaking and test
its performance on sites where OpenWPM-detectors were found.
Contributions. Our main contributions are:
• (Sec. 4) We provide the first analysis of OpenWPM’s
  detectability based on both conventional fingerprinting [39]
  and template attack [63] techniques. We find previously
  unreported, identifiable properties for every mode of
  running OpenWPM (headless, Xvfb, etc.), which even allow
  distinguishing between these modes.
• (Sec. 5) We look for bot detectors that probe these
  properties in the Tranco Top 100K sites, using both static and
  dynamic analysis. We find a drastic increase in Selenium-based
  bot detection. In addition, we find detectors in the
  wild specifically targeting OpenWPM clients.
• (Sec. 6) We explore how sites can attack OpenWPM’s
  data collection. We find various attack vectors targeting
  OpenWPM’s most commonly used instruments and
  implement proof-of-concept attacks for these.
• (Sec. 7) We harden OpenWPM against poisoning attacks
  and detection. Our hardening hides all identifiable
  properties when run in native mode and addresses the
  identified attacks against OpenWPM’s instrumentation.
  We evaluate its performance against vanilla OpenWPM.
  The number of cookies received is severely impacted.
  Conversely, ads/tracker traffic is hardly impacted.
2 Background
OpenWPM is a valuable tool for web re-
searchers, as it offers increased stability, fidelity and easy
access to measurement functionality on top of a browser au-
tomation framework (Selenium + WebDriver). The frame-
work can be run under either Ubuntu or macOS. It consists of
four parts (cf. Figure 1): a web client, automation components,
instrumentation for measurements, and a framework. As a
web client, OpenWPM uses an unbranded Firefox browser.
In contrast to a regular Firefox browser, this allows running
unsigned browser extensions. The various measurement in-
struments are implemented as browser extensions. They fa-
cilitate recording various website aspects, such as JavaScript
calls or HTTP traffic. The last part is the framework, which
acts as the conductor. Its purpose is to control browsers and
data collection. It also adds much-needed functionality, such
as monitoring for browser crashes and liveness, restoring
after failures, loading input data, etc.
Use of OpenWPM in previous studies.
To understand
how OpenWPM is being used, we review the different stud-
ies performed to date with OpenWPM. In December 2021,
76 works, of which 57 peer-reviewed, were listed as using
OpenWPM. We further add two recent studies that had not yet
been listed. For each study, we check the following: what is
measured, whether subpages are visited, whether interaction
is used, and what run mode is used. Table 1 summarises our
findings. Appendix D breaks down our findings for each study.
The measures category tallies how many studies used
OpenWPM’s various measurement instruments: HTTP traffic,
cookies, and JavaScript. Each of these measures may be im-
pacted individually due to bot detection. Interestingly, while
most studies use OpenWPM to record HTTP traffic, a few
(e.g. [18,25,48,68]) have used it for automation rather than as a
measurement tool. These are tallied under ‘other’ in Table 1.
The other categories pertain to aspects that may impact de-
tectability. In each case, it is currently not known whether
these play a role in bot detection. With respect to the interac-
tion category, we note that no study mentioned implementing
interaction mechanisms. Therefore, we assume all studies
used OpenWPM’s default interaction functionality.
With respect to the run mode category, note that not all
studies provide information about this. Nevertheless, the used
run mode may impact detectability (e.g. [35]) and thus should
be considered. We therefore consider all currently supported
run modes:
a. unspecified: study does not specify mode,
b. regular: study uses a full Firefox browser,
c. headless: study uses Firefox without a GUI,
d. Xvfb: as regular, with visual output redirected to a buffer,
e. Docker: study runs OpenWPM within a Docker container,
f. virtualisation: study uses virtual machines, possibly in
   cloud infrastructure.
Lastly, we track whether the studies considered bot detec-
tion at all and, if so, whether they used OpenWPM’s built-in
anti-detection features. Aside from studies investigating bot
detection directly, only very few consider fingerprinting [73]
or cloaking [12,46] as a potential risk for valid results.
Table 1: Measurement characteristics in 60 peer-reviewed
studies that are built upon OpenWPM

Category              Studies
Measures
– HTTP traffic             49
– cookies                  30
– JavaScript               17
– other                     5
Run mode
– unspecified              49
– virtualisation           14
– headless                  5
– regular mode              3
– Docker                    2
– Xvfb                      2
Interaction
– no interaction           48
– clicking                  9
– scrolling                 7
– typing                    4
Subpages
– not visited              45
– visited                  15
Bot detection
– ignored                  46
– discussed                14
– uses mitigation           7
3 Related Work
Determining the fingerprint surface of web bots.
Browser fingerprinting [24] has been studied extensively in
the context of user tracking, as recently summarised in [44].
The idea of using fingerprinting to identify certain client com-
ponents (such as automation frameworks) has gained more
attention recently. Vastel [79] and Shekyan [64] conducted
manual investigations of headless browsers to pin down identifiable
properties of these frameworks. Jonker et al. [39] automated
the search for identifiable properties by using a browser
fingerprinting library. They compared properties of regular
browsers against properties of bots that belong to the same
engine class. In contrast, Schwarz et al. [63] applied a new
form of fingerprinting (JavaScript template attacks) to perform
client-side vulnerability scanning. To create a template,
they traverse the object hierarchy and store characteristics
of each object. Later on, templates can be compared to determine
the differences. Finally, Vastel et al. [80] inspected bot
detectors in the wild to collect known identifiable properties.
They used these to systematically test the responses by bot
detectors to their changes.
Our work combines both automated approaches, by
Jonker et al. and Schwarz et al., to explore the fingerprint
surface of OpenWPM. We apply these systematically to the
various run modes of OpenWPM clients, uncovering distin-
guishers for each mode.
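The template idea described above (traverse the object hierarchy, store characteristics of each object, then compare templates) can be illustrated with a minimal sketch. This is not the implementation of [63]; the function names, the depth limit, and the choice to record only the type and primitive value of each property are assumptions made for illustration.

```javascript
// Describe a value coarsely: type only for objects/functions,
// type plus printed value for primitives (an illustrative choice).
function describe(value) {
  const t = typeof value;
  if (value !== null && (t === "object" || t === "function")) return t;
  return t + ":" + String(value);
}

// Build a template: a Map from property path to description.
function buildTemplate(root, maxDepth = 3) {
  const template = new Map();
  const seen = new Set(); // guards against cycles
  function visit(obj, path, depth) {
    const t = typeof obj;
    if (obj === null || (t !== "object" && t !== "function")) return;
    if (seen.has(obj) || depth > maxDepth) return;
    seen.add(obj);
    for (const name of Object.getOwnPropertyNames(obj)) {
      const childPath = path + "." + name;
      let value;
      try {
        value = obj[name];
      } catch (err) {
        template.set(childPath, "throws"); // some getters throw on access
        continue;
      }
      template.set(childPath, describe(value));
      visit(value, childPath, depth + 1);
    }
  }
  visit(root, "root", 0);
  return template;
}

// Compare two templates and report where the second one deviates.
function diffTemplates(reference, candidate) {
  const missing = [];
  const added = [];
  const changed = [];
  for (const [path, desc] of reference) {
    if (!candidate.has(path)) missing.push(path);
    else if (candidate.get(path) !== desc) changed.push(path);
  }
  for (const path of candidate.keys()) {
    if (!reference.has(path)) added.push(path);
  }
  return { missing, added, changed };
}
```

In a browser, `root` would be `window`; building one template in a regular client and one in an automated client and diffing them would then yield candidate distinguishers, which still need manual vetting.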
Measuring bot detection in the wild.
Two studies exist
that carried out a large-scale investigation of the existence of
unknown fingerprint-based bot detectors. Jonker et al. [39]
scanned 1M websites gathering statically included scripts
and analysing these using static code analysis. Shortly after,
Jueckstock and Kapravelos [40] presented a similar experi-
ment using dynamic script collection and dynamic analysis.
Their presented tool relies on a modified V8 engine to instru-
ment browser functions.
Reliability of scraping results.
Recently, multiple studies
have been conducted that explore differences between various
automated clients, and also between automated clients and
human-driven clients in website responses. Ahmad et al. [5]
investigate response differences between three classes of bots
(HTTP engine tools, headless browsers, and automated native
browsers). They found that while HTTP engine tools miss
many important resources, they more often pass bot detection
than the other two classes. Jueckstock et al. [41] studied dif-
ferences between headless Chrome and regular Chrome. For
regular Chrome, they used a puppeteer-plugin which hides dis-
tinguishable properties in Chrome to focus on bot detection.
Their results reinforce previous recommendations [27, 79]
to not use headless browsers. Zeber et al. [82] contrast data
from human users with OpenWPM clients. In their study,
OpenWPM clients encountered three times more tracking
domains and had more interaction with third-party domains
than human-controlled browsers. Cassel et al. [14] investigate
the reliability of emulated browsers. To avoid bot detection,
they created their own tooling to remotely control a browser.
Interestingly, their observations show the opposite of Zeber et
al.’s findings. They observed 84% less third-party traffic for a
Selenium-driven vs. a non-Selenium-driven Firefox browser.
This contradiction shows that there is yet no consistent picture
for the influence of bot detection on measurements. Further
investigation to resolve this conundrum is needed. Any such
investigation necessitates tooling that can evade bot detection.
We aim to develop such a tool for OpenWPM.
4 Fingerprint surface of OpenWPM
We begin by addressing the research question: how can OpenWPM
be distinguished from human-controlled web clients?
In general, a web site operator looking to identify OpenWPM
clients can either probe for identifiable properties (i.e., fin-
gerprinting), or attempt to recognise OpenWPM’s interaction.
The latter is due to Selenium, whose interaction was studied
in detail by Goßen et al. [34]. Those results fully carry over to
OpenWPM. This leaves uncertainty about how OpenWPM’s
fingerprint distinguishes it from other clients and other bots.
In line with previous works, we call that part of a browser
fingerprint that distinguishes a certain type of client from other
types the fingerprint surface [72]. Determining the fingerprint
surface of an OpenWPM client requires a way to find its prop-
erties that deviate from properties and values in other clients.
Table 2: Summary of deviating properties of each OpenWPM
setup contrasted with OpenWPM’s Firefox version

                                macOS            Ubuntu         Docker
                                RM     HM     RM     HM   Xvfb
navigator.webdriver is true      X      X      X      X      X      X
screen dimension prop.           X      X      X      X      X      X
screen position prop.            X      X      X      X      X      X
font enumeration                                                    X
timezone is 0                                                       X
navigator.languages prop.              43            43
deviating WebGL prop.                2037          2061     18     27
With instrumentation:
- through tampering           +253   +253   +252   +252   +252   +252
- added custom functions        +1     +1     +1     +1     +1     +1

RM: Regular mode; HM: Headless mode; Xvfb: X virtual frame buffer

Jonker et al. [39] showed that it suffices to consider differences
within the client’s ‘browser family’, that is, fingerprint differences
with those clients who use the same rendering engine
and JavaScript engine. By comparing the results for multiple
clients of the same browser family, differences unique to each
client are brought to light. In previous works, two approaches
for browser fingerprinting were used: probing a specific list
of properties [39], or using an automated approach for DOM
traversal [63]. While there is overlap between the results of
these methods, neither yields a superset of the other.
We combine the results of both approaches to determine the
fingerprint surface.
4.1 RQ1: How recognisable is OpenWPM?
We determine OpenWPM’s fingerprint surface by compar-
ing its client to a standalone version of the same Firefox
browser. Any differences must originate in the hosting envi-
ronment, the framework itself, the base implementation, the
added automation, or measurement components. To account
for possible effects of the various run modes of OpenWPM
on the fingerprint surface, we determine variations for each
setup on Ubuntu and macOS. Table 2 summarises identify-
ing properties found in the current version for each mode. In
addition to ways to recognise OpenWPM’s instrumentation,
we also identify ways to recognise Selenium and WebDriver,
display-less scraping (headless or Xvfb mode), and use from
within a virtual environment. Thus, every mode of running
OpenWPM is identifiable as a web bot.
Recognising automation components.
We found three
identification measures that may be applied against all modes
of OpenWPM. First, the navigator.webdriver property
indicates a browser controlled via the WebDriver interface.
Second, screen properties use standard values and cannot be
changed from OpenWPM (see Table 3). For example, the
browser window position for macOS is 4px from the top and
23px from the left. On macOS, all browser instances will use
the same absolute coordinates, while on Ubuntu, each window
is shifted by the same offset, when using regular mode.
Table 3: Screen properties for various configurations

OS      Mode      Resolution   Window      X   Y   Offset (x, y)
macOS   Regular   2560 x 1440  1366 x 683  23  4   0, 0
        Headless  1366 x 768   1366 x 683  4   4   0, 0
Ubuntu  Regular   2560 x 1440  1366 x 683  80  35  8, 8
        Headless  1366 x 768   1366 x 683  0   0   0, 0
        Xvfb      1366 x 768   1366 x 683  0   0   0, 0
        Docker    2560 x 1440  1366 x 683  0   0   0, 0
Identification of missing displays.
Suppressing output to
display (by using Xvfb, headless, or Docker) adds a signifi-
cant number of differences. In headless mode, the lack of a
WebGL implementation leads to thousands of missing properties.
We also observe that this mode adds 43 new properties
to the navigator.languages object. Xvfb mode uses a regular
Firefox browser, which contains WebGL functionality.
Nevertheless, Xvfb mode causes 5 changed and 13 missing
properties. Interestingly, both headless and Xvfb mode allow
the detection of missing user elements by accessing the prop-
erty screen.availTop. This describes the first y-coordinate that
does not belong to the user interface. In display-less modes,
this is always zero, while regular browsers have larger values.
Traces of virtualisation.
Using OpenWPM’s Docker container
causes the WebGL vendor property to contain the term
VMware, Inc. (cf. Table 4) – clear evidence for the use of
virtualisation. In addition, the Docker environment reduces
the number of available JavaScript fonts to one (Bitstream
Vera Sans Mono) and does not provide information about the
time zone.
Table 4: Selected deviations in display-less modes on Ubuntu

Mode    WebGL vendors                                avail{Top|Left}
HM      Null                                         0, 0
Xvfb    Mesa/ llvmpipe (LLVM 12.0.0, ...)            0, 0
Docker  VMware, Inc. llvmpipe (LLVM 10.0.0, ...)     27, 72
Detecting instrumentation.
We checked if using any of
OpenWPM’s various instruments has any effect on its fingerprint
surface. The only differences occur when using the
JavaScript instrument. First, this instrument overwrites certain
of the browser’s standard JavaScript objects, which can
be detected by using the toString function of a function
or object (see Listing 1). Second, the instrument adds a
function to the window object, which is not present in any
common desktop browser (Firefox, Safari, Chrome, Edge,
Opera). Third, OpenWPM’s wrapper functions can be found
in stack traces. For that, a script needs to provoke an error
in any overwritten function and catch the stack trace to
successfully identify a modification by OpenWPM. Lastly, the
instrument ‘pollutes’ prototypes along the prototype chain of
an object. Instrumenting is done by changing the prototype of
an object, as well as all its ancestor prototypes. However, the
properties of later ancestor prototypes are all added to the first
ancestor prototype (cf. Fig. 2). This distinguishes a visitor
with instrumentation from one without.

Figure 2: Properties in (A) an original object and (B) an object
polluted by the instrumentation.
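Two of the checks described above, the toString test and the stack-trace test, can be sketched in a few lines. This is a minimal illustration rather than the detector used in this paper: the helper names and the wrapper `hypotheticalWrapper` are invented stand-ins, and the real wrapper in OpenWPM is named differently.

```javascript
// Built-in functions print a "[native code]" body; wrappers print their source.
const NATIVE_BODY = /\{\s*\[native code\]\s*\}\s*$/;

function looksNative(fn) {
  // Use the original toString so an overridden fn.toString is bypassed.
  return NATIVE_BODY.test(Function.prototype.toString.call(fn));
}

// Provoke an error and check whether the stack trace names a given frame.
function stackMentions(provokeError, marker) {
  try {
    provokeError();
  } catch (err) {
    return String(err && err.stack).includes(marker);
  }
  return false; // no error was thrown, so there is nothing to inspect
}

// Hypothetical stand-in for an instrumented method: it forwards the call,
// which is what makes it visible via toString and in stack traces.
function wrapLikeAnInstrument(original) {
  return function hypotheticalWrapper() {
    return original.apply(this, arguments);
  };
}
```

A page script would apply `looksNative` to a function it expects to be built in, and would pass `stackMentions` a call that is known to throw inside that function.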
We validate whether the identified fingerprint
surface works in practice to identify OpenWPM. For that,
we implemented an OpenWPM detector that uses four tests
to identify OpenWPM amongst web clients: (1) test for the
presence of a DOM property, (2) test for a missing DOM
property, (3) test if a native function was overwritten, (4)
compare a DOM property with an expected value.
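The four test kinds can be expressed as a small table-driven checker. The sketch below is an assumption about how such a detector might be organised, not the detector built for this paper; `runChecks`, `resolvePath` and the check format are hypothetical, and the environment is passed in as a plain object so that the logic can be exercised outside a browser.

```javascript
// Follow a dotted path such as "navigator.webdriver" from a root object.
function resolvePath(root, path) {
  return path
    .split(".")
    .reduce((obj, key) => (obj == null ? undefined : obj[key]), root);
}

// Run a list of checks and return the paths of those that fired.
// kind: "present" | "missing" | "overwritten" | "equals"
function runChecks(env, checks) {
  return checks
    .filter((check) => {
      const value = resolvePath(env, check.path);
      switch (check.kind) {
        case "present": // (1) a property that should not exist
          return value !== undefined;
        case "missing": // (2) a property that should exist
          return value === undefined;
        case "overwritten": // (3) a native function replaced by script code
          return (
            typeof value === "function" &&
            !/\[native code\]/.test(Function.prototype.toString.call(value))
          );
        case "equals": // (4) a property with a tell-tale value
          return value === check.expected;
        default:
          return false;
      }
    })
    .map((check) => check.path);
}
```

In a page, `env` would be `window` and the check list would be derived from Table 2; which paths belong in that list is exactly what the fingerprint surface determines.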
We tested the detector by setting up four machines: two
Macintoshes and two PCs with Ubuntu. On each machine, we used
OpenWPM and common browsers (Chrome, Safari, Opera
and Firefox). We tested each distinguishing property from
Table 2. Our detector site was able to correctly identify
OpenWPM every single time. Almost all properties uniquely
identify OpenWPM, except for a few WebGL- and screen-related
properties. For a few WebGL properties (roughly 200 of
4K), we found that these also occur on some non-OpenWPM
clients. Ignoring all such properties still leaves a large number
of identifying properties.

    // output of .toString when not instrumented
    "function getContext() {
        [native code]
    }"
    // output of .toString when instrumented
    "function () {
        const callContext = \
            ...
        logCall(objectName + "." + \
            methodName, arguments, callContext, logSettings);
        return func.apply(this, arguments);
    }"

Listing 1: Detectability of OpenWPM’s JavaScript instrument
4.2 RQ2: How stable is the fingerprint surface?
We explored how stable our determined fingerprint surface
is, as new Firefox and OpenWPM versions may appear fre-
quently. To that end, we repeated our experiments for older
versions of OpenWPM (0.11.0 and 0.10.0). In general, we
found that the fingerprint surfaces largely overlap. For example,
on macOS, the number of WebGL deviations in headless
mode increases from 2022 in OpenWPM 0.11.0 to 2037 in
OpenWPM 0.17.0. In the oldest OpenWPM version (0.10.0),
we find that the JavaScript instrument adds two properties
instead of one to the window object. In addition, we also
investigated whether using an unbranded browser (as
OpenWPM does) impacts OpenWPM’s fingerprint. We found no
differences between branded and unbranded Firefox versions.
Using outdated browsers, however, does impact the finger-
print. For example, Google’s reCAPTCHA service assigns a
higher risk to older browser variants [65]. In the past, Open-
WPM’s integrated Firefox version has been behind the official
release of Firefox several times (cf. Table 11 in Appendix B).
We found that OpenWPM used an outdated Firefox browser
for 71% of the last 20 months. In short, this distinction vector
should be expected when using OpenWPM.
5 Incidence of OpenWPM detection
To assess the extent of OpenWPM detection in the wild, we
conduct a large-scale measurement for client-side bot detec-
tion. In detail, we focus on scripts with capabilities to detect
OpenWPM, i.e. scripts with routines to access properties
unique for Selenium-based bots and/or OpenWPM. We find
both general Selenium detectors and OpenWPM-specific detectors.
5.1 Data acquisition and classification
Previous automated approaches [39,40] to
identify bot detectors have either relied on static or dynamic
analysis. The idea behind static analysis is to identify code
patterns in source code that link to known bot detectors or
that use specific bot-related properties. A limitation is that
scripts may create code dynamically, which static analysis
will miss. Moreover, minification and obfuscation
further increase the false negative rate of static analysis.
The alternative approach, dynamic analysis, is to monitor
JavaScript calls that identify a script as bot detector based
on access to bot-related properties. Dynamic analysis does
cover dynamically-generated scripts. Moreover, it does not
monitor the code itself, but only executed calls. An upside
of this is that neither minification nor obfuscation affects the
analysis. On the other hand, code that happens not to be exe-
cuted during the run, is not analysed. Both static and dynamic
analysis have been able to identify some bot detectors in the
wild. It is not clear whether and to what extent the results of
the methods differ in practice for finding web bot detectors.
We combine both methods to increase coverage.
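Combining the two methods amounts to taking the union of the sites each one flags. A minimal sketch, with hypothetical names and site identifiers, makes the bookkeeping explicit:

```javascript
// Combine per-site findings of static and dynamic analysis.
// Both arguments are Sets of site identifiers (e.g. domain names).
function combineFindings(staticSites, dynamicSites) {
  const union = new Set([...staticSites, ...dynamicSites]);
  const both = new Set([...staticSites].filter((s) => dynamicSites.has(s)));
  return {
    union, // flagged by at least one method
    both, // flagged by both methods
    onlyStatic: staticSites.size - both.size,
    onlyDynamic: dynamicSites.size - both.size,
  };
}
```

By inclusion-exclusion, the size of the overlap equals static + dynamic - union. Applied to the cleaned counts reported in Table 5 (15,838 static, 16,762 dynamic, 18,714 in the union), this implies 13,886 sites flagged by both methods.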
In order to assess the extent of client-side bot detection,
we scan the top 100K websites of the Tranco list [45].
We set up an instance of OpenWPM running Firefox in regu-
lar mode. During a site visit, our OpenWPM client stores a
copy of any transmitted JavaScript file and records JavaScript
calls. We add an initial waiting time of 45 seconds after a
completed page load to give websites enough time to perform
JavaScript operations. In addition, our client measures
the presence of bot detection on subpages by opening at
most three URLs extracted from a site’s landing page.
For selecting subpages, we consider only URLs linking to
the same domain. Within this and the following sections, we
apply scheme
to identify a domain. To account for
websites that use same-origin requests to redirect clients
to foreign domains, our client checks if a foreign domain was
entered after following all redirects.
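The subpage selection step can be sketched as follows. This is an illustrative simplification, not the crawler used here: the function name is hypothetical, and two URLs are treated as belonging to the same domain when their hostnames are equal, which is a stand-in for the domain scheme applied in this paper. Redirect handling is not covered.

```javascript
// Pick up to maxPages same-site links from the hrefs found on a landing page.
function selectSubpages(landingUrl, hrefs, maxPages = 3) {
  const base = new URL(landingUrl);
  const seen = new Set([base.href]); // never re-visit the landing page itself
  const selected = [];
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative links
    } catch (err) {
      continue; // skip unparsable links
    }
    url.hash = ""; // fragments point into the same page
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    if (url.hostname !== base.hostname) continue; // simplified same-domain test
    if (seen.has(url.href)) continue;
    seen.add(url.href);
    selected.push(url.href);
    if (selected.length === maxPages) break;
  }
  return selected;
}
```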
Scripts should be classified as bot detectors if they access
the fingerprint surface of OpenWPM. However, certain
scripts may access these attributes for other purposes, such as
checking supported WebGL functionality. To reduce such
false positives, we only classify a script as bot-detecting
when it accesses properties that pertain to browser automation
or are unique to OpenWPM (cf. Sec. 4.1). This leaves only
the following: navigator.webdriver, which is specific to
WebDriver-controlled bots; and the new identifying properties
introduced by OpenWPM’s JavaScript instrumentation
(jsInstruments, instrumentFingerprintingApis and
getInstrumentJS; cf. Table 6). Table 5 shows the results of the data
collection and classification.
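The classification rule can be written down as a short function over the property accesses recorded for one script. The sketch below is hypothetical in its names, labels and input format (a list of dotted property names); the OpenWPM-specific property names are the ones listed in Table 6, and the `window.` prefix is an assumption about how accesses are recorded.

```javascript
// Properties whose access marks a script as a detector under the rule above.
const AUTOMATION_PROPS = new Set(["navigator.webdriver"]);
const OPENWPM_PROPS = new Set(
  ["jsInstruments", "instrumentFingerprintingApis", "getInstrumentJS"].map(
    (name) => "window." + name
  )
);

// Classify one script from the list of properties it was seen accessing.
function classifyScript(accessedProps) {
  const accessed = new Set(accessedProps);
  const touches = (props) => [...props].some((p) => accessed.has(p));
  if (touches(OPENWPM_PROPS)) return "openwpm-detector";
  if (touches(AUTOMATION_PROPS)) return "selenium-detector";
  // Accesses to ambiguous parts of the surface (e.g. WebGL) are ignored.
  return "unclassified";
}
```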
Inherent in the above approach are several as-
sumptions that can impact the results. First, our approach
relies on the fingerprint surface we established. Detectors
based on other methods (e.g., mouse tracking [16]) will be
missed. Second, we do not account for cross-site tracking. A
third-party tracker could classify our client as a bot on one
site and would need only to re-identify the client on another
site, e.g., using IP filtering or regular browser fingerprinting.
This amounts to a form of website cloaking – serving different
content to specific clients. To what extent third party tracking
in general employs cloaking is a different study and left to
future work. Both these limitations may cause underestima-
tion of the number of detectors (false negatives). As such,
our approach approximates a lower bound on the number of
detectors in the wild.
Preprocessing for static analysis.
Within the static analy-
sis, we pre-process scripts to undo straightforward obfusca-
tion. We derive the respective encoding, transform hex literals
to ASCII characters, and remove code comments. We apply
our static analysis to scripts that we collected during our scan
of the Tranco Top 100K, which resulted in 1,535,306 unique
scripts. To identify Selenium-detector scripts, we then use
patterns to look for access to navigator.webdriver
(details can be found in Appendix C).
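A simplified version of this preprocessing can be sketched as below. It is not the pipeline or the pattern set from Appendix C: the function names and the single `webdriver` pattern are assumptions, comment removal is naive (it does not track string literals), and comments are stripped before hex escapes are decoded so that decoded text cannot be mistaken for a comment.

```javascript
// Undo straightforward obfuscation before pattern matching.
function preprocess(source) {
  // Remove block comments, then line comments. The guard before "//"
  // keeps "http://..." and escaped slashes intact.
  let out = source.replace(/\/\*[\s\S]*?\*\//g, "");
  out = out.replace(/(^|[^:\\])\/\/.*$/gm, "$1");
  // Turn \xNN hex escapes into the characters they encode.
  out = out.replace(/\\x([0-9a-fA-F]{2})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
  return out;
}

// Illustrative pattern: any remaining mention of "webdriver" as a word.
function mentionsWebdriver(source) {
  return /\bwebdriver\b/.test(preprocess(source));
}
```

For example, `navigator["\x77\x65\x62\x64\x72\x69\x76\x65\x72"]` decodes to `navigator["webdriver"]` and is flagged, while a script that mentions the word only inside a comment is not.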
Using honey properties to catch iterators.
For the dy-
namic analysis, every recorded access to the fingerprint sur-
face identifies a script with the potential to detect OpenWPM
as a bot. This will also be triggered by scripts that iterate
over all properties, e.g., for regular browser fingerprinting
(re-identification). Determining the purpose of such iteration
requires per-script manual inspection and goes beyond dy-
namic analysis.
To determine whether property iteration takes place, we
extend our client’s navigator and window object with ‘honey’
properties. These honey properties are added on the fly and
use random strings as name. Hence, only a script using prop-
erty iteration would access all honey properties. We assign
scripts that use property iteration into three categories, based
on access to the navigator.webdriver property: definitely
detecting bots, and inconclusive. Iterator scripts are classified
as inconclusive if they do not access navigator.webdriver,
as all accesses to the fingerprint surface could be due to property
iteration. Scripts that iterate the navigator object will
naturally access the navigator.webdriver property. To check whether
this access is only by iteration or intentional, we distinguish
between scripts that trigger our static analysis and those that
do not. Only scripts that do not surface in the static analysis
are classified as inconclusive.
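The honey-property mechanism can be sketched with accessor properties that log every read. This is a minimal illustration under assumed names (`addHoneyProperties`, `usedPropertyIteration`) and an assumed naming scheme for the random strings; it is not the extension code used for the measurement.

```javascript
// Add `count` randomly named, enumerable properties to `target`.
// Each read of a honey property is recorded in `accessLog` (a Set).
function addHoneyProperties(target, count, accessLog) {
  const names = [];
  for (let i = 0; i < count; i++) {
    // Random name; the index suffix guarantees uniqueness.
    const name = "h" + Math.random().toString(36).slice(2, 12) + i;
    Object.defineProperty(target, name, {
      enumerable: true, // must show up when a script iterates the object
      configurable: true,
      get() {
        accessLog.add(name);
        return undefined;
      },
    });
    names.push(name);
  }
  return names;
}

// A script cannot guess random names, so reading all of them
// indicates that it iterated over the object's properties.
function usedPropertyIteration(names, accessLog) {
  return names.every((name) => accessLog.has(name));
}
```

In practice the access log would be kept per script, so that an iterating fingerprinter on a page does not taint the verdict for a targeted detector on the same page.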
5.2 RQ3: How often is OpenWPM detected?
Our results show that, when checking both front- and sub-
pages, at least 16.7% of websites in the Tranco Top 100K
execute scripts that accessed properties specific to Selenium
and, thereby, OpenWPM. Moreover, we also find scripts ac-
cessing OpenWPM-specific properties.
Table 5: Number of websites with Selenium detectors

# sites                                    static   dynamic    union
identified                                 32,694    19,139   38,264
without false positives / ‘inconclusive’   15,838    16,762   18,714
Table 6: Number of sites ordered by script domains accessing
OpenWPM-specific properties
cz gs ad1t
total 331 14 9 2
jsInstruments 331 5 2 2
instrumentFingerprintingApis 0 6 4 0
getInstrumentJS 0 3 3 0
cz:, gs:, ad1t:
OpenWPM-specific properties are accessed in the wild.
Most scripts we found recognise OpenWPM by targeting Selenium.
A small number of detectors also include specific routines
to detect OpenWPM itself. Overall, 356 sites executed
scripts that accessed OpenWPM-specific properties. These
scripts were all included via third-party domains, belonging
to four distinct providers. Table 6 summarises these detectors
and their detection method. Detectors on were
found by both static and dynamic analysis; detectors on the
other three domains used some form of minification, obfus-
cation, and/or dynamic loading, and were only found by dy-
namic analysis. We investigated the four hosting domains by
records, EasyList,
and the WhoTracksMe
database [17]. All domains are related to the advertising indus-
try. The domain
belongs to CHEQ, a company
fighting ad fraud. The scripts hosted by Google domains are
included through Google’s reCAPTCHA service. While we
could not clarify the origin of
, we found
this domain listed in the EasyList for ad domains.
14% of sites have bot detection on the front page.
Figure 4 depicts the distribution for detectors active on the
front page of websites for static and dynamic analysis. Dy-
namic analysis without considering property iteration iden-
tifies 12,208 sites with detectors on the front page. Static
analysis measures the number of sites where bot detection
could be triggered (11,897), including those where detection
is present but not (yet) executed, e.g., where detection is only
triggered after hovering over certain elements. While both
static and dynamic analysis identify a similar number of de-
tectors for each bucket, they do not fully overlap. Combining
both provides a slight increase in the presence of detectors
(1.7K sites).
Figure 3: Number of sites with bot detectors on front- and
subpages (depicted per 1K sites)
Figure 4: Detectors found on front pages
Deep scanning increases the rate of detection by 5 percentage
points.
As discussed in Section 2, 15 of the 60 peer-reviewed studies
conducted with OpenWPM (also) investigated subpages. This
raises the question whether such studies are more often subject
to bot detection, that is: does bot detection occur more
frequently on subpages? Figure 3 depicts the occurrence of
bot detectors on front pages and subpages. In general, studies
examining subpages are at greater risk of being detected: the
number of sites with active detectors increases by at least
37%. Hence, the average detection rate within the Top 100K
sites will increase. That is: the study will be exposed to more
detectors. Combining the results of both measurements, we
see an increase of 5 percentage points (from 14% to 19%).
5.3 RQ4: By whom is OpenWPM detected?
To explore this question, we separated detectors into first and
third parties. We find that the majority of sites includes detec-
tors from third-party domains. We count how often scripts on
these third-party domains are included on scanned sites, tal-
lying each third-party domain once per including site. Some
sites include more than one detector, hence the total num-
ber of inclusions exceeds the number of sites with detectors.
Overall, we count 3,867 first-party detector scripts and 21,325
third-party detector scripts.
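The tallying rule described above (each third-party domain counted once per including site) can be sketched as follows. This is an illustration only: the input data, the function names and the two-label approximation of a registrable domain are assumptions made for this sketch and are not part of our measurement pipeline.

```javascript
// Hypothetical input: for each scanned site, the URLs of detector scripts found on it.
const detectorsBySite = {
  "news.example": [
    "https://ads.tracker.test/bot.js",
    "https://ads.tracker.test/v2/bot.js",
    "https://news.example/static/guard.js",
  ],
  "shop.example": [
    "https://ads.tracker.test/bot.js",
    "https://cdn.other.test/detect.js",
  ],
};

// Simplification (assumption): treat the last two host labels as the registrable
// domain. A real analysis would consult the Public Suffix List instead.
function registrableDomain(host) {
  return host.split(".").slice(-2).join(".");
}

// Count each detector domain at most once per including site.
function tallyInclusions(bySite) {
  const thirdParty = new Map(); // third-party domain -> number of including sites
  let firstPartySites = 0;      // sites serving a detector from their own domain
  for (const [site, urls] of Object.entries(bySite)) {
    const siteDomain = registrableDomain(site);
    const domains = new Set(urls.map((u) => registrableDomain(new URL(u).hostname)));
    for (const domain of domains) {
      if (domain === siteDomain) firstPartySites += 1;
      else thirdParty.set(domain, (thirdParty.get(domain) || 0) + 1);
    }
  }
  return { firstPartySites, thirdParty };
}

const tally = tallyInclusions(detectorsBySite);
console.log(tally.firstPartySites, [...tally.thirdParty.entries()]);
```

Because a site can include several detector domains, the sum over all domains may exceed the number of sites with detectors, as noted above.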
Figure 5: Common categories of sites including detectors
First- and third-party bot detection are used differently
across industries.
We further explore what sites include detectors, as this may
provide a better view on what bot detection is used for. For
that, we collect categories for the identified 16K websites with
detectors based on Symantec’s site review service.
Sites may be assigned multiple categories; for such sites, we
tally each listed category. Figure 5 depicts the 16 most often
tallied categories for both first-party detectors (4,198 times)
and third-party detectors (16,323 times). We find that News
sites are responsible for 18.4% of all third-party inclusions,
followed by Technology (9%) and Business (7%). Interestingly,
the ranks for Shopping (16.4%) and News (5%) switch for
first-party detector inclusions. Moreover, sites in the
categories Finance (8% vs 3%) and Travel (7% vs 2%) make up a
larger portion of the set of first-party inclusions than of
third-party inclusions.
We believe that this uneven distribution of inclusions is
explainable. While every site owner will want to protect their
site from nefarious bots (and thus has reason to include
first-party detection), advertising has become a popular
business model for websites. For such sites, third parties have
a vested interest in detecting bots: to detect ad fraud. Thus,
on such sites, one would expect more third-party bot detection.
Third-party bot detection typically serves the advertise-
ment industry.
Following up on the previous point, we investigated the origins
of third-party detectors. Table 7 breaks down the most commonly
included domains. The top 10 domains account for more than two
thirds of inclusions. The site Who- [17] categorises trackers according to purpose.
Using this, we find that the bot-detecting scripts on the most
commonly included domains can serve a variety of purposes.
For example, a single domain may offer scripts used for
advertising, content delivery, site analytics, social media, and
others. Other uses include web analytics, CDN, and live chat.
However, bot detection is most commonly deployed by advertisers
(e.g., domains 2, 3, 4, 7, 9, and 10 in Table 7).
Table 7: Domains hosting third-party detector scripts
hosting domain (rank)         # inclusions (1/site)   %
all                           21,325                  100%
1                             3,848                   18.04%
2                             2,309                   10.83%
3                             2,165                   10.15%
4                             2,091                   9.81%
5                             1,552                   7.28%
6                             1,061                   4.98%
7                             854                     4.00%
8                             423                     1.98%
9                             416                     1.95%
10                            402                     1.89%
11+ (remaining 704 domains)   6,204                   29.1%
The vast majority of first-party detectors are embedded
third parties.
To determine the origins of first-party bot
detection scripts, we look for similarities between their inclu-
sions of detectors. To do so, we hash the scripts and check
for structural similarities in script URLs (for more details see
Appendix A). We found various similarities amongst unrelated
sites. Scripts originating from Akamai occur most frequently
(1,004 sites). Second is Incapsula (998 sites), third is
an unknown bot detector (659 sites), and fourth is Cloudflare
(486 sites). Together, these top four originators account for
3,147 out of 3,867 sites (81%) where we found first-party
detectors. In contrast to the purpose of third-party detectors,
first-party detectors are not supplied by advertisement compa-
nies. Moreover, Akamai, Incapsula and Cloudflare all offer
commercial bot detection services. With that in mind, one
should expect sites with first-party detectors to likely tailor
their responses for detected bots (e.g., throttling, blocking,
withholding resources, and serving CAPTCHAs).
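The grouping of first-party scripts by content hash and by URL structure can be sketched as follows. The exact similarity procedure is given in Appendix A; the sample records, the helper names and the URL normalisation (replacing digit runs in the path) are assumptions chosen for this illustration only.

```javascript
const { createHash } = require("crypto");

// Hypothetical sample: first-party detector scripts collected from unrelated sites.
const scripts = [
  { site: "a.example", url: "https://a.example/botguard/12345/check.js", body: "detect(1)" },
  { site: "b.example", url: "https://b.example/botguard/987/check.js", body: "detect(1)" },
  { site: "c.example", url: "https://c.example/assets/app.js", body: "render()" },
];

// Identical script bodies yield identical SHA-256 digests.
function sha256(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Assumed normalisation: drop the host and replace digit runs in the path,
// so that per-site identifiers do not hide a shared URL structure.
function urlShape(url) {
  return new URL(url).pathname.replace(/\d+/g, "<n>");
}

// Group sites by an arbitrary key derived from each script record.
function groupSites(items, keyFn) {
  const groups = new Map();
  for (const item of items) {
    const key = keyFn(item);
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(item.site);
  }
  return groups;
}

const byHash = groupSites(scripts, (s) => sha256(s.body));
const byShape = groupSites(scripts, (s) => urlShape(s.url));
console.log([...byShape.entries()].map(([shape, sites]) => [shape, [...sites]]));
```

Groups that span several unrelated sites hint at a common supplier behind scripts that are served from first-party domains.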
6 Attacking JavaScript recording
We investigate whether a malicious website or third party
could corrupt OpenWPM’s data collection process. In partic-
ular, we consider an attacker that can deliver arbitrary content
(HTML, cookies, JavaScript), but cannot break the browser’s
security model. Our focus is on attacks against the integrity
or completeness of measurements. More specifically, we aim to
attack the resilience of OpenWPM’s most
commonly used instruments: HTTP traffic, cookie record-
ing, and JavaScript call recording. Both HTTP and cookie
instruments are simple wrappers around browser functionality.
Breaking them thus requires breaking the browser, which is
outside the attacker model. The JavaScript instrument, on the
other hand, needs to supply all its monitoring functionality
itself. It is therefore clearly in scope of our attacker model.
Since the instruments focus on data recording, we investi-
gate attacks on data recording. More specifically, we consider:
1. whether data recording may be prevented;
2. whether fake data can be injected into the data recorder;
3. whether already recorded data can be deleted or altered;
4. finally, whether the data recording is complete.
Instruments in OpenWPM are implemented as a browser
extension. Extensions are isolated to protect higher privilege
APIs from access by untrusted code. Website scripts thus
cannot directly interact with extensions. However, both ex-
tensions and website scripts can read and change the DOM,
opening the door for injection attacks against extensions that
read the DOM. We conducted source code analysis for each
instrument under investigation to identify vulnerabilities to
such attacks. Below we discuss the found vulnerabilities.
6.1 RQ5: How to prevent data recording?
We found a vulnerability that, when successfully exploited,
allows a website to break OpenWPM’s data recording hooks.
The vulnerability can be leveraged to turn off recording of
JavaScript calls in the JavaScript instrument.
More specifically: the JavaScript instrument overwrites
several API functions by hooking into the DOM’s event dis-
patcher (to record access to them). The event dispatcher then
sends messages to be recorded back to the JavaScript instru-
ment’s back end. To prevent an attacker from silently undoing
these hooks, OpenWPM also hooks into (and thus: records ac-
cess to) setters and getters to these API functions themselves.
However, the event dispatcher itself is not protected. We thus
can alter the event dispatcher to inject our own messages and
manipulate messages sent to OpenWPM (cf. Listing 2). To
carry out this attack, the attacker overrides the event dispatcher
to block all messages (all events from instrumented objects).
This would already block OpenWPM’s recording. However, it would
also break the website’s own JavaScript. To block only OpenWPM
messages, the block needs to be tailored. Conveniently, OpenWPM
tags messages with an ID to identify any monitored objects. Though
this ID is randomly generated, it can easily be determined:
simply trigger an API call to a monitored object, acquire the
random ID from the observed message, and update the event
dispatcher to only block messages containing this ID.
6.2 RQ6: Can fake data be injected?
The previous attack, altering the event dispatcher, not only
allows an attacker to block data recording, it also allows an
attacker to learn the ID OpenWPM uses to record data. This
is sufficient to inject almost arbitrary messages to be recorded.
The attacker simply creates a custom event following the for-
mat used by OpenWPM’s JavaScript extension and includes
OpenWPM’s assigned event ID. This enables an attacker to
define most of the content of the resulting entry in OpenWPM’s
recording, such as the executing script URL or which function
was called. Crucially, though, the website that originated the
call is set outside of the browser by OpenWPM. The data sent by
the event dispatcher is properly sanitized by the back-end,
which prevents spoofing this. We can thus only inject fake data
for the currently visited website. Note that a third party
included on the site can also execute this attack.
// Step I: Retrieve OpenWPM's random ID
const dispatch_fn = document.dispatchEvent.bind(document); // original dispatcher
function grabID() {
  return new Promise((resolve, reject) => {
    document.dispatchEvent = function (event) {
      const id = event.type;
      document.dispatchEvent = dispatch_fn; // restore the original dispatcher
      if (id !== undefined) { resolve(id); }
      else { reject(new Error("no event ID observed")); }
    };
    // Perform an action to grab the ID: any access to a monitored object works;
    // navigator.userAgent is assumed to be monitored here.
    void navigator.userAgent;
  });
}
// Step II: Overwrite event dispatcher to block events
async function attackExtension() {
  const id = await grabID();
  document.dispatchEvent = (event) => {
    if (event.type !== id) { return dispatch_fn(event); } // Dispatch event
    console.log("Event swallowed: " + event.type);
    return true;
  };
}
Listing 2: Turn off the script recorder
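The injection of fake records can be illustrated outside a browser with a small simulation. It is a sketch under stated assumptions: a plain EventTarget stands in for the page’s document, an array stands in for the instrument’s back end, and the event class, field names and URLs are hypothetical and do not reflect OpenWPM’s actual message format.

```javascript
// Stand-ins (assumptions): a plain EventTarget plays the page's document and
// an array plays the instrument's recording back end.
const doc = new EventTarget();
const recorded = [];
const secretId = Math.random().toString(); // random event ID chosen by the instrument

class DetailEvent extends Event {
  constructor(type, detail) {
    super(type);
    this.detail = detail;
  }
}

// Instrument side: record every event that carries the random ID.
doc.addEventListener(secretId, (event) => recorded.push(event.detail));

// Instrumented API: reports each call through the shared dispatcher.
function instrumentedCall(symbol) {
  doc.dispatchEvent(
    new DetailEvent(secretId, { symbol, scriptUrl: "https://site.example/app.js" })
  );
}

// Attacker, step 1: wrap the dispatcher once to learn the random ID.
const originalDispatch = doc.dispatchEvent.bind(doc);
let leakedId;
doc.dispatchEvent = (event) => {
  leakedId = event.type;
  return originalDispatch(event);
};
instrumentedCall("navigator.userAgent"); // any monitored call leaks the ID
doc.dispatchEvent = originalDispatch;

// Attacker, step 2: forge a record with freely chosen content.
originalDispatch(
  new DetailEvent(leakedId, { symbol: "window.fakeApi", scriptUrl: "https://innocent.example/lib.js" })
);

console.log(recorded);
```

In the simulation the forged entry is indistinguishable from a genuine one on the recording side, which mirrors why the back end can only vouch for fields it sets itself.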
6.3 RQ7: Can records be deleted or altered?
Whereas the previous attacks exploited a vulnerability in
the DOM-parsing front-end of the respective instruments,
deleting already recorded data requires manipulating an in-
strument’s back-end, for OpenWPM: SQLite. Attacking a
database back-end requires an SQL injection vulnerability.
As already mentioned, OpenWPM’s data recording back-end
properly sanitizes its inputs. This means that there is no pos-
sibility for an SQL injection via JavaScript recording. There-
fore, we conclude that it is not feasible to delete or alter al-
ready recorded data from OpenWPM’s SQLite database.
6.4 RQ8: Is data recording complete?
We investigated whether data recording is complete. We found
two different attacks against completeness: existence of unob-
served channels, and silent delivery of JavaScript code.
Existence of unobserved channels:
During our evalua-
tion, we found a way to bypass OpenWPM’s recording of
JavaScript function calls. This attack again exploits Open-
WPM’s hooks to record function calls. In particular, the hooks
must be attached to every object that is to be observed. For ev-
ery new window or iframe, this must be done afresh. However,
there is a long-standing bug in Chrome and Firefox (cf. [67]),
where both browsers under some circumstances fail to inject
scripts into iframes. We tested whether OpenWPM’s implementation
is affected by this and found that this is indeed the case.
Our evaluation of this attack involves two different ways to
access an iframe’s DOM (window.frames[0] and
frame.contentWindow) and two choices for creating and executing
iframes and their code: static vs. dynamic creation and
immediate vs. delayed
execution. Of these, immediate code execution (at creation
time) is required to successfully exploit this bug. None of the
other parameters we tested influenced the result. Listing 3
shows a proof-of-concept of this type of attack.
Silent delivery of JavaScript code:
Note that the aforementioned attacks based on JavaScript would
appear in OpenWPM recordings if the HTTP instrument is used,
since that instrument collects response bodies. That is, unless
this instrument’s recording can also be bypassed. We indeed
managed to extend our previous attacks so that their code is
delivered to the OpenWPM client silently. For that, we looked at
the two options that OpenWPM offers to collect response bodies.
OpenWPM either stores all response bodies (full coverage), or
it can be set to store JavaScript files only. The latter option
significantly reduces stored content. For this mode, we found
that an attacker can silently deliver JavaScript code by sending
it as text and processing it client-side, e.g., by including a
line like
<link src=""
in the HTML source. To successfully bypass OpenWPM’s traffic
recording of JS files, three aspects must be accounted for:
i. The content-type attribute must be set to something other
than text/javascript;
ii. The src attribute must not contain a “.js” extension;
iii. The delivered file is not automatically executed; this must
be handled by a different client-side script.
// Operation will not appear in the recordings.
setTimeout(() => {
  let element = document.querySelector("#unobserved");
  let iframe = document.createElement('iframe');
  // HTML code for instantiating an iFrame
  iframe.src = "unobserved-iframe.html";
  // Attach the iframe so that it is created and its code runs immediately
  element.appendChild(iframe);
}, 500);
Listing 3: Example of an unobserved channel
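The conditions for bypassing a JavaScript-only response filter can be illustrated with a hypothetical filter. The function below is a stand-in written for this sketch and is not OpenWPM’s actual filtering code; the URLs, the payload and the use of the Function constructor as the client-side loader are likewise assumptions.

```javascript
// Hypothetical stand-in for a "store JavaScript files only" filter: a response
// is kept if its content type or its URL path suggests JavaScript.
function looksLikeJavaScript({ url, contentType }) {
  const type = (contentType || "").toLowerCase();
  const path = new URL(url).pathname.toLowerCase();
  return type.includes("javascript") || path.endsWith(".js");
}

const honest = { url: "https://site.example/detect.js", contentType: "text/javascript" };
// Conditions (i) and (ii): neutral content type, no ".js" extension.
const disguised = { url: "https://site.example/notes", contentType: "text/plain" };

console.log(looksLikeJavaScript(honest));    // stored by the filter
console.log(looksLikeJavaScript(disguised)); // skipped by the filter

// Condition (iii): the text is not executed automatically, so a separate
// loader script has to turn it into code (here via the Function constructor).
const payload = "globalThis.__payloadRan = true;";
new Function(payload)();
console.log(globalThis.__payloadRan);
```

Any filter that decides on metadata alone shares this weakness, since the metadata is fully under the control of the party delivering the content.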
7 Improving OpenWPM’s reliability
This section focuses on OpenWPM’s reliability as an instru-
ment measuring the web as encountered by regular visitors.
We explore how and to what extent reliability can be improved.
To do so, we design an approach to hardening OpenWPM’s
instrumentation and to hiding its distinctive fingerprint (from
here on referred to as WPM_hide). Our proof-of-concept
successfully hides the telltale signs of OpenWPM from its
fingerprint and makes OpenWPM robust in the face of the discussed
attacks in a lab setting. To evaluate its effectiveness in an open
world setting, we run WPM_hide against detectors in the wild
and contrast its measurements with those of a regular OpenWPM
client.
7.1 RQ9: How to hide the fingerprint surface?
OpenWPM’s characteristic fingerprint varies with the vari-
ous modes of running OpenWPM. For example, in headless
Firefox mode, the fingerprint surface is difficult to hide due
to headless mode’s lack of functionality when compared to
regular browsers. Hence, we focus on run modes where Open-
WPM runs the browsers natively (Regular Mode). For such
modes, we achieve stealth by overriding properties without
leaving traces. These techniques can also be applied in other
run modes (e.g., virtualisation).
The identifying properties for Regular Mode (see Table 2) relate
to the webdriver property, window position,
and dimension. Of OpenWPM’s various instruments, only
the JavaScript instrument causes further identifiable proper-
ties. Hiding these properties can be achieved by a customized
browser, or by including additional code inside a page’s scope.
Implementing the former requires significant work, but it can
hide the fingerprint near-perfectly. The latter approach is far
simpler to implement but risks leaving residual traces. For
our proof-of-concept, we choose the second option, as it can
be seamlessly integrated within the current OpenWPM frame-
work without significant effort.
Our proof-of-concept must address two aspects: hiding
the automation components and preventing detection of
instrumentation. To prevent detection of instrumentation, four
issues need fixing (Sec. 4.1): (1) calling the toString
operation of overwritten functions must return the regular
output string for browser functions; (2) no additional property
may appear in the DOM; (3) stack traces must not show any
signs of the instrumentation; (4) prototype pollution must be
avoided. Lastly, hiding the automation components requires
hiding their detectable aspects: window size, window position,
and the webdriver property.
Preserve toString output.
For the first issue, we found that CanvasBlocker addresses this
well. Its implementation successfully fools all our
fingerprinting tests (Sec. 4). CanvasBlocker creates a getter
function with an identical signature to the function that must
be overwritten and attaches it to the DOM based on a specific
Firefox feature called exportFunction. The newly exported
function is then used to redefine the getter of an object’s
prototype for a specific property. As a result, the overwritten
function returns the native code string like a default browser
property (cf. Listing 1).
Normally, accessing the getter of an object’s prototype leads
to an error. If this getter is replaced with a custom getter, that
error is never thrown. This makes tampering with properties
via an object’s prototype detectable [34]. Calling the original
getter from the customised getter results in the original error
being thrown, addressing this aspect of the fingerprint surface.
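Why the toString output matters can be illustrated with a minimal sketch of the detection side. It shows how a page script can tell a native function from a naive wrapper; it does not reproduce the exportFunction-based fix, which is specific to Firefox extensions. The helper names are assumptions of this sketch.

```javascript
// A native function stringifies to a body of the form "{ [native code] }".
function looksNative(fn) {
  return /\{\s*\[native code\]\s*\}\s*$/.test(Function.prototype.toString.call(fn));
}

const nativeMax = Math.max;
const calls = [];

// Naive instrumentation: record the call, then forward to the original.
function wrappedMax(...args) {
  calls.push(args);
  return nativeMax.apply(this, args);
}

console.log(looksNative(nativeMax));  // the untouched function passes the check
console.log(looksNative(wrappedMax)); // the wrapper exposes its own source code
```

A wrapper that is to remain hidden therefore has to stringify exactly like the function it replaces.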
Preserve clean DOM.
The second issue arises during page
load, prior to the page’s JavaScript activation. The instru-
mentation injects its code as script from the content context
into the page context, overwrites the needed properties, and
removes its code from the page context again. However, in
practice, not all injected functions are deleted. We update
the instrument to overwrite all functionality directly from the
content context, thus keeping the page context clean.
Faking stack traces.
The third issue requires the stack
trace to show no signs of instrumented functions. A web
page can only access stack traces if errors occur. Normally,
if an error occurs, the stack trace would show that the called
function is called from inside the instrumentation. We address
this by catching each error and throwing a new error with
properly adjusted values for file name, column, message, and
line number.
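The idea of rethrowing a cleaned-up error can be sketched as follows. Note that this sketch swaps in a simpler technique than the one described above: instead of adjusting file name, column and line number, it filters the stack frames that name the wrapper. The wrapper name and the helper functions are assumptions of this illustration.

```javascript
// Wrap a function so that calls are recorded, while errors that escape the
// wrapper carry a stack without the wrapper's own frame.
function instrument(original, record) {
  return function instrumentWrapper(...args) {
    record(original.name, args);
    try {
      return original.apply(this, args);
    } catch (err) {
      // Rebuild the error and drop every frame that mentions the wrapper.
      const clean = new err.constructor(err.message);
      clean.stack = String(err.stack)
        .split("\n")
        .filter((line) => !line.includes("instrumentWrapper"))
        .join("\n");
      throw clean;
    }
  };
}

function parseStrict(text) {
  return JSON.parse(text);
}

const log = [];
const wrapped = instrument(parseStrict, (name, args) => log.push({ name, args }));

let caught;
try {
  wrapped("{not json");
} catch (err) {
  caught = err;
}
console.log(caught.name, caught.stack.includes("instrumentWrapper"));
```

The rethrown error keeps its type and message, so page scripts that inspect errors observe the same behaviour as without instrumentation.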
Avoid prototype pollution.
The last issue relates to the pol-
lution of an object’s prototype, as OpenWPM’s instrument
modifies only the first prototype in the prototype chain. We
address this by overwriting properties per prototype. Unfortu-
nately, this approach has its own limitation, as it is not possible
to determine the caller of a function, when a prototype has
multiple children. Especially for prototypes located higher up
the chain, the number of children increases; raising the poten-
tial to capture unwanted API calls on other children objects.
To test our implementation, we instrumented the same API
calls as used by OpenWPM. Luckily, most of these APIs
are provided by prototypes close to the bottom, which allows
us to cover a wide set of OpenWPM’s instrumented APIs.
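Overwriting a property on the prototype that actually defines it, and the limitation this brings, can be sketched with a toy class hierarchy. The classes and helper names are hypothetical stand-ins for DOM prototypes and are assumptions of this sketch.

```javascript
// Toy prototype chain standing in for DOM interfaces (assumption).
class BaseNode {
  get nodeName() { return "#node"; }
}
class ElementLike extends BaseNode {}
class CanvasLike extends ElementLike {}

// Walk up the chain to the prototype that owns the property.
function owningPrototype(obj, prop) {
  for (let p = Object.getPrototypeOf(obj); p !== null; p = Object.getPrototypeOf(p)) {
    if (Object.prototype.hasOwnProperty.call(p, prop)) return p;
  }
  return null;
}

// Replace the getter in place, keeping the remaining descriptor fields.
function hookGetter(proto, prop, onAccess) {
  const desc = Object.getOwnPropertyDescriptor(proto, prop);
  Object.defineProperty(proto, prop, {
    ...desc,
    get() {
      onAccess(this);
      return desc.get.call(this);
    },
  });
}

const seen = [];
const canvas = new CanvasLike();
const owner = owningPrototype(canvas, "nodeName");
hookGetter(owner, "nodeName", (self) => seen.push(self.constructor.name));

canvas.nodeName;            // intended target
new ElementLike().nodeName; // sibling access is captured as well
console.log(seen);
```

No new own property appears on CanvasLike.prototype, but every descendant of the owning prototype now triggers the hook, which is the over-capturing limitation noted above.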
Preventing detection of automation components.
The automation components are detectable by window size,
window position and the webdriver attribute. For the latter, our
hidden version must set the webdriver attribute
to false like a regular Firefox browser. Since Firefox
version 88, this flag is not user-settable.
We override the getter function of the webdriver property in the
same fashion as described in the previous section. To change
OpenWPM’s default window settings, we introduce a settings file
that makes the window size and position settable in OpenWPM.
7.2 RQ10: How to mitigate recording attacks?
Securing messaging from page context to background
context (see Sec. 6.1, 6.2).
A key benefit of migrating to Firefox’s exportFunction, as
described in the previous section, is the ability to export
higher privileged browser functions into the page. Hence, we can
port functionality to the page context that is otherwise only
available for content or background scripts of a browser
extension. We use this to secure our instrumented functions, as
we can now use an extension messaging API to pass messages from
the page to the background context. It is crucial that such
functionality is exported to a private scope of an overwritten
function to prevent access by other scripts in the page context.
This prevents the ‘turn recording off’ and ‘inject fake data’
attacks, as an attacker cannot manipulate message transmissions
to the background script.
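The effect of keeping the message channel in a private scope can be illustrated with a simulation. A plain EventTarget stands in for the shared page document and an array for the background script; the closure-held sender stands in for the exported privileged function. All names are assumptions of this sketch.

```javascript
const page = new EventTarget(); // stand-in for the shared page document
const background = [];          // stand-in for the background script's log

class RecordEvent extends Event {
  constructor(detail) {
    super("record");
    this.detail = detail;
  }
}

// Variant 1: the instrumented function looks up page.dispatchEvent at call
// time, so any page script can replace the dispatcher.
page.addEventListener("record", (event) => background.push("dom:" + event.detail));
function viaSharedDispatcher(symbol) {
  page.dispatchEvent(new RecordEvent(symbol));
}

// Variant 2: the sender lives in a closure that page scripts cannot reach.
function makeInstrumented(send) {
  return function instrumented(symbol) {
    send(symbol);
  };
}
const viaPrivateChannel = makeInstrumented((symbol) => background.push("private:" + symbol));

// Attack from Sec. 6.1: swallow everything that passes the shared dispatcher.
page.dispatchEvent = () => true;

viaSharedDispatcher("navigator.userAgent"); // silently dropped
viaPrivateChannel("navigator.userAgent");   // still recorded
console.log(background);
```

Only the record sent through the closure-held sender arrives, since the attacker has no reference through which the private channel could be overridden.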
Improving coverage of the JavaScript instrument (see
Sec. 6.4).
To address the tested variants of incomplete
recordings, we use CanvasBlocker’s frame protection. The
basic idea is to intercept APIs used by page scripts to mod-
ify the DOM or create a new, non-instrumented copy of the
DOM. This ensures that each modification or newly con-
structed DOM contains the instrumentation. Our implemen-
tation covers five cases: window constructors, DOM modi-
fication API, window mutations, and DOM creation via the
document.write API, and finally the API.
Filtering of the HTTP file recorder (see Sec. 6.4).
To the
best of our knowledge, there is no known way to distinguish
JavaScript code from text that is robust against a dedicated
obfuscator. Therefore, an active adversary should be assumed
to be capable of hiding JavaScript in a way that would ac-
cidentally be filtered out. Since this issue only arises in the
presence of active adversaries, we recommend in such a case
not to use any filtering.
7.3 Evaluation of PoC implementation
We developed a proof-of-concept implementation to hide
the tell-tale signs of automation and to mitigate the found
attacks. We evaluate the impact of our proof-of-concept
implementation (from here on, WPM_hide) on web measurements
when encountering bot detection in the wild. To that end, we
contrast its results with vanilla OpenWPM (from here on:
WPM) in HTTP traffic, cookies, JavaScript execution, and
delivered JavaScript files. We test on all sites with bot
detectors (as found by dynamic analysis) from the Tranco Top 5K
(see Sec. 5). This list contains 1,417 sites with either
first-party or third-party detectors. On these sites, we run WPM
and WPM_hide in parallel (OpenWPM v.0.18.0, regular mode,
HTTP, JavaScript and cookie instrument activated) and configure
each browser to idle 60 seconds on a page after loading
completed. We take steps to mitigate noise in measurements.
In particular, we avoid cross-client interferences by separating
both crawlers via two individual machines and IP addresses.
Each IP address belongs to a residential network and comes
from the same municipality and internet provider, which avoids
differences caused by cloud-based IP blocking [37] and
geolocation. Secondly, we re-synchronise the machines every 100
visits. This ensures that sites are loaded roughly simultaneously
on both machines (max. offset is below four minutes).
Sites that detect OpenWPM serve fewer media resources.
In our experiment, we found that WPM_hide encounters 3.45%
more HTTP requests. As our data set is not normally
distributed, we tested for significance using the Wilcoxon
signed-rank test at a confidence level of 95%. For that, we
divided the traffic into first- and third-party requests and find
significant differences for HTTP requests to both first and
third parties (p < 0.0001). In more detail, for WPM, 175 sites
(12%) lead to more first-party and 472 sites (33%) to more
third-party requests. For WPM_hide, we count 400 sites (28%)
with more first-party and 654 sites (46%) with more third-party
requests. This indicates a stronger variability in third-party
traffic, leaning towards less detectable clients.
Table 8 shows requests for each machine per requested resource
type. The table shows that WPM_hide receives roughly one and a
half times the number of audio and video files (type media).
Moreover, the number of requested images (image and imageset)
increases by about 3%, and executable code (script) by about 4%.
WPM, in turn, incurs three times the number of CSP violations,
though this may also be due to embedding more JavaScript in
the page context. Finally, the difference in websocket requests
is due to a single outlier. Thus, we do not expect websocket
requests to change significantly between WPM and WPM_hide.
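For readers unfamiliar with the test, the Wilcoxon signed-rank statistic for paired per-site counts can be sketched as follows. This is a minimal illustration, not the analysis code used for our evaluation: it uses the normal approximation without tie or continuity correction, and the sample numbers are made up.

```javascript
// Wilcoxon signed-rank statistic for paired samples x[i], y[i].
// Assumptions: zero differences are dropped, tied absolute differences get
// average ranks, and z uses the plain normal approximation.
function wilcoxonSignedRank(x, y) {
  const diffs = x.map((v, i) => y[i] - v).filter((d) => d !== 0);
  const n = diffs.length;
  const sorted = diffs
    .map((d) => ({ abs: Math.abs(d), sign: Math.sign(d) }))
    .sort((a, b) => a.abs - b.abs);

  let wPlus = 0;
  let wMinus = 0;
  for (let i = 0; i < n; ) {
    let j = i;
    while (j + 1 < n && sorted[j + 1].abs === sorted[i].abs) j++;
    const rank = (i + 1 + (j + 1)) / 2; // average rank of the tie group
    for (let k = i; k <= j; k++) {
      if (sorted[k].sign > 0) wPlus += rank;
      else wMinus += rank;
    }
    i = j + 1;
  }

  const mean = (n * (n + 1)) / 4;
  const sd = Math.sqrt((n * (n + 1) * (2 * n + 1)) / 24);
  const z = (Math.min(wPlus, wMinus) - mean) / sd;
  return { n, wPlus, wMinus, z };
}

// Made-up request counts per site for the two clients.
const requestsWpm = [10, 20, 30, 40, 50, 60];
const requestsHide = [11, 18, 33, 44, 45, 60];
const result = wilcoxonSignedRank(requestsWpm, requestsHide);
console.log(result);
```

For sample sizes as small as in this toy input an exact table would be needed; the normal approximation is only reasonable for larger samples such as the 1,417 sites considered here.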
Equivalent amount of ads/trackers traffic.
To assess the amount of trackers and advertisers in traffic, we
use the same approach as previous works [5, 14, 41]: use the
EasyList and EasyPrivacy blocklists to identify trackers. Our
results show that WPM and WPM_hide encounter a near equal rate
of advertisers and trackers. For WPM, ads and trackers account
for 14.3% and 11.6% of total traffic. For WPM_hide, this is
14.2% and 11.5%, respectively, which is almost equivalent.
Large differences in served cookies.
For cookies, we contrasted the number of cookies between both
variants. We found that these differ significantly for both
first parties and third parties (p < 0.0001). Specifically, 305
sites serve WPM_hide more first-party cookies, while only 146
sites serve WPM more first-party cookies. Interestingly, the
opposite is true for third parties. Here we find 824 sites whose
third parties offer WPM more cookies than WPM_hide; the other
way around happens for the third parties of 227 sites. In total,
the number of cookies is 55,853 (WPM) vs. 46,736 (WPM_hide).
Using WPM_hide thus leads to a decrease of 16.32% of cookies.
Table 8: Comparison of HTTP request resource types
Resource type     WPM       WPM_hide   Change
csp_report        884       298        -66.29%
websocket         467       242        -48.18%
media             378       552        +46.03%
beacon            3,804     4,453      +17.06%
imageset          4,888     5,432      +11.13%
xmlhttprequest    46,199    49,398     +6.92%
script            73,527    76,430     +3.95%
object            53        55         +3.77%
other             92        95         +3.26%
main_frame        3,883     3,757      -3.24%
image             101,256   103,801    +2.51%
sub_frame         11,119    10,885     -2.10%
stylesheet        9,663     9,840      +1.83%
font              9,557     9,704      +1.54%
Total             265,770   274,942    +3.45%
We also looked at cookies as possible means to track users.
To determine whether a cookie can be used for web tracking,
we use the approach of Englehardt et al. [28], as refined by
Chen et al. [15]. According to this method, a cookie may be
used for tracking when: (1) it is not a session cookie, (2)
the length of the cookie is 8 or more characters (excluding
surrounding quotes), (3) the cookie is always set, and (4) the
values differ significantly based on the Ratcliff-Obershelp
algorithm [10]. While 5,307 cookies satisfy these criteria for
WPM, only 2,282 cookies for WPM_hide match; a decrease of 57%.
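The four criteria can be sketched in code as follows. This is an illustration under stated assumptions rather than the implementation of [28] or [15]: the record fields (value, session), the one-record-per-run input shape and the similarity threshold of 0.66 are hypothetical choices made for this sketch.

```javascript
// Ratcliff-Obershelp: count characters matched by recursively taking the
// longest common substring and continuing on both sides of it.
function matchedChars(a, b) {
  if (a.length === 0 || b.length === 0) return 0;
  let best = 0, endA = 0, endB = 0;
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) { best = cur[j]; endA = i; endB = j; }
      }
    }
    prev = cur;
  }
  if (best === 0) return 0;
  return best
    + matchedChars(a.slice(0, endA - best), b.slice(0, endB - best))
    + matchedChars(a.slice(endA), b.slice(endB));
}

// Similarity in [0, 1]: twice the matched characters over the total length.
function similarity(a, b) {
  if (a.length + b.length === 0) return 1;
  return (2 * matchedChars(a, b)) / (a.length + b.length);
}

// observations: one record per measurement run, undefined if the cookie was
// not set in that run. maxSimilarity is an assumed threshold.
function mayTrack(observations, maxSimilarity = 0.66) {
  if (observations.some((o) => o === undefined)) return false; // (3) always set
  if (observations.some((o) => o.session)) return false;       // (1) not a session cookie
  const values = observations.map((o) => o.value.replace(/^"|"$/g, ""));
  if (values.some((v) => v.length < 8)) return false;          // (2) at least 8 characters
  for (let i = 0; i < values.length; i++) {
    for (let j = i + 1; j < values.length; j++) {
      // (4) values must differ significantly between runs
      if (similarity(values[i], values[j]) >= maxSimilarity) return false;
    }
  }
  return true;
}

const perRun = [
  { value: "a81f3c9e77", session: false },
  { value: "ZQ0xk2mP4d", session: false },
];
console.log(mayTrack(perRun));
```

A cookie whose value stays (nearly) identical across runs fails criterion (4), as it cannot distinguish clients from one another.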
8 Conclusions
Reliability of automated measurements on trial.
Our work demonstrates that OpenWPM is susceptible to attacks
threatening its reliability. In particular: virtualisation makes
scaling web studies easy, but turned out to undermine Open-
WPM’s reliability as a measurement tool. It is an open ques-
tion whether other automation / measurement frameworks
suffer similarly from virtualisation.
Bot detection on the rise.
In comparison with previous studies, we see the number of sites
looking for the webdriver property has significantly increased
in the span of less than one year (Table 9). This rapid change
clearly suggests that web sites are swiftly transitioning to
responding differently to automated clients than to regular
clients. Web studies should therefore no longer ignore the
potential impact of bot detection on their study.
Table 9: Studies measuring webdriver property access on
front pages
study        when      analysis   corpus        # sites   %
[40]         2019–10   dynamic    Alexa 50K     2,756     5.51%
This paper   2020–07   combined   Tranco 100K   13,989    13.99%
                       static                   11,957    11.96%
                       dynamic                  12,194    12.19%
Towards robust instrumentation.
Our findings highlight
the difficulties of deploying instruments via the page context.
To improve robustness, we advocate moving the instruments
outside of page scope. To achieve this, the debugger API
could be leveraged. However, OpenWPM uses Selenium v3,
which does not support this (planned for Selenium v4). Alter-
natively, instrumentation could be integrated in the browser’s
source code. This would give great flexibility in hiding
distinctive aspects of the browser fingerprint. However, it
would also incur significant additional maintenance overhead,
slowing adoption of new browser versions. As OpenWPM’s
rate of adoption is already slow, the trade-off may be
worth it.
Advice for conducting a web measurement study.
Although the evaluation of our proof-of-concept is limited in scope,
we still find significant differences in a variety of attributes.
While studies that focus on the amount of traffic seem to be
(for now) in the clear, studies that focus on audio/video files or
web tracking via cookies must take bot detection into account
(Table 8). Similarly, studies that automatically crawl beyond
the front page will encounter more bot detectors (Table 1).
Our work aims to make OpenWPM a more reliable
measurement framework. We responsibly disclosed our findings
and shared fixes for the identified issues. This helps make
OpenWPM less detectable, and therefore its results more reli-
able. Of course, a less detectable scraper may itself be abused.
For attacking specific sites, our improvements do not greatly
impact the attack surface: a less detectable OpenWPM is a fine
tool for studying thousands of sites, but not for a targeted at-
tack on a specific site. For attacks that span thousands of sites
(e.g., clickfarming), our improvements do not help: disguising
as a regular browser is insufficient to overcome contemporary
defenses. For that, site-specific fingerprints are needed [72].
Thus, existing re-identification-based countermeasures (e.g.,
rate limiting) are not impacted.
Availability & responsible disclosure.
Our stealth exten-
sion is available via GitHub.
We disclosed our findings
(both attacks and identifiable properties) to the OpenWPM de-
velopers. We are working towards having our fixes integrated
into the framework.
