9.6 Million Links in Source Code Comments: Purpose, Evolution

9.6 Million Links in Source Code Comments:

Purpose, Evolution, and Decay

Hideaki Hata

∗

, Christoph Treude

†

, Raula Gaikovina Kula

∗

and Takashi Ishio

∗

Nara Institute of Science and Technology

{hata, raula-k, ishio}@is.naist.jp

†

University of Adelaide

[email protected]

Abstract—Links are an essential feature of the World Wide

Web, and source code repositories are no exception. However,

despite their many undisputed beneﬁts, links can suffer from

decay, insufﬁcient versioning, and lack of bidirectional traceabil-

ity. In this paper, we investigate the role of links contained in

source code comments from these perspectives. We conducted a

large-scale study of around 9.6 million links to establish their

prevalence, and we used a mixed-methods approach to identify

the links’ targets, purposes, decay, and evolutionary aspects.

We found that links are prevalent in source code repositories,

that licenses, software homepages, and speciﬁcations are common

types of link targets, and that links are often included to provide

metadata or attribution. Links are rarely updated, but many link

targets evolve. Almost 10% of the links included in source code

comments are dead. We then submitted a batch of link-ﬁxing

pull requests to open source software repositories, resulting in

most of our ﬁxes being merged successfully. Our ﬁndings indicate

that links in source code comments can indeed be fragile, and

our work opens up avenues for future work to address these

problems.

I. INTRODUCTION AND MOTIVATION

When Ted Nelson started Project Xanadu

in 1960, he

envisioned “an entire form of literature where links do not

break as versions change; where documents may be closely

compared side by side and closely annotated; where it is

possible to see the origins of every quotation; and in which

there is a valid copyright system—a literary, legal and business

arrangement—for friction-less, non-negotiated quotation at

any time and in any amount” [25]. Links were supposed to

be visible and could be followed from all endpoints, with

permission to link to a document explicitly granted by the

act of publication [2]. Decades later, Nelson witnessed the

birth of the World Wide Web, which in his words “trivialized

this original Xanadu model, vastly but incorrectly simplifying

these problems to a world of fragile ever-breaking one-way

links, with no recognition of change or copyright, and no

support for multiple versions or principled re-use” [25]. As

predicted by Nelson, the Internet and its implementation

of links have afforded us countless opportunities since, but

also experienced issues such as link decay [17], [22], digital

plagiarism [9], and the need to rely on external services to

keep historical copies of web content [24].

http://www.xanadu.com/

In this work, we investigate the role of links contained in

source code comments from the perspective of these oppor-

tunities and challenges: what purposes do they serve, how do

they and their targets evolve, and how often do they break?

The signiﬁcance of this work is closely related to software

documentation [33] and self-admitted technical debt [28]. To

improve documentation and mitigate potential issues, it is

important to understand developers’ typical knowledge sharing

activities by referencing external sources, and to investigate

link decay as a potential problem.

Our work is related to and inspired by recent research on

source code comments in terms of documentation, traceability,

licensing, and attribution. For example, source code comments

have been found to document technical debt [28] and to

support articulation work [36]. They are fragile with respect

to identiﬁer renaming, i.e., traceability between comments and

code is easily lost [32]. Source code comments located at the

beginning of a ﬁle often include a text or a link indicating

the copyright and license information of the ﬁle [12]. These

comments are updated during the evolution of a product by

the copyright holders [42]. Links in source code comments

are sometimes used for attribution when source code has

been taken from elsewhere—however, the vast majority of

code snippets is copied without attribution [7], [8]. Despite

these research efforts, to the best of our knowledge, the

role of links in source code comments has not been studied

comprehensively so far.

To ﬁll this gap, in this paper, we ﬁrst lay the foundation

for understanding the role of links in source code comments

by collecting 9,654,702 links from source code comments in

25,925 Git repositories. Our parser is able to extract comments

from source code written in 7 programming languages. We

ﬁnd that links in source code comments are common: more

than 80% of the repositories in our study contained at least one

link. Through a qualitative study of a stratiﬁed sample of 1,146

links, we establish the kinds of link targets that are referenced

in source code comments. To understand how links are used to

indicate issues related to attribution, technical debt, copyright,

and licensing, our qualitative study also uncovers the various

purposes for including links in source code comments. We

ﬁnd that licenses, software homepages, and speciﬁcations are

among the most prevalent types of link targets, and that links

are often used to provide metadata or attribution.

arXiv:1901.07440v2 [cs.SE] 15 Feb 2019

Link decay has the potential of making documentation in

source code comments fragile and buggy. We investigate this

issue from two perspectives: we analyze the evolution of the

links in the repositories’ commit histories and we examine how

often link targets referenced in source code comments change.

We ﬁnd that links are rarely updated, but their targets evolve,

in almost 10% of all cases leading to dead links. We then

submit ﬁxes to a subset of these broken links as pull requests,

most of which were successfully merged by the maintainers

of the corresponding open source projects.

In summary, this paper’s contributions are three-fold:

• a large-scale and comprehensive study of around 9.6

million links to establish the prevalence of links in source

code comments,

• a mixed-methods study to identify targets, purposes, and

evolutionary aspects of links in source code comments,

and

• an analysis of the extent to which links in source code

comments are affected by link decay, with all nine link-

ﬁxing pull requests submitted to active open source

projects already merged by the projects’ maintainers.

II. RESEARCH METHOD

In this section, we present our research questions and data

collection methodology, and we introduce the data contained

in our online appendix.

A. Research Questions

The main goal of the study is to gain insights into the pur-

poses, evolution and decay of links in source code comments.

Based on this goal, we constructed seven research questions

to guide our study. We now present each of these questions,

along with the motivation for each.

(RQ1): How prevalent are links in source code comments?

The motivation of RQ1 is to understand whether the use

of links in source code is a common practice in the wild.

Furthermore, we would like to quantitatively explore the

distribution, diversity, and spread of these links across different

types of software projects.

(RQ2): What kind of link targets are referenced in source

code comments?

(RQ3): What purpose do links in source code serve?

RQ2 and RQ3 require a deeper analysis of the repositories,

where we would like to understand the nature and purpose

that the links serve. The key motivation for RQ2 is to identify

the types of link targets that developers are likely to refer

to in source code comments. Furthermore, we would like

to characterize the most common types of linked domains.

The key motivation for RQ3 is to determine the reasons why

developers use links.

(RQ4): How do links in source code comments evolve?

(RQ5): How frequently do link targets referenced in source

code comments change?

(RQ6): How many links in source code comments are dead?

TABLE I: Collected repositories and links

# candidate repo # obtained repo (%) # links

C 2,771 2,482 (90%) 1,602,156

C++ 3,563 3,211 (90%) 1,686,575

Java 4,995 4,472 (90%) 2,925,909

JavaScript 7,130 6,224 (87%) 1,533,219

Python 5,263 4,715 (90%) 413,020

PHP 3,279 2,827 (86%) 1,405525

Ruby 2,233 1,994 (89%) 88,298

sum 29,234 25,925 (89%) 9,654,702

RQ4, RQ5, and RQ6 investigate the phenomenon of links in

source code comments from an evolutionary and maintenance

standpoint. We would like to understand whether developers

are updating or maintaining these links after introducing them

to the source code, whether the targets evolve, and how many

of the links are affected by link decay.

(RQ7): Can we ﬁx dead links in source code comments?

Our aim for RQ7 is two-fold. First, we would like to show

that dead links can be restored and ﬁxed. Second, the results

can serve as a validation that developers care about the

maintenance of links in their code.

B. Data Collection

We now describe our methods for repository preparation,

comment extraction, and link identiﬁcation.

Repository preparation. In this work, we analyzed active

software development repositories on GitHub written in com-

mon programming languages. As common programming lan-

guages, we selected seven languages: C, C++, Java, JavaScript,

Python, PHP, and Ruby. These languages have been ranked

consistently in the top 10 languages on GitHub from 2008

to 2017 (based on the number of repositories from 2008 to

2015 [20], the number of pull requests from 2014 to 2017 [10],

and the number of pull requests in 2017 [13]).

Using the GHTorrent dataset

[16], we collected active

repositories for the seven languages using the following

criteria: (i) having more than 500 commits in their entire

history (the same threshold used in previous work [4]), and

(ii) having at least 100 commits in the most active two

years. We designed the second criterion to remove long-term

less active repositories and short-term projects that have

not been maintained for long (and may not be software

development projects). For example, we were able to exclude

software-engineering-amsterdam/sea-of-ql,

which is a repository of a collaboration space for students

in a particular university course, and was reported as a false

positive of software project identiﬁcation [23]. We determine

repositories’ languages based on the GHTorrent information.

Forked repositories are excluded if repositories are recorded

in GHTorrent as forks of other repositories.

With the above criteria, we prepared the candidate list of

target repositories for the seven languages as shown in Table I.

MySQL database dump 2018-04-01 from http://ghtorrent.org/downloads.

html.

When we collected these candidate repositories (from May to

June 2018), some repositories were not available because they

had been deleted or made private. In total, we obtained more

than 25,000 repositories, which is almost 90% of the candidate

repositories.

Comment extraction. From each Git repository, we extract

source ﬁles of the labeled language in the HEAD commit (the

latest snapshot of a cloned repository). For example, only

.java ﬁles are extracted from a Java repository. To process

source ﬁles, we employ ANTLR4 lexical analyzers for six

languages other than Ruby because their grammar deﬁnitions

are available in the ofﬁcial example repository.

For Ruby, we

use a standard library, Ripper parser.

We extract all single line comments (e.g., // in C) and mul-

tiline comments (/

...

/) according to the grammars. In

the case of Python, string literals (’’’ ... ’’’) are also re-

garded as comments because they include documents (known

as docstring). In the case of PHP, both HTML comments and

PHP code comments are extracted.

Link identiﬁcation. From the extracted comments, links

are identiﬁed using the regular expression /http\S+/ (lo-

calhost and IP addresses, which are mainly used for pri-

vate addresses, are excluded) and validated with the Perl

module Data::Validate::URI. We identiﬁed a total of

9,654,702 links from the collected repositories as seen in

Table I. All links are recorded with the information of the

corresponding ﬁle, repository identiﬁers (pairs of account and

repository names), commit hashes, and the line number where

the surrounding comment starts. Considering the number of

repositories, we found that repositories written in C, C++, and

Java tend to contain more links compared to repositories in

Python and Ruby.

C. Online Appendix

Our online appendix contains our 9,654,702 links associated

with the information of languages and comment location

(GitHub links including account names, repository names,

commit hashes, ﬁle paths, and line numbers). The appendix is

available at https://github.com/NAIST-SE/9.6MillionLinks.

III. FINDINGS

In this section, we present our ﬁndings for each research

question.

A. Prevalence of Links (RQ1)

To understand the prevalence of links referenced in source

code comments (RQ1), we conducted a quantitative analysis

of our collected dataset in terms of link existence, domain

diversity, and domain popularity.

Link existence. Figure 1a show the percentages of reposito-

ries that have at least one link in their source code comments.

We see that, in every language, more than 80% of the

repositories contain links in source code comments. Especially

for repositories written in C, C++, and PHP, more than 90%

of the repositories refer to external sources via links.

https://github.com/antlr/grammars-v4/

Domain diversity. In the obtained 9,654,702 links, there

are 57,039 distinct domains (Internet hostnames). Figure 1b

shows the distribution of the number of distinct domains

per repository, for repositories that have at least one link in

their source code comments. Median values are presented in

the ﬁgure. We found that there is a diversity of links in a

single repository even when summarized by their domains.

Especially in repositories written in C, C++, JavaScript, and

PHP, source code comments link to 10 or more different

domains (median).

Popular domains. Figure 1c illustrates the proportion of

languages shared by the top 10 most referenced domains. Note

that domain ranking is based on the number of repositories

instead of the number of links. If links belonging to a domain

appear in a small number of repositories, the domain will be

low-ranked even if those repositories contain many links.

The github.com domain is the top referenced domain

in our dataset. More than 14,000 repositories across seven

languages referenced content on github.com. As we will

describe in detail in Section III-B, such content includes

software homepage, code, and proﬁle of a GitHub contrib-

utor. However, we ﬁnd in Section III-F that many links

to github.com are no longer available. We also found

many links to code.google.com (7th rank). Such content

includes bug report, software homepage, and code. In a

statistically representative sample of common domains (sam-

pling described in Section III-B), two out of three links to

code.google.com are redirected to github.com, and

one links to code.google.com/archive/.

The stackoverflow.com domain is the second most

referenced domain and has been linked to from 8,189 repos-

itories. As identiﬁed in previous work, Stack Overﬂow is

widely used as a knowledge exchange platform between

programmers [38], where programmers can obtain knowledge

of good practices from code examples [29], [35], for example.

The large number of links to stackoverflow.com in

source code comments can be another piece of evidence of

developers’ needs for knowledge acquisition from external

resources. We study how code could be obsolete by not being

updated when external sources change in Section III-E.

The top domains differ by programming language: The

www.apache.org domain is frequently linked from Java

repositories, and the www.gnu.org domain is referenced

from C and C++ repositories. Repositories written in

JavaScript have many links to the Web-related domains of

www.w3.org and developer.mozilla.org.

Summary: We revealed that links in source code com-

ments are prevalent. In more than 80% of the 25,925

active repositories written in seven common languages,

there exists at least one link in each repository. The top

three most frequently referenced domains per reposi-

tory are github.com, stackoverflow.com, and

en.wikipedia.com.

92% 8%

91% 9%

89% 11%

87% 13%

92% 8%

87% 13%

84% 16%

C++

Java

JavaScript

PHP

Python

Ruby

type

noLink

Link

(a) Ratio of repositories with links

c c++ java javascript php python ruby

(b) Distribution of number of different domains per repository

10.4% 13.1% 16.8% 23.9% 17.5% 11.7% 6.7%

1.4%2.8%2.7% 81.3% 1.6% 8.7% 1.5%

18.5% 13.8% 13.7% 22.1% 9.6% 19.9% 2.4%

18.1% 20.8% 4.5% 39.1% 6% 10.2% 1.2%

8.9% 15% 15.1% 39.9% 8.9% 11.7% 0.7%

21.8% 22.4% 14.2% 8.1% 14.8% 17.6% 1%

6.3% 9.8% 15.1% 42.7% 9.3% 15.1% 1.7%

6.8% 9.2% 37.5% 26.8% 11.1% 6.2%2.3%

14.1% 18.8% 13.8% 24.9% 13.4% 11.8% 3.2%

8% 14.1% 15.3% 30.3% 18.4% 9.8% 4%

8.3% 10.5% 13.2% 31.7% 16.9% 11.2% 8.2%

others

(10) developer.mozilla.org

(9) tools.ietf.org

(8) msdn.microsoft.com

(7) code.google.com

(6) www.gnu.org

(5) www.w3.org

(4) www.apache.org

(3) en.wikipedia.org

(2) stackoverflow.com

(1) github.com

Ruby

PHP

Python

JavaScript

Java

C++

Fig. 1: Analysis of links by (a) languages, (b) domain diversity, and (c) top domains for RQ1

Fig. 2: Distribution of links per domain

B. Link Targets (RQ2)

To understand what kind of link targets are referenced in

source code comments (RQ2), we conducted a qualitative

study of a statistically representative and stratiﬁed sample of

all links in our dataset.

After an initial analysis of the link data, it quickly became

obvious that some domains account for many links while other

domains are rare. Based on this observation and to ensure

diversity of our sample, we divided the data into three strata:

1) links to commonly linked domains,

2) links to domains sometimes linked, and

3) links to rarely linked domains.

To decide on thresholds for distinguishing domains into

those that are commonly, sometimes, and rarely linked, we

conducted a visual analysis of the distribution of links per

domain in our dataset. Figure 2 shows this distribution using

a log scale. While content from the most commonly linked

TABLE II: Construction of the stratiﬁed sample

strata # domains # links # links in sample

common 2,013 9,128,444 384

sometimes 30,851 502,083 384

rare 24,175 24,175 378

sum 57,039 9,654,702 1,146

domain was linked more than a million times, many domains

appeared in our dataset with a much lower frequency. We used

the “step” in the distribution on the left-hand side of Figure 2

to distinguish between domains that are commonly linked and

domains that are sometimes linked, with a cutoff frequency of

230. We consider domains which account for exactly one link

in our dataset to be rarely linked. Table II shows the number of

domains and the number of links in each strata. We then drew

a statistically representative sample from each bucket. The

required sample size was calculated so that our conclusions

about the ratio of links with a speciﬁc characteristic would

generalize to all links in the same bucket with a conﬁdence

level of 95% and a conﬁdence interval of 5.

The calculation of

statistically signiﬁcant sample sizes based on population size,

conﬁdence interval, and conﬁdence level is well established

(ﬁrst published by Krejcie and Morgan in 1970 [19]).

The qualitative analysis was conducted in multiple itera-

tions: in the ﬁrst iteration, the ﬁrst two authors independently

coded 20 links from the sample, discussed a common coding

https://www.surveysystem.com/sscalc.htm

guide, and tested this coding guide on another 20 links from

the sample, reﬁning the guide, merging codes, and adding

codes which had been missed. The initial codes were informed

by those used by Aniche et al. [5] to categorize content posted

on news aggregators, however, we found that their codes did

not cover all types of link targets present in our dataset.

In the second iteration, the four authors of this paper then

independently coded another 30 links from the sample, using

the coding guide designed by the ﬁrst two authors. We then

calculated the kappa agreement of this iteration between all

four raters, for 30 cases and all 19 codes that emerged from

the qualitative analysis.

The kappa agreement was 0.81 or

“almost perfect” [39]. Based on this encouraging result, the

remaining data was then coded by a single author.

The following list shows the 19 codes that emerged from

our analysis along with a short description which was available

in the coding guide:

• 404: link target does not exist (anymore) or cannot be

accessed

• licence: licence of a software project

• software homepage: main web presence of a library or

software project

• speciﬁcation: anything that resembles a requirements doc-

ument or a technical standard

• organization homepage: main web presence of an orga-

nization or company

• other: anything that does not ﬁt the other codes, including

if sign-in is required

• tutorial or article: technical article or tutorial, without

commenting section (blog post otherwise)

• API documentation: documentation of an API element

• blog post: technical content with a commenting section

• application: interactive application (e.g., web application,

online utility)

• bug report: bug report or issue in an online bug/issue

tracker

• research paper: academic paper

• personal homepage: personal homepage of one individual

• code: a source code ﬁle

• forum thread: thread in a forum or entire forum

• GitHub proﬁle: proﬁle of a GitHub contributor

• book content: chapter/section of a book or entire book

• Q&A thread: question-and-answer thread, but not Stack

Overﬂow

• Stack Overﬂow: question-and-answer thread on Stack

Overﬂow

Taxonomy of link targets. Table III shows the result of our

qualitative analysis. For commonly-linked domains, license is

the most common type of link target, accounting for more

than half of the links in our sample, followed by software

homepages, i.e., the main web presence of a library or software

project. For domains that are linked sometimes from source

code comments, the most common type of link target was 404,

a non-existing link target. This is a ﬁrst indicator of the decay

Kappa agreement was calculated using http://justusrandolph.net/kappa/.

TABLE III: Frequency of link target types in our sample

common sometimes rare

404 27 (7%) 122 (32%) 138 (37%)

license 208 (54%) 4 (1%) 1 (0%)

software homepage 55 (14%) 65 (17%) 28 (7%)

speciﬁcation 21 (5%) 33 (9%) 32 (8%)

organization homepage 16 (4%) 41 (11%) 24 (6%)

other 5 (1%) 23 (6%) 45 (12%)

tutorial or article 16 (4%) 21 (5%) 31 (8%)

API documentation 14 (4%) 20 (5%) 10 (3%)

blog post 1 (0%) 10 (3%) 22 (6%)

application 0 (0%) 11 (3%) 13 (3%)

bug report 9 (2%) 10 (3%) 3 (1%)

research paper 0 (0%) 9 (2%) 13 (3%)

personal homepage 4 (1%) 8 (2%) 4 (1%)

code 6 (2%) 2 (1%) 5 (1%)

forum thread 0 (0%) 5 (1%) 6 (2%)

book content 0 (0%) 0 (0%) 2 (1%)

GitHub proﬁle 1 (0%) 0 (0%) 0 (0%)

Stack Overﬂow thread 1 (0%) 0 (0%) 0 (0%)

Q&A thread 0 (0%) 0 (0%) 1 (0%)

sum 384(100%) 384(100%) 378(100%)

of links in source code comments, which we will analyze

in detail in the next sections. Software homepages are also

prevalent, as are organization homepages, both accounting for

more than 10% of all links in our sample. Finally, for links

from domains which are rarely linked, the problem of decay is

even more serious, affecting 37% of the links in this sample.

In other words, we can conclude with a 95% conﬁdence that

between 32 and 42% of all links to domains which are rarely

linked from source code comments are dead or inaccessible.

The prevalence of the code “other” in the results for links to

rarely linked domains is an indicator of the diversity of links

present in source code comments.

Summary: We identiﬁed more than a dozen different

kinds of link targets, with dead links, licenses, and

software homepages being the most prevalent. Dead

links are particularly common for rarely linked do-

mains.

C. Link Purpose (RQ3)

To understand the purpose of links referenced in source code

comments (RQ3) and similar to (RQ2), we again employed

a qualitative analysis of our statistically representative and

stratiﬁed sample of 1,146 links, only this time focusing on

the origin of a link (in a source comment) rather than the

target of the link. We used the same iterative approach to

design a coding guide, and validated the coding guide by

having the four authors code 30 links independently, this

time leading to a kappa agreement of 0.70 which indicates

“substantial” agreement [39]. The somewhat lower agreement

can be explained by the need to extrapolate the purpose of a

link from its context in the source code alone, without being

able to interview the contributor who added the link.

The following list shows all 8 codes that emerged from

our analysis for link purpose, along with a short description

TABLE IV: Frequency of link purposes in our sample

common sometimes rare

metadata 288 (75%) 131 (34%) 43 (11%)

source/attribution 27 (7%) 62 (16%) 75 (20%)

source code context 18 (5%) 60 (16%) 80 (21%)

see-also 28 (7%) 59 (15%) 51 (13%)

commented-out source code 1 (0%) 17 (4%) 70 (19%)

link-only 6 (2%) 24 (6%) 40 (11%)

self-admitted technical debt 11 (3%) 16 (4%) 13 (3%)

@see 5 (1%) 15 (4%) 6 (2%)

sum 384(100%) 384(100%) 378(100%)

which was available in the coding guide. The coding guide was

informed by work on source code comments (e.g., [36]), self-

admitted technical debt (e.g., [28]), and attribution (e.g., [7]).

• metadata: the link relates to the author of the source code,

a related organization, or the license

• source/attribution: the comment explicitly indicates that

the link is a source of some aspect of the source code

(e.g., algorithm)

• source code context: the link adds additional information

to the source code (use this code for things that do not

obviously ﬁt into any of the previous)

• see-also: the comment explicitly indicates that the link

points to additional reading material (usually accompa-

nied by a phrase such as “see also”).

• commented-out source code: the link is part of the source

code, e.g., as a parameter value, but has been commented

out

• link-only: the comment only contains the link

• self-admitted technical debt: bug-related, like

workaround, under development, and so on

• @see: the link is accompanied by “@see”, but no further

explanation

Note that our coding guide required the indicators of see-

also and source/attribution to be explicit, thus reducing the

guesswork required as part of the qualitative analysis.

Taxonomy of link purpose. Table IV shows the results

of the qualitative analysis. For links to commonly linked

domains, providing metadata, e.g., in the forum of licenses

or author information, is by far the most common purpose of

a link, covering three quarters of the links in our sample. For

links to domains which are only sometimes linked, metadata

only accounts for one third of the data, followed by links

included for the purpose of attribution, providing context, or

see-also information. The results for links to rarely linked

domains are even more diverse: we can see from the table

that these links are used for context, attribution, and as part of

the source code functionality (albeit commented out), to name

the top three. Six of the eight codes account for at least 10%

of the links in this part of our sample.

Matching link target with purpose. Based on the quali-

tative analysis conducted for answering RQ2 and RQ3 about

the targets and purposes of links in source code comments,

we are now able to investigate the relationships between the

different types of link targets and the different purposes which

TABLE V: Associations between link target type and link

purpose

strata association rule conf. supp.

common license ⇒ metadata 1.00 208

common metadata ⇒ license 0.72 208

common software homepage ⇒ metadata 0.75 41

common organization homepage ⇒ metadata 0.88 14

common bug report ⇒ satd 0.78 7

common satd ⇒ bug report 0.64 7

common personal homepage ⇒ metadata 1.00 4

sometimes software homepage ⇒ metadata 0.65 42

sometimes organization homepage ⇒ metadata 0.80 33

sometimes personal homepage ⇒ metadata 1.00 8

sometimes license ⇒ metadata 1.00 4

rare personal homepage ⇒ metadata 1.00 4

emerged from our qualitative analysis. To do so, we applied

association rule learning using the apriori algorithm [1] as

implemented in the R package arules

to our data, treating

each link as a transaction containing two items: its target type

and its purpose. We used 4 as threshold for support and 0.7

as threshold for conﬁdence, i.e., all rules that we extracted are

supported by at least four data points and we have at least a

70% conﬁdence that the left hand side of the rule implies the

right hand side.

Table V shows the association rules extracted from our data

with these settings, separately for each stratum in our sample.

Unsurprisingly, the link target type license and the purpose

of providing metadata are tightly connected, in particular for

links referring to commonly linked domains. In fact, all links

to licenses were found to have been included for the reason

of providing metadata, and 72% of the metadata is license

information. Links to software, organization, and personal

homepages are also associated with metadata, across all strata.

Although with a relatively low support of seven instances, it

is also interesting to note the tight coupling of the link target

type bug report and the purpose of admitting technical debt.

Summary: We identiﬁed different purposes for the

inclusion of links in source code comments, with

providing metadata and attribution being the most

common. Links are also included for background in-

formation, to provide context, or to admit technical

debt. In some cases, the link is part of source code

which has been commented out.

D. Link Evolution (RQ4)

To understand how links evolve (RQ4), we investigated the

revision histories of repositories in the samples from (RQ2).

For each sample link, we searched an old version of the link

that has been revised by a commit that introduced the link.

We extracted such a commit introducing a link by using the

git log command (-S option with tracking ﬁle renaming). We

searched http(s) links removed from the code location where

https://cran.r-project.org/web/packages/arules/index.html

TABLE VI: Link Evolution Types

common sometimes rare

license 12 (50%) 1 (3%) 1 (3%)

organization 7 (29%) 10 (32%) 3 (9%)

https 2 (8%) 1 (3%) 5 (15%)

content move 0 (0%) 2 (6%) 6 (18%)

content update 1 (4%) 3 (10%) 4 (12%)

content change 0 (0%) 6 (19%) 2 (6%)

other 2 (8%) 8 (26%) 12 (36%)

sum 24 (100%) 31 (100%) 33 (100%)

the sample link has been added. We identiﬁed 88 revised links

out of 1,146 samples, including 24 (6.3%) in common, 31

(8.1%) in sometimes, and 33 (8.7%) in rare. Less than 10%

of the links had been revised in each strata, that is, most of

the links have never been updated.

We manually analyzed the old and new paths of the links

and identiﬁed the following evolution types:

• license replacement: a new link refers to a new software

license. For example, a link to GNU GPL has been

replaced with a link to the Apache License.

• organization update: a project or an organization changed

its name or website. For example, a project that acquired

their own domain updated links to their project website.

• change to https: a new link uses HTTPS instead of HTTP

for the same location as the previous link.

• content move: a new link refers to a slightly different

location (e.g. the same path on a different server, the same

document name on a different wiki), which is likely the

same content.

• content update: a new link refers to different content from

the previous link, but the new content is likely updated.

For example, the Apache Jackrabbit project replaced a

link pointing to a draft version of a document

with a

link to an RFC version.

• content change: a new link refers to relevant but different

content from the previous link. For example, the Pi4J

project replaced a link related to the usage of a serial

port of Raspberry Pie

with another similar document.

• other: we could not identify types for some links whose

contents are no longer available. It should be noted that

the contents for 20 updated links are 404 Not Found.

Reasons for link evolution. Table VI shows the numbers

of link evolution in the three strata. For commonly-linked

domains, license replacement and updating organizational

information account for about 80% of link revisions. For

domains sometimes linked, organization update is the most

common, followed by other and content change. For rarely-

http://greenbytes.de/tech/webdav/draft-ietf-webdav-bind-21.html

http://greenbytes.de/tech/webdav/rfc5842.html

http://www.irrational.net/2012/04/19/using-the-raspberry-pis-serial-port/

https://www.cube-controls.com/2015/11/02/

disable-serial-port-terminal-output-on-raspbian/

TABLE VII: Link target evolution within a 10-day timeframe

evolution # of links

no content returned on both dates 112

no changes between the two dates 879

content not available anymore 6

auto-generated changes (e.g., # of visitors) 125

content changed 7

“latest” updated 7

design change (e.g., new banner added) 6

different error (e.g., 502 ⇒ 301) 3

page title changed 1

sum 1,146

linked domains, links are revised for content move, content

update, and change to https.

Summary: Links are rarely updated (less than 9%).

Common modiﬁcations are updating licenses and or-

ganization homepages.

E. Link Target Evolution (RQ5)

After understanding the evolution of links, our next research

question (RQ5) asks about the evolution of their targets. To

investigate whether link targets referenced in source code

comments evolve, we attempted to download all link targets

in our sample of 1,146 links using the curl command with a

timeout of 60 seconds. As already discussed as part of (RQ2),

not all link targets are available. We were able to download a

total of 1,034 link targets (90%). We then repeated the same

download process exactly ten days later, to see how many of

the link targets had changed within this short time frame and

what kind of changes had happened.

Changes to the link target. Table VII summarizes the

results of this analysis: out of the 1,034 link targets for which

curl returned a result, 879 (85%) had not changed at all in the

ten-day time frame (the downloaded content was exactly the

same, as per the Windows ﬁle compare tool fc). We manually

analyzed the 155 cases in which the content had changed

by opening both versions in a web browser and conducting

a visual comparison. The majority of the changes in the

remaining 15% can be attributed to automatically generated

changes, such as the display of a visitor count or the current

date in a footer.

However, a non-negligible number of link targets underwent

more signiﬁcant changes in the ten-day time window: For

six links for which we were able to retrieve data on the

ﬁrst download date, there was no content available anymore

ten days later. For three links which had displayed an error

message when we ﬁrst attempted to download their content,

the speciﬁc error message changed. Some link targets changed

their website design, and for a few links, the content changed.

For example, the download page of TaskWarrior

included the

following notice when we ﬁrst downloaded its content: “(For

https://taskwarrior.org/download/

those of you wishing to build task from source on Cygwin,

you will need some components installed (make, g++/clang,

GnuTLS, libuuid, libreadline), but don’t forget - task is a

standard part of the Cygwin distribution, so you do not need

to build from source, unless you want the latest development

snapshot).” Ten days later, this notice was replaced with:

“(Please note, that Cygwin is not supported anymore. Please

use the Windows Subsystem for Linux to use Taskwarrior on

Windows).” We argue that this kind of change is relevant to

software developers.

Stack Overﬂow case study. To investigate this phenomenon

in more detail, we conducted a case study with the subset of

links pointing to Stack Overﬂow. As seen in Section III-A,

stackoverflow.com is the second most referenced do-

main.

In all 9,654,702 obtained links, there are 32,197 links

belonging to stackoverflow.com. Among those Stack

Overﬂow links, there are varieties of expressions: an abbrevi-

ated path to an answer (/a/(answer id)), an abbreviated

path to a question (/q/(question id)), and a full path to

a question (/questions/(question id)/(title)).

Older links start with ‘http://’ and newer links start with

‘https://’. For each Stack Overﬂow link, we identiﬁed the

timestamp of when the link was added to a repository by

using the same git log command (-S option with tracking

ﬁle renaming) used in Section III-D. For duplicate links, we

consider only the oldest timestamp. Consequently, we obtained

a list of 11,464 distinct links with their timestamps.

We then made use of the SOTorrent dataset [8] to investigate

the extent to which Stack Overﬂow content had changed since

the link to the question or answer had been added to a source

code comment in a Git repository. We created a statistically

representative sample of 372 links from the population of all

unique links to Stack Overﬂow content in our dataset, and we

queried SOTorrent to determine the following metrics for each

link:

• the number of text edits on any post (i.e., question or

answer) in the same thread,

• the number of new comments on any post (i.e., question

or answer) in the same thread,

• the number of new answers in the same thread, and

• the number of edits to the thread title.

Thread updates. Figure 3 shows the results of this analysis.

More than half of all Stack Overﬂow threads had at least one

change made to the text of a question or answer in the same

thread (median: 1, third quartile: 3) after they were committed

to a Git repository as part of a source code comment, and more

than half of these links attracted at least one new comment in

the meantime (median: 2, third quartile: 7). While the number

of new answers to a thread was zero in the median case, a

quarter of the Stack Overﬂow threads attracted at least 2 new

answers after the link was added in a source code comment

(median: 0, third quartile: 2). In total, only 91 (24%) of the

372 Stack Overﬂow threads in our sample did not undergo any

changes after they were added to a Git repository.

Text Edits Comments Answers Title Edits

0 20 40 60 80 100

Text Edits Comments Answers Title Edits

0 20 40 60 80 100

Fig. 3: Number of changes to the linked Stack Overﬂow

threads after they were ﬁrst referenced

The most extreme example in our sample—a Stack Over-

ﬂow thread titled “How do you split a list into evenly sized

chunks?”

—had 45 new answers added (out of a total of 58)

and attracted 103 new comments, 50 edits, and one change to

the thread title after being ﬁrst referenced in a source code

comment in early September 2010.

Summary: We found that even within a short ten-day

time window, a non-negligible portion of link targets

referenced in source code comments evolve, in some

cases adding or modifying pertinent information. In

our case study on links pointing to Stack Overﬂow,

we found that more than three quarters of all Stack

Overﬂow threads linked in source code comments

attracted at least one change (edit, new answer, or new

comment) after being ﬁrst referenced in a source code

comment.

F. Link Decay (RQ6)

Among the obtained 9,654,702 links, there are 382,650

distinct links. To investigate the amount of dead links in source

code comments (RQ6), we accessed all Web contents from the

382,650 unique links by using the Perl module LWP.

Link retrieval responses. Table VIII shows obtained HTTP

Status Codes for all unique links. More than 80% of the

links were successfully reachable including redirecting to new

locations. Although some HTTP status codes do not clearly

mean links are dead, we found that about 9% of the links

are not available now (404 Not Found). Among these 404

links, we saw some patterns: Internet host lost, Web content

move/deletion, fake URL (for showing examples of parameters

or data structures), and link typo. The domain with the largest

number of 404s is github.com, with 3,346 links no longer

available or pointing to private repositories.

https://stackoverﬂow.com/q/312443

We used LWP::UserAgent and LWP::RobotUA.

TABLE VIII: HTTP Status Codes for All Unique Links

status # links (%)

2xx success 310,592 (81.2%)

404 not found 34,689 (9.1%)

500 internal server error 18,955 (5.0%)

405 method not allowed 10,104 (2,6%)

403 forbidden 4,068 (1.0%)

others 4,242 (1.1%)

sum 382,650 (100%)

Summary: We found that 9.1% of the link targets are

not available (404 Not Found), considering all unique

links.

G. Fixing Dead Links (RQ7)

To ﬁx dead links (RQ7), we collected ﬁxable dead links and

submitted pull requests to ﬁx. We select dead links that are not

metadata (need multiple ﬁles to be ﬁxed) nor commented-out

source code. Personal blog articles were avoided because they

tend to be no longer available. Consequently we obtained 14

dead links to API documentation, research papers, and so on.

After checking the original content in Wayback Machine

we manually investigated new links by searching speciﬁc

keywords in the original content. Our ﬁxing process included

ﬁrst forking a personal copy of the project, ﬁxing the link, and

then later submission of a pull request to the project.

Pull request results. Developers showed they cared about

dead links by accepting to all nine pull requests.

15 16 17 18 19

20 21 22 23

Since the link itself is a comment, we speculate that it has

almost no conﬂicts with existing code, so our pull requests

are likely to pass all tests and to be merged immediately.

Developers responded with comments such as “LGTM (looks

good to me)” and “Thanks for spotting the broken link”.

Overall, the responses from developers provide sufﬁcient

motivation for tool support to assist with ﬁxing broken links.

We argue that such comments indicate that developers are

concerned with keeping their links alive.

Summary: Developers generally responded positively

to the request to ﬁx dead links. All nine responsive

projects accepted our pull requests to ﬁx dead links.

https://web.archive.org/

https://github.com/mockingbirdnest/Principia/pull/1905

https://github.com/sveawebpay/php-integration/pull/82

https://github.com/shirasagi/shirasagi/pull/2289

https://github.com/onepercentclub/bluebottle/pull/3372

https://github.com/cms-sw/cmssw/pull/24370

https://github.com/BrowserSync/browser-sync/pull/1593

https://github.com/ShipSoft/FairShip/pull/133

https://github.com/BlackToppStudios/Mezzanine/pull/182

https://github.com/onepercentclub/bluebottle/pull/3372

IV. RECOMMENDATIONS

Our ﬁndings can be summarized into recommendations for

developers and researchers.

Recommendations for software developers including links

in source code comments are:

• Try referencing permanent links, as it is reported that

more than 30% of links will not work after a 4 year

period [18]. Referencing research papers with DOI is

preferable instead of researchers’ personal Web pages.

Explicitly mentioning tags or commit hashes to ref-

erenced code in GitHub would be recommended, as

software structure can be changed (we found many dead

links to GitHub in Section III-F).

• Check link targets for new information on a regular basis,

as referenced external resources can be considered to be

software documentation to support comprehension and

maintenance activities. In addition, link target updates can

be triggers of improving and updating code (as seen in

Section III-E).

We can also consider future work with the following pos-

sible challenges.

• Further understanding of external sources. We found

many sources as shown in Figure 1c and Table III.

Although some sources have been already studied, for

example, licenses [12], self-admitted technical debt [28],

and Stack Overﬂow [38], other sources have not been

well-studied with regard to their impact and inﬂuence

on software development, such as research papers and

Wikipedia articles.

• Further studies of source code comments to understand

how knowledge (related to knowledge-based theory [44]

and human capital [26], [40]) is summarized and shared

via source code comments. Further analyses of source

code comment contents [27] would be required.

• Tool support for external source referencing, tracking,

and updating. Although we recommend developers to

maintain links and associated code, it is not always pos-

sible. Tools or systems to help developers ﬁx link issues

and maintain code automatically could be practically

useful.

V. THREATS TO VALIDITY

Threats to the construct validity exist in our approach to

link identiﬁcation. Since we identiﬁed links per line in source

code comments, links located across multiple lines cannot be

extracted. Note that we did not encounter any such multiple-

line links in our representative sample of 1,146 links. Hence

we consider that the impact of incorrect link identiﬁcation

because of multiple-line links is small.

Threats to the external validity exist in our repository prepa-

ration. Although we analyzed a large amount of repositories on

GitHub, we cannot generalize our ﬁndings to industry nor open

source projects in general; some open source repositories are

hosted outside of GitHub, e.g., on GitLab or private servers.

To mitigate threats to reliability, we prepared an online

appendix of our 9,654,702 links with associated information

(see Section II-C).

VI. RELATED WORK

We have discussed complementary work throughout the

paper in the relevant sections; here, we discuss literature

related to web resources and source code comments.

One of the most related studies is the one by Xia et al.

[43]. They investigated what developers search for on the Web,

and found that developers search for explanations of unknown

terminology, explanations for exceptions/error messages (e.g.,

HTTP 404), reusable code snippets, solutions to common

programming bugs, and suitable third-party libraries/services.

Furthermore, they found that searching for solutions to perfor-

mance bugs, solutions to multi-threading bugs, public datasets

to test newly developed algorithms or systems, reusable code

snippets, best industrial practices, database optimization so-

lutions, solutions to security bugs, and solutions to software

conﬁguration bugs are the most difﬁcult search tasks that

developers consider.

Many researchers have made use of code comments in their

work. Tan et al. [37] automatically identify bugs by analyzing

inconsistencies between code and comments. Ratol and Robil-

lard [32] used code comments to assist refactoring activities.

Wong et al. [41] used code comments to map source code

and Stack Overﬂow content. German et al. [12] developed

the ninka tool that automatically identiﬁes a software license

in code comments. Goldman and Miller [14] developed the

tool CodeTrail, that demonstrates how the developer’s use of

web resources can be improved by connecting the Eclipse

integrated development environment (IDE) and the Firefox

web browser.

Self-admitted technical debt is a commenting activity that

has been well-studied in recent years [28]. Maldonado et

al. [21] and Zampetti et al. [45] studied the removal of

self-admitted technical debt based on the modiﬁcation of

comments. Our ﬁnding of referencing bug reports for self-

admitted technical debt could be another opportunity to study

development activities around technical debt.

There are also studies which analyze link sharing occurring

in other software artifacts. Gomez et al. [15] investigated

link sharing on Stack Overﬂow to gain insights into how

software developers discover and disseminate innovations.

Rath et al. [31] investigated links to issue tracking systems

in commit comments. They reported that developers often do

not provide external links to issues. They evaluated several

methods to automatically recover links by searching issues

related to a given commit. Alqahtani et al. [3] proposed a tool

to automatically link dependent components in a system to

online resources for analyzing their vulnerabilities. Chen et

al. [11] proposed a tool to link problematic source code to

relevant Stack Overﬂow questions using similarity of source

code fragments.

Traceability links between source code and documents is

another related research topic. Scanniello et al. [34] reported

that developers can understand source code effectively if they

can refer to design models including source code element

names. Their observation has been obtained through a con-

trolled experiment of program comprehension tasks with UML

models produced in a requirements engineering phase and a

design phase. Antoniol et al. [6] proposed a method to identify

links between source ﬁles and design documents because

developers may update source ﬁle names without updating

related documents. Their method uses similarity of attribute

names of a class to identify its original class deﬁnition in

design documents. Rahimi et al. [30] proposed a rule-based

method to update links between source ﬁles and requirements

documents. Their method recognizes a change scenario from

semantic differences of source code and then updates links

according to a rule corresponding to the change scenario.

Those methods would be effective to automatically update

traceability links. Similar tool support for external source

referencing is a future direction of our research.

However, none of the related work provides a comprehen-

sive study of the role of links in source code comments, which

is the goal of this paper.

VII. CONCLUSION

To understand purposes, evolution, and decay of links in

source code comments, we conducted (i) a quantitative study

of 9,654,702 links from source code comments in 25,925 Git

repositories to establish the prevalence of links in source code

comments; (ii) a qualitative study of a stratiﬁed sample of

1,146 links to determine the kinds of link targets and purposes

for including links present in our dataset; (iii) a quantitative

and qualitative study to investigate the evolution of links in

source code comments and their targets; and (iv) a quantitative

study to determine the extent to which links in source code

comments are affected by link decay.

Our work has shown that links in source code comments

indeed suffer from decay, from insufﬁcient versioning (when

link targets evolve), and from lack of bidirectional traceability

(which could help avoid decay). Based on this work which has

established the prevalence of links in source code comments,

their multiple purposes and targets, issues of decay, and

practical needs of ﬁxing dead links, there are many open

avenues for future work: understanding the role of external

sources for software development, further studies of source

code comments, and tool support for external source referenc-

ing, to name a few.

ACKNOWLEDGMENT

We thank the respondents to our pull requests for their

availability and Sebastian Baltes for his support in querying

the SOTorrent dataset. This work has been supported by JSPS

KAKENHI Grant Numbers JP16H05857, JP17H00731, and

JP18KT0013 as well as the Australian Research Council’s

Discovery Early Career Researcher Award (DECRA) funding

scheme (DE180100153).

REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules

in large databases. In Proceedings of the International Conference on

Very Large Data Bases, pages 487–499, 1994.

[2] M. Aiello. The Web Was Done by Amateurs: A Reﬂection on One of

the Largest Collective Systems Ever Engineered. Springer, 2018.

[3] S. S. Alqahtani, E. E. Eghan, and J. Rilling. Recovering semantic trace-

ability links between APIs and security vulnerabilities: An ontological

modeling approach. In Proceedings of the International Conference on

Software Testing, Veriﬁcation and Validation, 2017.

[4] M. Aniche, G. Bavota, C. Treude, M. A. Gerosa, and A. Deursen.

Code smells for model-view-controller architectures. Empirical Software

Engineering, 23(4):2121–2157, 2018.

[5] M. Aniche, C. Treude, I. Steinmacher, I. Wiese, G. Pinto, M.-A. Storey,

and M. A. Gerosa. How modern news aggregators help development

communities shape and share knowledge. In Proceedings of the Inter-

national Conference on Software Engineering, pages 499–510, 2018.

[6] G. Antoniol, B. Caprile, A. Potrich, and P. Tonella. Design-code

traceability for object-oriented systems. Annals of Software Engineering,

9(1-4):35–58, 2000.

[7] S. Baltes and S. Diehl. Usage and attribution of stack overﬂow code

snippets in github projects. Empirical Software Engineering, pages 1–

37, 2018.

[8] S. Baltes, L. Dumani, C. Treude, and S. Diehl. Sotorrent: Reconstructing

and analyzing the evolution of stack overﬂow posts. In Proceedings of

the International Conference on Mining Software Repositories, pages

319–330, 2018.

[9] J. M. Barrie and D. E. Presti. Digital plagiarism-the web giveth and the

web shall taketh. Journal of medical Internet research, 2(1), 2000.

[10] F. Beuke. Github Language Statistics – GitHut 2.0. https://madnight.

github.io/githut/. [Online; accessed Aug 2018].

[11] F. Chen and S. Kim. Crowd debugging. In Proceedings of the Joint

Meeting on the Foundations of Software Engineering, 2015.

[12] D. M. German, Y. Manabe, and K. Inoue. A sentence-matching method

for automatic license identiﬁcation of source code ﬁles. In Proceedings

of the International Conference on Automated Software Engineering,

pages 437–446, 2010.

[13] GitHub. The State of the Octoverse 2017. https://octoverse.github.com/,

2017. [Online; accessed Aug 2018].

[14] M. Goldman and R. C. Miller. Codetrail: Connecting source code and

web resources. Journal of Visual Languages & Computing, 20(4):223–

235, 2009.

[15] C. G

omez, B. Cleary, and L. Singer. A study of innovation diffusion

through link sharing on stack overﬂow. In Proceedings of the Working

Conference on Mining Software Repositories, pages 81–84, 2013.

[16] G. Gousios. The ghtorent dataset and tool suite. In Proceedings of the

Working Conference on Mining Software Repositories, pages 233–236,

2013.

[17] M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva,

K. Zhou, and R. Tobin. Scholarly context not found: one in ﬁve articles

suffers from reference rot. PloS one, 9(12):e115253, 2014.

[18] W. Koehler. Web page change and persistence—a four-year longitudinal

study. Journal of the American Society for Information Science and

Technology, 53(2):162–171, 2002.

[19] R. V. Krejcie and D. W. Morgan. Determining sample size for research

activities. Educational and psychological measurement, 30(3):607–610,

1970.

[20] A. La. Language Trends on GitHub – The GitHub Blog. https:

//blog.github.com/2015-08-19-language-trends-on-github/, 2015. [On-

line; accessed Aug 2018].

[21] E. D. S. Maldonado, R. Abdalkareem, E. Shihab, and A. Serebrenik.

An empirical study on the removal of self-admitted technical debt. In

Proceedings of the International Conference on Software Maintenance

and Evolution, pages 238–248, 2017.

[22] J. Markwell and D. W. Brooks. Broken links: The ephemeral nature

of educational www hyperlinks. Journal of Science Education and

Technology, 11(2):105–108, 2002.

[23] N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan. Curating github

for engineered software projects. Empirical Software Engineering,

22(6):3219–3253, 2017.

[24] J. Murphy, N. H. Hashim, and P. OConnor. Take me back: validating

the wayback machine. Journal of Computer-Mediated Communication,

13(1):60–75, 2007.

[25] T. H. Nelson. Xanalogical structure, needed now more than ever: parallel

documents, deep links to content, deep versioning, and deep re-use. ACM

Computing Surveys, 31(4es):33, 1999.

[26] S. Onoue, H. Hata, R. G. Kula, and K. Matsumoto. Human capital in

software engineering: A systematic mapping of reconceptualized human

aspect studies. CoRR, abs/1805.03844, 2018.

[27] L. Pascarella and A. Bacchelli. Classifying code comments in java open-

source software systems. In Proceedings of the International Conference

on Mining Software Repositories, pages 227–237, 2017.

[28] A. Potdar and E. Shihab. An exploratory study on self-admitted technical

debt. In Proceedings of the International Conference on Software

Maintenance and Evolution, pages 91–100, 2014.

[29] S. Radevski, H. Hata, and K. Matsumoto. Towards building api usage

example metrics. In Proceedings of the International Conference on

Software Analysis, Evolution, and Reengineering, pages 619–623, 2016.

[30] M. Rahimi and J. Cleland-Huang. Evolving software trace links between

requirements and source code. Empirical Software Engineering, 2017.

[31] M. Rath, J. Rendall, J. L. C. Guo, J. Cleland-Huang, and P. Mder.

Traceability in the wild: Automatically augmenting incomplete trace

links. In Proceedings of the International Conference on Software

Engineering, 2018.

[32] I. K. Ratol and M. P. Robillard. Detecting fragile comments. In

Proceedings of the International Conference on Automated Software

Engineering, pages 112–122, 2017.

[33] M. P. Robillard, A. Marcus, C. Treude, G. Bavota, O. Chaparro, N. Ernst,

M. A. Gerosa, M. Godfrey, M. Lanza, M. Linares-Vsquez, G. C.

Murphy, L. Moreno, D. Shepherd, and E. Wong. On-demand developer

documentation. In Proceedings of the International Conference on

Software Maintenance and Evolution, pages 479–483, 2017.

[34] G. Scanniello, C. Gravino, M. Genero, J. A. Cruz-Lemus, G. Tortora,

M. Risi, and G. Dodero. Do software models based on the UML aid in

source-code comprehensibility? aggregating evidence from 12 controlled

experiments. Empirical Software Engineering, 2018.

[35] J. Sillito, F. Maurer, S. M. Nasehi, and C. Burns. What makes a good

code example?: A study of programming q&a in stackoverﬂow. In

Proceedings of the International Conference on Software Maintenance,

pages 25–34, 2012.

[36] M.-A. Storey, J. Ryall, R. I. Bull, D. Myers, and J. Singer. Todo or

to bug. In Proceedings of the International Conference on Software

Engineering, pages 251–260, 2008.

[37] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /*icomment: Bugs or bad

comments?*/. In Proceedings of the Symposium on Operating Systems

Principles, pages 145–158, 2007.

[38] C. Treude, O. Barzilay, and M.-A. Storey. How do programmers ask

and answer questions on the web? (nier track). In Proceedings of

the International Conference on Software Engineering, pages 804–807,

2011.

[39] A. J. Viera, J. M. Garrett, et al. Understanding interobserver agreement:

the kappa statistic. Fam Med, 37(5):360–363, 2005.

[40] C. Wohlin, D.

Smite, and N. B. Moe. A general theory of software en-

gineering: Balancing human, social and organizational capitals. Journal

of Systems and Software, 109:229–242, 2015.

[41] E. Wong, J. Yang, and L. Tan. Autocomment: Mining question and

answer sites for automatic comment generation. In Proceedings of the

International Conference on Automated Software Engineering, pages

562–567, 2013.

[42] Y. Wu, Y. Manabe, T. Kanda, D. M. German, and K. Inoue. Analysis

of license inconsistency in large collections of open source projects.

Empirical Software Engineering, 22(3):1194–1222, 2016.

[43] X. Xia, L. Bao, D. Lo, P. S. Kochhar, A. E. Hassan, and Z. Xing. What

do developers search for on the web? Empirical Software Engineering,

22(6):3149–3185, 2017.

[44] M. Zahedi and M. A. Babar. Why does site visit matter in global

software development: A knowledge-based perspective. Information and

Software Technology, 80:36–56, 2016.

[45] F. Zampetti, A. Serebrenik, and M. Di Penta. Was self-admitted technical

debt removal a real removal?: An in-depth perspective. In Proceedings

of the International Conference on Mining Software Repositories, pages

526–536, 2018.