String Analysis:
Techniques and
Applications
Lu Zhang
1
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
String Constraint Solver
Applications
Database Applications
Web Applications
Software Internationalization
...
2
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
String Constraint Solver
Applications
Database Applications
Web Applications
Software Internationalization
...
3
Basic Concepts
String Variables
In strongly typed languages (e.g., Java), String
Variables are variables in the program with a string
type.
str in String str;
In weakly typed languages (e.g., PHP), String
Variables are variables that may be assigned a
string value.
$str in $str = “abc”;
4
Basic Concepts
String Constants
A sequence of characters within a pair of double
quotation
String operations
String operations are library functions that takes
several string variables as inputs and output a string
variable (i.e., String.length() is usually not
considered a string operation)
String 1
String n
String
Operation
Output String
5
Basic Concepts
Common string operations
Concatenation
x = a + b;
Replace
x = a.replace (“a”, “b”);
Substring
x = a.substring(3,5);
Tokenize
x = a.nextToken();
6
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
7
Basic String Analysis
Purpose
Approximately estimate the possible values of a
certain string variable in a program
Hot Spot
A hot spot is a certain occurrence O of a certain
string variable v in the source code, the possible
values of the string variable v at the occurrence O
require to be estimated.
8
01 String str = “abc”
02 if(x>5){
03 str = str + “cd”
04 }
05 System.out.println(str) <- Hot Spot
Possible value of variable str at 05: “abc”, “abccd”
String variable with finite possible values
Basic String Analysis
9
Basic String Analysis
String variable with infinite possible values
01 String str = “|
02 while(x<readNumber()){
03 str = str + “a”+”|”;
04 x++;
05 }
06 System.out.println(str) <- Hot Spot
Possible value of variable str at 06: “|”, “|a|”, “|a|a|”…
10
Techniques
How to deal with infinite possible values?
Using formal languages to represent the set of
possible values
Two options
Automaton (Regular Grammar) Based String Analysis
CFG Based String Analysis
11
Automaton Based String Analysis
Use an automaton M to represent the
possible values of a hot spot
The set of strings that the automaton M
accepts is a super set of the possible values
of a hot spot
Proposed by Christensen et al. from
University of Aarhus, Denmark in 2003
12
Automaton Based String Analysis
Steps
Extract String Flow Graph from the source code of
the need-to-analyze program
Transform the String Flow Graph to a Context Free
Grammar G with string operations
Calculate the automaton approximation Linear
Grammar of G
Use automaton transformations to represent string
operations, and construct automaton M for the linear
grammar
13
Running Example
public class Tricky{
static String bar (int k, String op) {
if (k==0) return "";
return op+bar(k-1,op)+“]”;
}
static String foo (int n) {
String b = “”;
for (int i=0; i<n; i++) b = b + “(”;
String s = bar(n-1,"*");
return b + s.replace(']',')');
}
public static void main (String args[]) {
String hot = foo(Integer.parseInt(args[0]));
}
}
Hot Spot
Output:
n n-1 n-1
((…(**…*))…)
14
Extracting String Flow Graph
Transform the source code to SSA form
Static Single Assignment form of a program make
sure that each variable is assigned once in the code
Example:
x=“ab”;
x=“cd” + x;
x=“ab”;
x1=“cd” + x;
b = “”
for(i=1:n)
b = b + “(”;
b=“”;
b1=φ(b, b2);
b2= b1 + ”(”;
15
Extracting String Flow Graph
Extracting String Flow Graph graph from SSA
Form F
Rules:
A string variable in F A node in graph
A string assignment in F → An edge in graph
A string operation in F → An operation in
graph
16
String Flow Graph of the Running
Example
X1(hot)
Concat
X2(b1)
X3(b)
X4(b2)
Concat
“”
(
X5(s)
Replace(],))
X6(Return
@bar)
“”
Concat
*
Concat
]
17
Transform String Flow Graph to Context
Free Grammar with operations
Rules:
A node in graph → A Non-Terminal in Grammar G
An edge in graph → A production in Grammar G
A concat operation in graph → A concatenation at the
right hand side of a production
Other operations in graph → An operation at the right
hand side of a production
The node for hot spot in graph → The start Non-
Terminal of Grammar G
18
Context Free Grammar with
operations of the running example
Non-Terminal set: {X1, X2, X3, X4, X5, X6}
Terminal set: {*, (, ], )}
Start Non-Terminal: X1
Productions:
X1→X2X5.replace(],)) X2 →X3 | X4
X3 → X4 →X2( X5 →X6
X6 → | *X6]
19
Normalize the grammar
X1→X2X6
X2 → X11|X2X7
X7 → (
X6 → X5.replace(],))
X5 → X11|X8X10
X8 →X9X5
X9 →*
X10 →]
X11 →
X1→X2X5.replace(],))
X2 →X3 | X4
X3 →
X4 →X2(
X5 →X6
X6 → | *X6]
20
Automaton approximation of the
grammar
Analyze cycles in productions
X1→X2X6
X2 → X11|X2X7
X7 → (
X6 → X5.replace(],))
X5 → X11|X8X10
X8 →X9X5
X9 →*
X10 →]
X11 →
Right generating, can be exactly
represented by an automaton
Both left and right generating
Called non-regular component
Cannot be exactly
represented by an automaton
21
Removing non-regular
components
Mohri - Nederhof Algorithm
Rules: for each non-terminal A in non-regular component M
Do:
A -> X => A ->X A'
A->B => A->B, B'->A'
A->X Y => A->R A', R->X Y
A->X B => A->X B, B'->A'
A->B X => A->B, B'->X A'
A->B C => A->B, B'->C, C'->A'
A->reg => A -> R A', R->reg
B and C represents non-terminals in M
X and Y represents non-terminals out of M
R is a newly added non-terminal
22
Regular approximation of the
running example
Non-regular component:{X5, X8}
X1→X2X6 X2 → X11|X2X7 X7 → (
X6 → X5.replace(],))
X5 → X11X5'
X5 → X8
X8'→ X10X5
X8 →X9X5
X5'→X8'
X9 →*
X10 →]
X11 →
Left generating Now!
23
Dealing with string operations
Build an automaton transformation for each string
operation
For example: replace(],)) can be represented by
replace all the transition labels „]‟ in the input
automaton to „)‟
Transformations can be automatically built
according to the parameters of the operation
24
Construct the automaton
Building the automaton using the Topological
sorting algorithm
First of all, build automatons for the non-terminals that deduce
only terminals. If a non-terminal has an automaton built, we call it
a free non-terminal
Then, build automatons for the non-terminals that deduce only
free non-terminals, and repeat this step
If a non-terminal is involved in a left-generating or right-
generating component, use the classical algorithm to convert the
whole component to an automaton
If a non-terminal is an input of a string operation, use the
transformation of the operation to calculate the output
25
Problems
String operations in a cycle
How to deal with the case below?
Current technique cannot handle it, use the closure
of the character set of X5 as the approximation
X5 X5. replace(],))
X5 {*, )}*
26
CFG Based String Analysis
Context Free Grammar is more expressive
than Automatons
So it is more precise to use CFG to estimate
the possible values of a hot spot
Proposed by Minamide from University of
Tsukuba, Japan, 2005
27
Similarity & Difference
Similarity
Transform the source code to SSA form
Extract String Flow Graph from the SSA form
Transform the String Flow Graph to a CFG with
operations
Difference
Do not calculate the regular approximation
Use FST (Finite State Transducer) instead of
automaton transformations to represent string
operations
28
CFG Based String Analysis
Steps
Generate the CFG with operations
Resolve the string operations in the CFG
using the CFG-FST intersection algorithm
29
Finite State Transitor
Finite State Transducer (FST) is a Finite State
Automaton with output
For each Transition, an FST not only accept a
character, but also output one or more characters
An example:
S1
S2
start
S3
0/1
1/11
0/0
30
FST for string operations
Use FSTs to simulate string operations
Replace
Replace(“ab”, “cd”)
Trim
S1
S2
start
a/
X/X
(X=Σ-b)
b/cd
X/X
(X=Σ-a)
S3
/a
S1
S2
start
space/
S3
space/
X/X
(X=Σ)
X/X
(X=Σ-space)
space/
space
X/X
(X=Σ-space)
31
FST for string operations
Tokenize (explode)
Transform one string operation to two operations
FST for getToken FST for removeToken
S1
S2
start
delim/
X/
(X=Σ)
X/X (X=Σ-
delim)
String str = tokens.nextToken()
String str1 = str.getToken()
String str2 = str.removeToken()
S1
S2
start
delim/
X/X
(X=Σ)
X/
(X=Σ-delim)
32
FST for string operations
Substring
substring(1,2)
S1
start
S2
S3
X/
(X=Σ)
X/
(X=Σ)
X/X
(X=Σ)
33
CFG-FST intersection
Given a CFG G, and a FST T, try to calculate
a CFG G’ , satisfying that:
x G
T (x) G’
, in which x is any string, and T (x) is the output of
T with x as input
34
CFG-FST Intersection Algorithm
Transform the CFL to Chomsky Normal Form (the
right hand sides of all productions contain only
two non-terminals) e.g., S->ABC => S->DC, D-
>AB
For each pair of states in the FST, add an empty
generating non-terminal set
S1 S2a/b
{}
{}
{}
{}
35
CFG-FST Intersection
Algorithm
Initialize the generating non-terminal set of all pairs
of states.
Rule: If transition (s
1
,s
2
) in FST accept character t
and A -> t in CFG, add A to the generating non-
terminal set of (s
1
,s
2
)
S1 S2
a/b
A->a
=>
36
Solution of CFL-Reachability
Problem, cont.
For each non-terminal A on each pair of states
<s
1
,s
2
>, if Bgenerating-set(s
2
,s
x
)
C -> AB Productions, add C to generating-
set(s
1
,s
x
)
S1 S2
{,A}
Sx
{,B}
{,C}
C->AB
S1 S2
{,A}
Sx
{,B}
{}
=>
37
Solution of CFL-Reachability
Problem, cont.
For each non-terminal A on each pair of states
<s
1
,s
2
>, if Bgenerating-set(s
x
,s
1
)
C -> BA Productions, add C to generating-
set(s
x
,s
2
)
Iteratively execute last two steps until no more non-
terminals are added to the generating sets
Each time add a non-terminal to a generating set,
output the production used
The output productions are the intersection of FST
and CFG
38
An Example
S1
S2
S3
a/c
b/c
b/b
S->PQ
P->aPa | b
Q->Qb | b
The FST:
The CFG Grammar:
The Normalized Grammar:
S->PQ
A->a
B->b
P->RA | b
R->AP
Q->QB | b
39
An Example, cont.
S1
S2
S3
a/c
b/c
b/b
{B,P,Q}
{B,P,Q}
{A}
S->PQ
A->a
B->b
P->RA | b
R->AP
Q->QB | b
S1
S2
S3
a/c
b/c
b/b
{B,P,Q}
{B,P,Q}
{A}
{R}
S1
S2
S3
a/c
b/c
b/b
{B,P,Q}
{B,P,Q}
{A,P}
{R}
S1
S2
S3
a/c
b/c
b/b
{B,P, Q}
{B ,P,Q}
{A,P}
{R}
{Q}
40
Output productions used
When initialize the generating sets, output the
production with output terminal instead of the
accepted terminal
S1 S2
a/c
{A}
A
12
->c
S1 S2
a/c
{}
A ->a
41
Output productions used
For the non-terminals added later, use the
rule below:
V1 V2
{,A}
Vx
{,B}
{,C}
C->AB
V1 V2
{,A}
Vx
{,B}
{}
=>
C
1x
->A
12
B
2x
42
Resolve string operations in a
CFG with operations
Resolve the string operations using the
topological sorting algorithm
If the input non-terminals of a string operation Op
deduce pure CFG, resolve Op
Repeat the above step until there are no string
operations in the CFG
Example:
X1->X2X3
X2->X4.replace(*,)) … op1
X4->X5X6
X6->X7.replace([,]) … op2
X7->[X7]+
input of op1:
X4->X5X6
X6->X7.replace([,])
X7->[X7]+
input of op2:
X7->[X7]+
Resolve op2 first,
Then op1
43
Problems
String operations in a deduction cycle
How to deal with the case below?
Current technique cannot handle it, use the closure
of the character set of X7 as the approximation
X7 X7. replace([,])
X7 {], +}*
44
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
45
String Taint Analysis
Purpose
The basic string analysis estimates the possible
values of a hot spot, but it can not determine the
data source of the hot spot
String taint analysis tries to determine the data
source of a given hot spot
The original purpose of string taint analysis is to
determine whether the value of a hot spot comes
from user input
46
Running Example
public class Tricky{
static String bar (int k, String op) {
if (k==0) return "";
return op+bar(k-1,op)+”]”;
}
static String foo (int n) {
String b = “”;
for (int i=0; i<n; i++) b = b + “(”;
String s = bar(n-1,readChar());
return b + s.replace(']',')');
}
public static void main (String args[]) {
String hot = foo(Integer.parseInt(args[0]));
}
}
Hot Spot
Output:
n n-1 n-1
((…(**…*))…)
* is the input char
47
Basic Steps
Extract a CFG with operations from the source code
Add a Boolean taint for each non-terminal and
terminal in the CFG
For each terminal corresponding to a user input
function (e.g., readInput()), set the its taint to true
For each production, propagate the taint value from
the right hand side to the non-terminal at the left
hand side
48
Propagating Taints Through FST
V1 V2
{,A}
Vx
{,B}
{,C}
C(t)->AB
V1 V2
{,A}
Vx
{,B}
{}
=>
C
1x
(t)->A
12
B
2x
V1 V2
a/b
{A}
A
12
(t) ->b
=>
V1 V2
a/b
A(t)->a
49
Generalized String Taint Analysis
Traditional string taint analysis handles only
Boolean values, so it can only differentiate
two data sources of a hot spot
Generalized String Taint Analysis
Use a set instead of a Boolean value to represent a
taint
Allow more complex operations among taints of the
non-terminals/terminals of a production
Example: A(t1) B(t2)C(t3) => t1 = t2 t3
50
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
51
String Order Analysis
Limitations of basic string analysis and string taint
analysis
With basic string analysis and string taint analysis, we are able to
know the possible values and data sources of a hot spot, but we
do not know the order of the data sources appearing in the value
of a hot spot
String order analysis tries to answer questions like
“Is constant string a always after constant string b
when they co-appear in hot spot t?”
52
String Order Analysis
Example
We want to decide whether “abc” is inside a HTML
tag (i.e., whether “abc” is after “<” and before “>”)
$a = „abc‟;
$t = „f<br name=‟;
echo $a.$t.‟de‟.‟>‟;
53
Flag Propagation Algorithm:
Basic Idea
Given a CFG, identify which terminals / terminal
parts are inside tags (i.e., between „<‟ and „>‟)
Basic Solution:
1. Initialize known places (i.e., the terminals containing „<‟ or „>‟),
e.g., T(O)‟f<br name=„(I) O: outside, I: inside
2. Iterate propagating position information (I/O flags) to other
places in the CFG (via a list of rules)
3. End iterations if none of the flags in the CFG changes
abcf<br name=de>
54
Flag Propagation Algorithm
Add a left flag and a right flag to each
variable in the CFG. A flag may be of one of
the four values:
O: Indicate that the place where the flag stays is outside a tag
I: Indicate that the place where the flag stays is inside a tag
U: Indicate that the place where the flag stays is unknown
C: Indicate that the place where the flag stays may be both
inside/outside a tag (e.g. $c=„abc‟; echo $c.‟<tag name=„.$c‟>‟;)
Initialize the flags of terminals
Terminals with „>‟ or „<‟: Initialize with “I” or “O” accordingly
Others: initialize with “U”
55
Flag Propagation Algorithm
Propagate flags in the CFG using the flag
operation and four propagating rules
iteratively
The Flag Operation (+)
When two flags meet, we use the flag operation to calculate the
propagation result of the two flags
U+U = U O+U = O I+U = I
O+O = O I+I = I I+O = C
C+* = C
56
Flag Propagation Algorithm
Four Propagation Rules
Neighboring Rule (for neighboring variables)
S->A(R)(L)B
e.g.: S->A(O)(U)B => S->A(O)(O)B
S->A(U)(O)B => S->A(O)(O)B
Transitive Rule (for terminals without „<‟ and „>‟)
S->(L)‟abc‟(R)
e.g.: S->(O)‟abc‟(U) => S->(O)‟abc‟(O)
Left Deducing Rule
(L)S->(L)AB
e.g.: (U)S->(O)AB => (O)S->(O)AB
Right Deducing Rule
S(R)->AB(R)
57
Example CFG
A‟abc‟
T‟f<br name=„
D‟de‟
E‟>‟
SATDE
abcf<br name=de>
58
Initialization
(U)A(U)(U)‟abc‟ (U)
(U)T(U)(O)‟f<br name=„(I)
(U)D(U)(U)‟de‟ (U)
(U)E(U)(I)‟>‟ (O)
(U)S(U)(U)A(U) (U)T(U) (U)D(U) (U)E(U)
U: unknown
abcf<br name=de>
59
Propagation
(U)A( )( )‟abc‟ ( )
( )T( )(O)‟f<br name=„(I)
( )D( )(U)‟de‟ (U)
( )E( )(I)‟>‟ (O)
(U)S( )(U)A( ) ( )T( ) ( )D( ) ( )E( )
U
O
U
U
U
I
U
U
U
U U U U
O I
U O I I O O
I
O
I
Left Deducing
Rule
(L)S(L)AB
Right Deducing
Rule
S(R)AB(R)
U
O
Neighboring
Rule
SA(R)(L)B
U
U I I
U
O
U O
Transitive Rule (for
terminals without „<‟
and „>‟)
S (L)‟abc‟(R)
abcf<br name=de>
60
Final CFG with
Differentiated Terminals
(O)A(O)(O)‟abc‟ (O)
(O)T(I)(O)f<br name=„(I)
(I)D(I)(I)‟de‟ (I)
(I)E(O)(I)‟>‟ (O)
(O)S(O)(O)A(O) (O)T(I) (I)D(I) (I)E(O)
abcf<br name=de>
61
Conflict Cases
Code
$a = „abc‟
echo $a.‟<‟.$a.‟>‟
CFG
()A(O)„abc‟
(O)B(I)(O)‟<‟(I)
(I)C(O)(I)‟>‟(O)
S(O)A(?)(O)B(I)
A(?)(I)C(O)
Final Result
A(C)„abc‟(C)
B(O)‟<‟(I)
C(I)‟>‟ (O)
S(C)A(C) (C)B(C)
(C)A(C) (C)C(C)
Complication: abc is used both inside and outside tags
Solution: C Flag for Conflict: O + I = C; C + O|I|U|C = C
except for the flags of initialized known places
abc<abc>
62
Example Code
$s = "";
for($i=0;$i<$n;$i++){
$a = "Name:";
$b = "StudentName“.$i.”\””;
$b = “ value=";
$c = $attr."\"default";
$p = $a."<input name=\"“
.$b.$c;
$p = $p."\">";
$s = $s."\n".$p;
$i++;}
echo $table;
Name:<input
name="StudentName$i"
value="default">
PHP Code: HTML Texts:
63
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
64
Why database applications?
Database software projects depends on SQL
queries to manipulate the database
SQL queries are usually dynamically generated to
make the program more flexible
Dynamically generated SQL queries, an example:
Connection con = DriverManager.getConnection ("students.db");
String q = "SELECT * FROM address";
if (id!=0) q = q + “ WHERE studentid=" + id;
ResultSet rs = con.createStatement().executeQuery(q);
65
Recent important applications
Verify the correctness of dynamically
generated SQL queries
Detect SQL injection vulnerability
Determine the impact of database schema
changes
66
Verify the correctness of dynamically
generated SQL queries
Proposed by Christensen et al. in 2003
Purpose:
Verify whether all the possible values of the
dynamically generated SQL queries are legal
according to the SQL syntax
67
An example
Legal dynamically generated SQL queries
Possibly illegal dynamically generated SQL
queries
int id = readInt();
String query = "SELECT * FROM address";
if (id!=0) query = query + “ WHERE studentid=" + id;
int id = readInt();
String query = "SELECT * FROM address";
if (id!=0) query = query + “ WHERE studentid=" + id;
else query = query + “WHERE studentid=" + id;
missing space!!
68
Approach
Identify all the query execution statements in the
source code and mark the variables representing a
query as hot spots
Use basic string analysis to estimate the possible
values of each hot spot t, represented as an
automaton M(t)
Approximate the SQL syntax as a finite state
automaton MS with 631 states, and calculate its
complement MS’
For each t, check whether M(t) MS’ = Φ
69
Evaluation
Evaluation on 9 programs
time in seconds
in Mbs
70
Limitations
Sound but incomplete (may have false
positives)
Can find only syntax errors, cannot find run-
time errors (e.g., type inconsistencies)
71
Detect SQL injection vulnerability
Proposed by Gary Wassermann and
Zhendong Su, 2007
Purpose
Check whether a dynamically generated SQL
query may involve in a SQL injection
vulnerability
72
An Example of SQL injection
Consider the query below:
query = "SELECT * FROM accounts WHERE
name='“+readName()+"' AND password='“+readPassword();
If input „ OR 'a'='a„, we get:
SELECT * FROM accounts WHERE
name='badguy' AND password=„ ' OR 'a'='a'
73
Approach
Build regular policy for each value field in the SQL
statement
For each query and its corresponding CFL, compute
the intersection of the CFL and the regular policy
If the intersection is not empty and contains
substrings from un-trusted source (user input), a
SQL injection is found
74
Evaluation
Evaluation on 5 real world projects
Indirect errors: a user-input string goes to the dangerous
part of a SQL query through the database
Example:
String insert = "insert into table values ("+readString()+"," readInt()+")";
executeQuery (insert);
ResultSet rs = executeQuery ("select * from table");
String query = "select * from table where id="+rs.getString(0);
75
Limitations
Sound but incomplete, may has false
positives
Can not provide test cases for the developer
to understand the vulnerability
76
Determine the impact of database
schema changes
Proposed by Andy Maule et al., in 2008
Purpose:
Determine which statements in the source
code may require fix after a change on the
database schema (e.g., a change on the
name of a table/column, adding/removing
table/columns)
77
Impact of schema change: An
example
queryResult = QueryRunner.Run(
"SELECT Experiments.Name,Experiments.ExperimentId"+
" FROM Experiments"+
" WHERE Experiments.Date={@ExpDate}", dbParams);
schema
Remove this
column
78
Approach
Mark all the SQL queries that goes to a SQL query
execution statement as hot spots
For each hot spot, estimate its possible values using
basic string analysis
For the name of each table column in the schema,
build an automaton like “Σ
*
nameΣ
*
”, which
represents all strings containing the name
Intersect the automaton M(t) of each hot spot t and
of each table column M(c)
M(t) M(c) Φ => a change on c affects t
79
Evaluation
Do evaluation on the irPublish Content Management System, which
consists of 127KLOC C# code
The database include 101 tables and 615 columns
Schema Changes:
Predicted Changes vs.
Real Changes:
80
Limitations
Sound and incomplete, with low precision
because whenever the changed column is
involved in a statement, it raise a warning
81
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
82
Why web applications?
Web-based software projects use html text to
present web pages
Html texts are usually dynamically generated to
make the program more flexible
Dynamically generated html texts, an example in
PHP:
$x = _Post[Color]
$content = _Post[content]
if ($errMsg == "")
echo ("<p><h2><font color='".$x."‟>".$content.
"</font></h2><p>\n");
83
Recent Important Applications
Verify the correctness of dynamically
generated web pages
Detect cross-site-scripting vulnerabilities
84
Verify the correctness of dynamically
generated web pages
Proposed by Minamide in 2005
Purpose:
Verify whether all the possible values of the
dynamically generated web page comply with
the html syntax
85
An example
Legal dynamically generated SQL queries
Possibly illegal dynamically generated SQL queries
If $head == “”, the <h1> tag will
be unclosed due to the missing </h1>
echo "<html>";
echo "<h1>";
if($head!="")
echo $head;
echo "</h1></html>“;
echo "<html>";
echo "<h1>";
if($head!="")
echo $head.</h1>;
echo "</html>“;
86
Approach
Add a statement to concatenate all the outputs of a
web page generating unit (e.g., a .php file), and set
the concatenation result as the hot spot
Use basic string analysis to estimate the possible
values of the hot spot, represented as a CFG G
Approximate the HTML syntax as a finite state
automaton M by limit the recursive depth of the tags,
and calculate its complement M’
Check whether G M’ = Φ
87
Evaluation
Program #lines #non-terminals #productions Time (sec)
webchess
schoolmate
faqforge
phpwims
timeclock
2224
8085
843
726
462
300
7985
180
82
656
450
9505
443
226
1233
0.36
39.92
0.16
0.13
0.15
Evaluation on 6 programs
Time to generate
CFG
Validation Results
Max Recursive Depth
Program Depth Bugs Time (sec)
webchess
schoolmate
faqforge
phpwims
timeclock
9
17
10
9
14
1
14
30
7
11
123.33
7580.69
45.64
63.93
145.61
88
Limitations
Sound but incomplete (may have false
positives)
Can find only syntax errors, cannot find run-
time errors (e.g., script refer to illegal
variables)
89
Detect cross-site-scripting
vulnerabilities
Proposed by Gary Wassermann and
Zhendong Su, 2008
Purpose
Check whether a dynamically generated web
page may involve in a cross-site-scripting
vulnerability
90
Example
An cross-site-scripting vulnerability:
In form.php: <form action=„view.php‟><input id=1
name=„content‟></form>
In view.php:
echo “<div></td>Content: ” . _POST(„content‟)
if we input “<script>badcode</script>” to the „content‟ item of
form.php, bad code goes to view.php
91
Approach
Build regular policy for all the HTML texts that will
invoke a script interpreter
For the CFL of the HTML text, compute the
intersection of the CFL and the regular policy
If the intersection is not empty and contains
substrings from un-trusted source (user input), a
XSS vulnerability is found
92
Evaluation
The
information
of subjects
Result of
the detection
Caused by
user input
Caused by un-initialized
variables, which can be
set by a user when
export global is true in
PHP
93
Limitations
Can not handle DOM-based cross-site-
scripting vulnerabilities which read malicious
code from the DOM
Can not follow complex data flow such as
web page visits and dynamic code
94
Outline
Basic Concepts
Techniques
Basic String Analysis
String Taint Analysis
String Order Analysis
Applications
Database Applications
Web Applications
Software Internationalization
...
...
95
Globalization Process
One-language
Version
Internationalized
Version
English
Property
German
Property
Chinese
Property
Developer
I18n
L10n
All language specific
code elements are
externalized to
property files
I18n Conducted for
Old software projects
New project with no global plan at first
Using old components
I18n
Two Steps:
Internationalization(I18n)
Localization (L10n)
96
Example of I18n and L10n
Original Code Elements
Externalized Code Elements
Property files
97
Language Specific Code
Elements
Constant Strings
Date/Number Formats
Currency/Measures
Writing Direction
Color/Culture related elements
Constant Strings are of the largest number, and some of
them are very hard to be located.
98
Motivation of our work
There are a lot of constant strings
We should not translate all of them
It is sometimes hard to decide which string is
need-to-translate
Application/
Version
#LOC #Constant
Strings
#Need-to-Translate Strings (Not
externalized in the subsequent version)
Rtext0.8.6.9 (Core
Package)
17k 1252 408(121)
Risk1.0.7.5 19k 1510 509(55)
ArtOfIllusion1.1 71k 2889 1221(816)
Megamek0.29.72 110k 10464 1734(678)
99
Basic Idea
We assume that all need-to-translate strings are those
strings that are sent to the GUI
String Variables
/Expressions
GUI
Constant Strings
100
Output API Methods
Output API Methods are methods that pass at least one of
its parameters to the GUI
Example
java.awt.Graphics2D.drawString(java.lang.String, int, int)
drawString 1 false 0
Initial Output Strings are the arguments sent to Output API
Methods
g.drawString (weaponMessage, 30,20)
We locate the string using Eclipse API Search Engine
101
Challenges
String operations (concatenate, tokenize, substring, etc..)
String transmissions:
String Comparisons:
Trivial Strings: “123”, “ ”, “Risk”, …
Client GUI
network Server
Client GUI
String1
String2
Comparison
GUI
String1
String1:part1
String1:part2
GUI
String1:part1
String1:part2
102
Experimental subjects
RText : Simple Editor
Risk : Board Game
ArtOfIllusion : Graph Drawing Project
Megamek : Big Real Time Strategy Game
Application/Version Starting
Month
#Developers #LOC #Files #Constant Strings
RText 0.8.6.9 11/2003 16 17k 55 1252
Risk 1.0.7.5 05/2004 4 19k 38 1510
AOI 1.1 11/2000 2 71k 258 2889
Megamek 0.29.72 02/2002 33 110k 338 10464
103
Bugs found
We found 17 not-externalized need-to-translate
strings in the latest version of Megamek and
reported them as report 2085049. The
developers confirmed and externalized them.
104
Web Applications: Problems
Web applications will not only output user-visible
strings but also tags.
$name = 'Xiaoyin Wang';
$position = 'Ph.D. Candidate';
$part = 'Software Engineering Institute';
$part_ref = 'http://www.sei.pku.edu.cn/';
$univ = 'Peking University';
$univ_ref = 'http://www.pku.edu.cn/';
echo '<DIV><FONT size=6><B>'.$name.'</B></FONT>'
echo '<P>'.$position.'<BR><A href=\"'.$part_ref.'\">'
.$part.'</A><BR><A href=\"'.$univ_ref.'\">'.$univ.'</A>'
Code
HTML
Screen
105
User-Visible Constant Strings in
Web Applications
Constant Strings outside Tags
echo "and pressed 'refresh' on your browser.
In this case, your responses have<br/>\n";
echo "already been saved."
echo "</font></center><br /><br />";
(from question.php, Lime Survey 0.97)
Constant Strings in value attribute of input tags
if (substr(strtolower($reply_subj), 0, 3) != "re:")
$reply_subj = "Re: ".$reply_subj;
echo " <INPUT TYPE=TEXT NAME=passed_subject
SIZE=60 VALUE=\"$reply_subj\">";
(from compose.php, SquirrelMail 0.2.1)
106
Not-visible Constant Strings in
Web Applications
Constant String inside Tags
if ( $t == $timetohighlight) { $c = "red";} else{
$c = "white";
}
echo "<td bgcolor=$c>";
(from day.php3, MRBS version 0.6)
107
Challenges
Differentiate constant strings inside and outside
tags
Identify constant strings that are parts of certain
attribute of certain tags, such as “value” attribute
of <input> tags.
Easy for static html texts, but difficult dynamic html texts
the generated html texts by code can be various and infinite
108
Approach Overview
PHP
Code
String Taint
Analysis
Context Free Grammar
(CFG) Presenting Output
Flag
Propagation
CFG With
Differentiated
Terminals
Input Tag
Checking
Output
Our Approach
Constant String
outside Tags:
abcf<br name=de>
Constant String in
value attribute of input
tags:<input type=text
value=search>
Constant
strings to be
translated
109
Step 1 - String Taint Analysis
PHP
Code
String Taint
Analysis
CFG Presenting
Output
$a = „abc‟;
$t = „f<br name=‟;
echo $a.$t.‟de‟.‟>‟;
A‟abc‟
T‟f<br name=„
D‟de‟ E->‟>‟
SATDE
All Possible
Contents of the
output HTML
abcf<br name=de>
Static detection of SQL injection vulnerabilities, Wassermann and Su PLDI'07
110
Step 2 Tag Range Analysis
CFG Presenting
Output
Flag
Propagation
CFG With
Differentiated
Terminals
Aabc
Tf<br name=
Dde E->‟>‟
SATDE
A->‟abc‟
T->‟f<br name=„
D->‟de‟ E->‟>‟
S->ATDE
abcf<br name=de>
111
Step 3- Input Tag Checking
“input”
inside
tags
Finding
„<‟ before „input‟
No „<‟
„<‟
Stop
Determine
scope
of input tag
“type”/”value”
inside the scope
of input tags
Determine
scope
of type/value
attributes
Terminals
inside type/value
attributes
Output value
Stop
visible
types
other
Determinin
g scopes
Find
before/after
terminal
<input type=text value=search>
<input type=text value=search> <input type=text value=search>
<input type=text value=search>
112
Evaluation Subjects
Three PHP projects
Lime Survey
Squirrel
Mrbs
PJ/Ver #LOC #Constant Strings #Need-to-
Translate
Lime Survey 0.97 11.3K 6493 290
Squirrel0.2.1 4.0K 2457 184
MRBS 0.6 1.4K 704 57
Only a small
percentage of
constant strings are
need-to-translate
432 externalized by developers at v+1 version
62 externalized by developers at later versions
37 manually verified/confirmed by us
113
Evaluation Result
BS: string taint analysis
BS+O: string taint analysis + flag propagation
ALL: string taint analysis + flag propagation + input tag checking
Flag propagation helps find outside-tag constant strings
and reduce false positives greatly
Input tag checking helps find need-to-translate constant
strings inside input tags and reduce false negatives
Most need-to-translate constant strings are outside tags
Our approach has small false positives and reasonable
false negatives
114
Found Constant Strings
Externalized in Later Versions
Our approach found 62 constant strings
(5: Lime Survey, 44: Squirrel Mail, 13: MRBS)
not externalized at the internationalization
but externalized later
Example (smtp.php of Squirrel Mail, externalized 3 years
later)
115
Thank you!
116