Path-and index-sensitive string analysis based on monadic second-order logic Winner of the ACM Distinguished Paper Award

Path- and Index-sensitive String Analysis Based on
Monadic Second-order Logic
Takaaki Tateishi
IBM Research - Tokyo
[email protected]
Marco Pistoia
IBM Research - T. J. Watson
Research Center
[email protected]
ABSTRACT
Omer Tripp
IBM Software Group and
Tel Aviv University
[email protected]
applications, where string values used in security-sensitive computations are compared to safe and/or unsafe string patterns to detect potential security vulnerabilities, such as cross-site scripting
(XSS), HTTP response splitting (HRS) and Structured Query Language (SQL) injection (SQLi) [2].
We propose a novel technique for statically verifying the strings
generated by a program. The verification is conducted by encoding
the program in Monadic Second-Order Logic (M2L). We use M2L
to describe constraints among program variables and to abstract
built-in string operations. Once we encode a program in M2L, a
theorem prover for M2L, such as MONA, can automatically check
if a string generated by the program satisfies a given specification,
and if not, exhibit a counterexample. With this approach, we can
naturally encode relationships among strings, accounting also for
cases in which a program manipulates strings using indices. In
addition, our string analysis is path sensitive in that it accounts for
the effects of string and Boolean comparisons, as well as regularexpression matches.
We have implemented our string-analysis algorithm, and used
it to augment an industrial security analysis for Web applications
by automatically detecting and verifying sanitizers—methods that
eliminate malicious patterns from untrusted strings, making those
strings safe to use in security-sensitive operations. On the 8 benchmarks we analyzed, our string analyzer discovered 128 previously
unknown sanitizers, compared to 71 sanitizers detected by a previously presented string analysis.
1.1
String Analysis for Security
General Terms
A common way of conducting string analysis is by constructing
context-free grammars or regular grammars to approximate strings
[8, 23, 33]. With this approach, each built-in string operation is
modeled by a grammar transducer. This form of string analysis
is suitable for analyzing code that sanitizes strings using stringmanipulation operations such as Java’s replace method. Many
Web applications fall in this category because they sanitize their inputs by removing potentially malicious string patterns or replacing
them with safe ones. Sanitizers often perform validation against
certain patterns, and process the inputs only if validation succeeds.
For these validation-based sanitizers, a path-sensitive string analysis is necessary. Conversely, path-insensitive string analyses will
conservatively report violations even when proper validation takes
place.
In addition, a very large number of Web applications perform
sanitization by extracting substrings from input strings, starting at
specific indices. Grammar-based string analyses are unable to precisely verify strings that are constructed in this way, and will have
to report violations conservatively even when proper index-based
sanitization has taken place. Consequently, path-sensitivity and the
ability to model index-based string manipulation are essential features when verifying Web applications for security.
Languages, Security, Verification
1.2
Keywords
The Java method clean in Figure 1 can be used to prevent an
XSS attack. The input to the method can be any possible value, including values potentially under the control of an attacker. In XSS,
an attacker typically wraps JavaScript code into a (<script>,
</script>) tag pair, and embeds it into text that becomes part
of an online encyclopedia, blog, or social network. Once the text
is rendered on other people’s browsers, the embedded code is automatically executed on the victims’ computers. For this example,
we consider the output safe if it does not contain character <.
In order to verify that the program is immune to XSS attacks,
we need to prove that no string generated by the program contains
<. According to this specification, clean is considered a valid
XSS sanitizer; when condition v1.contains(v2) holds, < is
effectively removed from the input string by combining indexOf
and substring, and when that condition does not hold, the string
value returned by the method is the same as the input string, which
does not contain <. However, a path-insensitive grammar-based
string analysis cannot follow this line of reasoning, since it would
Categories and Subject Descriptors
D.2.4 [Software Engineering]: Software/Program Verification;
D.2.5 [Software Engineering]: Testing and Debugging
String Analysis, Static Program Analysis, Web Security
1. INTRODUCTION
String analysis [7, 8, 12, 17, 19, 23, 33] is a particular form of
program analysis whose purpose is to infer string values arising
at run time. It is often used in the verification of server-side Web
Permission to make digital or hard copies of all or part of this work for
Permission
make digital
hard copies
of fee
all provided
or part ofthat
thiscopies
work are
for
personal or to
classroom
use isorgranted
without
personal
granted
without feeadvantage
provided and
thatthat
copies
are
not madeororclassroom
distributeduse
for is
profit
or commercial
copies
not made or distributed for profit or commercial advantage and that copies
bear this
this notice
notice and
and the
the full
full citation
citation on
bear
on the
the first
first page.
page. To
To copy
copy otherwise,
otherwise, to
to
republish, to
to post
post on
on servers
servers or
republish,
or to
to redistribute
redistribute to
to lists,
lists, requires
requires prior
prior specific
specific
permission and/or
and/or aa fee.
fee.
permission
ISSTA ’11,July
July17–21,
17-21, 2011,
2011, Toronto,
Toronto, ON,
ON, Canada
Canada
ISSTA’11,
Copyright 2011 ACM 978-1-4503-0562-4/11/05 ...$10.00
...$10.00.
166
Motivating Example
String clean(String v1){
String v2 = "<";
if (v1.contains(v2)) {
int v3 = v1.indexOf(v2);
String v4 = v1.substring(0, v3);
return v4;
}
return v1;
}
(Position variable)
(Position-set variable)
(Position term)
(Position-set term)
(Formula)
Figure 1: Sanitization against XSS
p ∈ Var1
P ∈ Var2
t ::= p | t + i | t − i | $ | 0
T ::= ∅ | {t, . . . , t} | all | P | T ∪ T | T ∩ T
| T \ T | T −1
ϕ ::= ‘a’(t) | t = t | t < t | t ≤ t
| T =T |T ⊂T |T ⊆T |t∈T
| ¬ϕ | ϕ ∧ ϕ | ϕ ∨ ϕ | ϕ ⇒ ϕ | ϕ ⇔ ϕ
| ∃p.ϕ | ∀p.ϕ | ∃P.ϕ | ∀P.ϕ
Figure 2: Syntax of M2L(Str)
fail to capture the relationship between v2 and v3, and thus the
effect of the ensuing substring operation. As a consequence, it
produces a resulting grammar that conservatively contains <.
Furthermore, to the best of our knowledge, PISA is the first
purely static string analysis that simultaneously handles index
sensitivity, path sensitivity, and string-replacement operations.
• Sanitizer detection by path- and index-sensitive string analysis.
String analysis has already been used for sanitizer detection [4].
However, [4] uses an imprecise string analysis, which is neither
index nor path sensitive, and compensates for this loss in precision by relying on a complementary dynamic analysis. PISA, on
the other hand, is much more precise, which obviates the need
for an accompanying dynamic analysis. This also enables scanning applications during the development phase, where they cannot yet be deployed (and thus dynamic analysis cannot be used),
which is the optimal stage for detecting security vulnerabilities.
• Implementation and evaluation. PISA is fully implemented and
is featured in a commercial security product [1]. We evaluated
PISA’s precision by comparing it with the technique of [23, 12]
on 8 open-source benchmarks. We further examined PISA’s effectiveness by integrating it into a commercial taint-analysis algorithm. The results show PISA to be far more precise than the
previous technique, and also effective in boosting the precision
of its client taint analysis.
1.3 Our Approach
To abstract string values, we use M2L(Str) (Monadic Secondorder Logic on strings) [16]. The effect of branch conditions and
dependencies among program variables is abstracted and encoded
as M2L formulae. Built-in string operations are also abstracted
by M2L formulae, with each formula representing relationships
among input and output parameters. In particular, a string operation using an index can be represented naturally by a M2L formula,
since M2L(Str) is capable of explicitly mentioning positions in a
given finite string and can deal with variables ranging over positions (position variables) or variables ranging over sets of positions
(position set variables) on the finite string.
The use of M2L(Str) has the following advantages in addition to
enabling index sensitivity combined with path sensitivity:
• Conservativeness. M2L(Str) captures not only fixed-size strings
but also finite strings (regular languages). This feature is necessary for guaranteeing that our string analysis is conservative,
which implies that sanitization code verified by our string analysis can be safely used as a sanitizer, and for conservatively modeling several important built-in string operations such as Java’s
replace method in a manner similar to finite-state transducers
that cannot be captured by fixed-size representation.
• Efficient and effective automaton representation. We can exploit an automatic theorem prover MONA [20] to implement our
string analysis algorithm, where MONA uses the BDD-based automaton representation of M2L formulae. Furthermore, the use
of MONA has potential to advance the string analysis implementation in the future, since MONA enables performing separate
compilation and generating constraints on input strings (like vulnerability signatures [6, 35]) including counterexamples.
Our analysis consists of the following two automated processes:
(i) encoding a string-manipulating method as an M2L formula ϕ1
that represents possible strings returned by the method, and (ii) encoding a regular expression indicating unsafe strings as an M2L
formula ϕ2 , and checking the satisfiability of ϕ1 ∧ ϕ2 to verify that
the possible strings returned by the method never contain any of
unsafe strings, where the method is reported as a sanitizer iff the
formula is unsatisfiable. In the first process, the effect of branch
conditions and index-based string manipulations are also encoded
in M2L, and reflected to the M2L formula ϕ1 . Therefore, our string
analysis is both path-sensitive and index-sensitive, and thus we call
it PISA (Path- and Index-sensitive String Analysis) in this paper.
1.5 Organization
The rest of the paper is organized as follows: In Section 2, we
present the overview of our string analysis algorithm. The core
string-analysis algorithm is described in Section 3. Then, in Section 4, we extend the core algorithm with index sensitivity and path
sensitivity. In Section 5, we extend our analysis to become interprocedural. Section 6 discusses our implementation of the algorithm,
as well as experimental results. Section 7 surveys related work, and
Section 8 concludes this paper.
2.
OVERVIEW
Our string analysis verifies a program by encoding it in M2L(Str)
and then checking the satisfiability of an M2L formula. Therefore,
we first present the definition of M2L(Str), and then briefly describe
how to encode strings and programs in M2L(Str).
2.1
Monadic Second-order Logic on Strings
M2L(Str) is a widely used vehicle for a variety of verification
problems [16]. The syntax of M2L(Str) is defined in Figure 2,
where Var1 denotes a set of position variables and Var2 a set of
position-set variables. The formula ‘a’(t) holds if ai in (finite)
string w = a0 · · · an−1 is ‘a’, where i is the interpretation of t.
Constants 0 and $ represent the first and last positions in a string,
respectively. The addition t + i of position-term t and natural number i is interpreted as t + i = j + i mod n, where j is the interpretation of t, and n is the length of string w. T + i, where T is a
position-set term, results in position set {t + i | t ∈ T }. t − i and
T − i are interpreted similarly.
Its semantics is determined by checking whether an M2L formula ϕ holds on a finite string w ∈ Σ∗ and an assignment I ∈
1.4 Contributions
This paper makes the following contributions:
• Novel features enabled by M2L. Our encoding method goes beyond that for regular expressions [20]. Compared to existing
string analyses based on bit-vector logic and/or word equation [5,
19, 25], our M2L-based approach can model more string transformations such as replacement and upper-case transformations.
167
progv1 (V1 )
progv2 (V2 )
progv3 (v3 , V1 )
prog0 (v0 , V1 )
progv1’ (V1 )
progv4 (V4 )
progv1” (V1 )
≡
≡
≡
≡
≡
≡
≡
true
“<”(V2 )
[|indexOf|] (v3 , V1 , progv1 , progv2 )
min(v0 , V1 )
progv1 (V1 ) ∧ [|contains("<")|] (V1 )
[|substring|] (V4 , progv1’ , prog0 , progv3 )
progv1 (V1 ) ∧ ¬ [|contains("<")|] (V1 ),
we introduce the program variables v1’ and v1” to distinguish the
program variable v1 used in the “then” block of the if-statement
from that returned at the end of the method. The variables V1 ,
V2 , v3 , and V4 are also M2L variables associated with the program variables v1,v2,v3, and v4, respectively, where upper-case
variables V1 ,V2 , and V4 are position set variables each of which
represents a string, and the lower-case variable v3 is a position
variable which represents an index. Note that the effect of the
branch condition is encoded as constraints on these M2L variables
in the declarations of the predicates progv1’ and progv1” , where
[|contains("<")|] (V1 ) 2 is the abstraction of the condition co
ntains("<"), which means that the string represented by V1
contains the string “<”. The predicate prog0 and progv3 means
index 0 and an index assigned to the program variable v3, where
each of the predicates takes two parameters: a position variable and
a position set variables, since we represents an index of a string
using the pair of a position and a position set. [|indexOf|] and
[|substring|] are the abstractions of string operations indexOf
and substring as in the case of [|concat|], respectively. The
details of these abstractions are discussed later in Section 3 and 4.
If our only concern is that the string “<” is unsafe, an unsafe
specification Unsafe for the program is defined as Unsafe(V ) ≡
∃R′ . substr(R′ , V ) ∧ “<”(R′ ), where substr(R′ , V ) means that
the string R′ is a substring of V . Alternatively, we can use the
regular expression “.*<.*” which is equivalent to the specification, since we can encode the regular expression into M2L. We then
verify that the program never return a string containing the unsafe
string by confirming the unsatisfiability of this formula:
∃V . (progv1” (V ) ∨ progv4 (V )) ∧ Unsafe(V ).
Figure 3: An example of predicate declarations
(P → 2Pos ), where P is the set of free position set variables1 ,
and 2Pos is the power set of the position set Pos. When the formula
holds, we write w, I |= ϕ. For example, the M2L formula ‘a’(0) ∧
‘b’(1) ∧ ‘c’(2) ∧ ‘a’(3) holds on the finite string w = “abca”,
which states that character ‘a’ is located at positions 0 and 3, while
characters ‘b’ and ‘c’ are located at positions 1 and 2, respectively.
Thus, we can write w, I |= ‘a’(0)∧‘b’(1)∧‘c’(2)∧‘a’(3), where
I = {}.
2.2 Encoding Programs in M2L(Str)
Our encoding method treats string values using position sets without loss of the order of characters. For example, given the formula
‘a’(0) ∧ ‘b’(1) ∧ ‘c’(2) ∧ ‘a’(3) ∧ P = {0} ∧ Q = {1, 3} ∧ R =
{0, 1, 3} holds on w = “abca” and I = {P 7→ {0}, Q 7→
{1, 3}, R 7→ {0, 1, 3}}, the position set variables P ,Q, and R can
be considered to be the strings “a”, “ba” and “aba”, respectively.
In this representation of strings, the concatenation of the two strings
represented by P and Q is captured by P ∪ Q, which is equal to
R, without loss of the order of characters, since all of the positions
in P are less than any position in Q. Thus, this concatenation relationship can be represented by the predicate:
3.
concat(R, P, Q) ≡
(R = P ∪ Q) ∧ (∀p, q . p ∈ P ∧ q ∈ Q ⇒ p < q),
where p < q ensures the order of the characters in P and Q. Based
on this predicate, we introduce the notation “s”(S) which means
that position-set variable S represents string s.
With this encoding of strings, a program is encoded as a set of
M2L predicate declarations, where each M2L predicate is declared
corresponding to the definition of a program variable. We also use
a pre-defined predicate for every string operation to represent each
constraint among the return value and the parameters, where any
constraint can be abstracted. For example, let us consider the following two-line program.
3.1
Target Language
We assume that the target program is translated to Static Single Assignment (SSA) form [24, 10]. An SSA program comprises
numbered basic blocks, each of which consists of the instructions
in Figure 4. In addition, we use the notation op(h) for built-in operator h. For example, x=op(+)(1,2) assigns the result of 1+2
to program variable x. Note that the return instruction and the instruction of calling a user-defined function are introduced only for
reflecting actual program languages such as Java, as the core of our
string analysis algorithm is intraprocedural. We omit the details of
the method of encoding those types of instructions in Section 3.
However, this does not restrict us from conducting an interprocedural analysis, since our algorithm can be simply extended so as
to conduct the interprocedural analysis by treating assignment relationships between caller’s program variables and callee’s program
variables as described in Section 5.
The following SSA program represents the clean method introduced in Section 1, where 1:, 2:, and 3: represent the basic
blocks numbered 1,2, and 3.
String v1 = "a";
String v2 = v1.concat(v1);
Here is the set of predicate declarations when this program is encoded into M2L.
progv1 (V1 ) ≡ “a”(V1 )
progv2 (V2 ) ≡ [|concat|] (V2 , progv1 , progv1 )
where [|concat|] (R, P1 , P2 ) ≡
∃P1 , P2 . P1 (P1 ) ∧ P2 (P2 ) ∧ concat(R, P1 , P2 )
Each predicate progvi represents the post-condition for the assignment to program variable vi , where the parameters V1 and V2 of the
predicates are associated with the program variables v1 and v2,
respectively. [|concat|] is a pre-defined predicate representing an
abstraction of the string concatenation operation, where the return
value is represented by R, and the parameters are represented by
P1 and P2 which are instantiated by the predicates associated with
the program variables.
With this encoding of programs, the clean method of Figure 1
is encoded as the set of predicate declarations of Figure 3, where
1 Position
CORE ALGORITHM
In this section, we first describe our target language, and then
describe the method of encoding strings and regular expressions
followed by the method of encoding a program as a set of M2L
predicate declarations. We also present a formal discussion of our
encoding method, accompanied by a soundness theorem.
1:v0 =
v2 =
b1 =
jump
0;
2:return v1;
"<";
3:v3 = indexOf(v1,v2);
v1.contains(v2); v4 = substring(v1,v0,v3);
b1, 3
return v4;
2 The reason why we do not use [|contains|] (V , V ) is that M2L as well
1
2
as our encoding method cannot treat equality of strings. Therefore, our
encoding method relies on constant propagation analysis to obtain concrete
strings such as “<” to avoid this limitation.
variables are handled by using singleton position set variables.
168
(Assignment)
(Function call)
x=v
x = f (x1 , · · · , xn )
(ϕ function)
x = phi(b1 : x1 , . . . , bn : xn )
(Conditional jump)
(Goto)
(Return)
jump x, b
goto b
return x
The value v is assigned to program variable x.
The result of invoking function f with parameters x1 , · · · , xn is assigned to program variable x. f is either a built-in function or a user-defined function.
When an immediate predecessor of the current basic block is bi , program-variable x is
assigned the value of xi . Basic-block numbers are omitted for brevity when possible.
The program jumps to the basic block numbered b if the value of x is true.
The program jumps to the basic block numbered b.
The value of the program variable x is returned.
Figure 4: Instructions of the target language
“a1 · · · an ”(P )
≡
concat(R, P, Q)
≡
strr(R, P, p, q)
≡
substrr(R, P, p, q)
≡
substr(R, P )
consecutive(p, q, R)
≡
≡
∃t1 , · · · , tn .
‘a1 ’(t1 ) ∧ · · · ∧ ‘an ’(tn )
∧ t1 < t2 ∧ t2 < t3 ∧ · · · ∧ tn−1 < tn
∧ P = {t1 , · · · tn }
R=P ∪Q
∧(∀p, q . p ∈ P ∧ q ∈ Q ⇒ p < q)
p ≤ q ∧ R ⊆ P ∧ (∀r . r ∈ P
⇒ (r ∈ R ⇔ p ≤ r ∧ r < q))
∃p′ , q ′ . p ≤ p′ ∧ p′ ≤ q ′ ∧ q ′ ≤ q
∧strr(R, P, p′ , q ′ )
substrr(R, P, min(P ), max(P ) + 1)
p<q∧p∈R∧q ∈R
∧(∀r . p < r ∧ r < q ⇒ r ̸∈ R)
[|replace|] (R, Ps , Px , Py ) ≡
∨
′
′
′
v∈V (∃S, X, Y . Ps (S) ∧ (∀S . substr(S , S \ X) ⇒ ¬“v”(S ))
∧ ⟨⟨v ⋆ ⟩⟩′ (X, S) ∧ ⟨⟨y ⋆ ⟩⟩ (Y ) ∧ ⟨⟨(vy)⋆ ⟩⟩ (X ∪ Y ) ∧ S ∩ Y = ∅
∧R = ((S \ X) ∪ Y ))
where
⟨⟨v ⋆ ⟩⟩′ (X, S) ≡ ∃P . min(X) ∈ P ∧ max(X) + 1 ∈ P
∧∀r, r′ . consecutive(r, r ′ , P )
⇒ ∃Q . strr(Q, X, r, r′ ) ∧ substr(Q, S) ∧ ⟨⟨v⟩⟩ (Q)
[|indexOf|] ∨
(p, P, P1 , P2 ) ≡
P1 (P ) ∧ v∈V ((∃P2 . “v”(P2 ) ∧ indexOf(p, P, P2 ))
∧(min(P ) ≤ p ⇒ (∀P2 , p′ . “v”(P2 ) ∧ indexOf(p′ , P, P2 )
∧min(P ) ≤ p′ ⇒ p ≤ p′ )))
where
indexOf(p, P, Q) ≡ (substr(Q, P ) ⇒ ((Q ̸= ∅ ⇒ min(Q) = p)
∧(Q = ∅ ⇒ min(P ) = p)))
∧(¬substr(Q, P ) ⇒ p < min(P )).
Figure 5: Utility predicates
⟨⟨T ⟩⟩
⟨⟨Tx ⟩⟩
⟨⟨N1 N2 ⟩⟩
→
→
→
⟨⟨N1 | N2 ⟩⟩
⟨⟨N ⋆ ⟩⟩
→
→
λS . ‘T ’(S)
λS . progx (S)
λS . ∃S1 , S2 . ⟨⟨N1 ⟩⟩ (S1 ) ∧ ⟨⟨N2 ⟩⟩ (S2 )
∧concat(S, S1 , S2 )
λS . ⟨⟨N1 ⟩⟩ (S) ∨ ⟨⟨N2 ⟩⟩ (S)
λS . ∃P . min(S) ∈ P ∧ max(S) + 1 ∈ P
∧∀r, r ′ . consecutive(r, r ′ , P )
⇒ ∃Q . strr(Q, S, r, r ′ ) ∧ ⟨⟨N ⟩⟩ (Q)
[|substring|] (R, Ps , Pn , Pm ) ≡
∃S, n, m . P(S) ∧ Pn (n, S) ∧ Pm (m, S) ∧ strr(R, S, n, m)
[|contains,
1|] (R, c, progv1 , progv2 ) ≡

.
prog
 ∃P
v2 (P ) ∧ substr(P, R)
∨
¬(∃P . “s”(P ) ∧ substr(P, R))
 s∈S
true
Figure 6: Encoding regular expressions
when c = true
when c = false
otherwise
Figure 7: Abstractions of the built-in functions
3.2 Encoding Strings
tion λS.ϕ, instead of explicitly declaring a new predicate ψ such
that ψ(S) = ϕ.
Our encoding method treats a string value of size m as a position
set P of the same size. The positions in P are taken from a “global”
position set, {0, . . . , n − 1}, that represents a word w. Formally,
if w = a0 · · · an−1 and P = {p0 , · · · , pm−1 } is a sorted position
set, then P represents the string value s (of size m) iff P satisfies
s = ap0 ap1 · · · apm−1 . Given w = “abca”, the sets of positions
{0, 1} and {2, 3} represent the strings “ab” and “ca”, respectively.
Figure 5 lists utility predicates used for the encoding, where
“a”(P ) and concat(R, P, Q) are the same as those introduced in
Section 2.2. Intuitively, strr(R, P, p, q) denotes that a string represented by R is the substring represented by P containing all the
characters in the range [p, q). substrr(R, P, p, q) is similar to strr,
the difference being that R may be any substring. min(P ) and
max(P ) return the minimum and maximum positions in P , respectively. Finally, predicate consecutive(p, q, R) denotes that
positions p and q are consecutive in position set R. This predicate is used to encode the Kleene closure of a regular language (cf.
consecutive_in_set, as described in [20]).
Our algorithm for encoding regular expressions is the same as
that of [20], except that we accept program variables as terminal
symbols. Figure 6 shows how to encode a set of strings represented
by a regular expression r as a predicate denoted by ⟨⟨r⟩⟩. T represents a terminal symbol (i.e., a character), and Tx represents a
terminal symbol associated with program variable x, where progx
denotes a property of strings possibly assigned to x. N ,N1 , and
N2 represent nonterminal symbols. In addition, we use the nota-
3.3
Abstracting Built-in Functions
We denote the abstraction of a built-in function f by [|f |]. The parameters of a built-in function are implicitly represented by higherorder variables, each of which is instantiated by a predicate representing a property of the relevant actual parameter. Thus, all
higher-order variables are instantiated by the end of the encoding
process. For example, the higher-order variables P1 and P2 used
in the encoding of the string-concatenation program in Section 2.2
are instantiated by predicate progv1 , thus yielding:
progv2 (V2 ) ≡
∃P1 , P2 . progv1(P1 ) ∧ progv1(P2 ) ∧ concat(V2 , P1 , P2 ).
The examples of the abstractions we developed for the built-in
functions are listed in Figure 7. Here, we focus only on [|replace|],
which abstracts the Java method replace, where s.replac
e(x,y) substitutes all occurrences of x in s with y. Discussion
of the other functions is deferred to Section 4, where extensions of
the core algorithm needed by the corresponding abstractions are introduced. In our abstraction of replace, V is the set of concrete
strings possibly assigned to x, X represents the set of positions
to be removed from position-set S, and Y represents a set of positions to be inserted. Predicate ⟨⟨y ⋆ ⟩⟩ encodes regular expression
y ⋆ , where y is the program variable corresponding to Py . Predicate ⟨⟨(vy)⋆ ⟩⟩ constrains X and Y to guarantee that each pair of
removed and inserted positions is consecutive, and predicate ⟨⟨v ⋆ ⟩⟩′
encodes regular expression v ⋆ , and constrains X to contain only
169
[|x := v|]
→progx (R) ≡ “v”(R)
[|x := f (x1 ,· · ·, xn )|] →progx (R) ≡ [|f |] (R, progx1 ,· · ·, progxn )
[|x := phi(x1 ,. . ., xn )|] →progx (R) ≡ progx1 (R) ∨· · ·∨ progxn (R)
The possible set of strings assigned to v2 can be represented by
CFG v2 → v0 | v3 , v3 → v2 v1, where v0 and v1 are considered
terminal symbols (being non-cyclic). This CFG is overapproximated by the regular expression v0 v1⋆ , which is encoded in M2L
as ⟨⟨v0 v1⋆ ⟩⟩. Here are the resulting predicate declarations:
Figure 8: Encoding Instructions
progv0 (R) ≡ “ab”(R)
progv1 (R) ≡ [|toUpperCase|] (R, progv0 )
progv2 (R) ≡ ⟨⟨v0 v1⋆ ⟩⟩ (R)
progv3 (R) ≡ ∃V1 , V2 .progv2 (V2 ) ∧ progv1 (V1 )
∧concat(R, V2 , V1 )
When cyclic variable x is affected by other string operations, we
use the naïve abstraction progx (R) ≡ true, which holds for any
string value.
Alternatively, we could automatically find a loop invariant invx ,
for x, using the character-set abstraction [8], where a position set
variable is used as a set of characters by ignoring the order of characters in a string. This predicate can be used instead of progx .
Note that the strongest loop invariant on character sets, in the sense
of the smallest character set that satisfies cyclic string constraints,
is necessary, since our purpose is to check the existence of an unsafe string. Otherwise, the naïve abstraction is allowed to be a loop
invariant. In addition, finding the smallest character set requires
the subset relation between character sets, whereby the equality of
characters is also required. Our experience suggests, however, that
computing the smallest character set, which involves checking the
equality of characters, is an expensive process whose advantage
over the naïve approach is negligible.
positions removed from S. The reason why we need to compute
the set V of the concrete strings is that our analysis is designed
to be conservative. (As will be discussed later, indexOf and
contains pose a similar requirement.) Consider the following
Java code fragment:
String t = some_condition ? "a" : "b";
String u = "ab".replace(t,"z");
where t’s value is either “a” or “b”. However, replacing “a” and
“b” in the string “ab” with “z” yields “zz”, while the actual result
should be either “az” or “zb”.
In practice, a separate analysis technique (e.g., constant propagation [34]) can be used to obtain the set of concrete values corresponding to the relevant strings, which are then reflected in the
abstraction. We emphasize that other string-analysis techniques
also require this information. For example, both Minamide’s algorithm [23] and JSA [8] approximate the replace("a","b")
operation with an automaton, but not the function replace.
Having reviewed our abstraction of replace, we note that in
some cases, defining the abstraction of a built-in function using a
finite-state transducer, as described in [23], is easier than directly
constructing an M2L formula. We use a simple approach to translate a string transducer into an M2L predicate: Since a transducer
can be viewed as a finite-state automaton with output characters,
we can represent it using the regular-expression notation, while denoting tuples of input characters and output characters as described
in [18]. For example, a transducer for replace("a","b") can
be represented by ((âb)|(ÂA))⋆ , where A matches any character,
and the notation ĉo1 . . . on means that c is an input character followed by the output characters o1 , · · · , on . We can then encode
this regular expression using the algorithm in Figure 6.
3.5
Here, we describe soundness of our encoding method. Note
that only loop-free programs are addressed, since we rely on the
grammar-based abstraction by [8, 23] and the trivial widening operation that always yields the naïve abstraction progx (S) = true
to handle loops.
We first introduce the notation Lw,I (ψ) to denote the set of
strings represented by R that satisfy ψ(R) given finite string w
and assignment I. 3
3.4 Encoding Instructions
D EFINITION 1 (G ENERATED L ANGUAGE ).
Lw,I (ψ) ≡ {s | w, I |= ∀R . “s”(R) ⇒ ψ(R)}
Encoding an instruction amounts to producing a set of M2L predicate declarations. Similar to the abstraction of a built-in function,
we denote the encoding of instruction I by [|I|].
We say that a left-hand variable is cyclic if it is defined depending on itself. Such a cyclic variable can only appear in a program
with loops (and recursions). If there are no cyclic variable definitions, then the abstractions in Figure 8 apply. The abstractions
of return instructions, calls to user-defined functions, goto instructions, and jump instructions are omitted, since this section assumes
an intraprocedural and path-insensitive analysis for simplicity.
When a cyclic variable is affected only by string concatenations,
we use the approach of [8]: We first construct a Context-Free Grammar (CFG) for the cyclic variable, where cyclic variables are considered nonterminal symbols, and then approximate the resulting
CFG with a regular expression. Finally, we encode the regular expression in M2L using the algorithm shown in Figure 6.
As an example, consider the following loop program and its corresponding SSA program containing cyclic variables v2 and v3:
String v0 = "ab";
1:v0 =
v1 =
String v1 =
2:v2 =
v0.toUpperCase();
i0 =
String v2 = v0;
b0 =
while (v2.length()<10) {
b1 =
v2 = v2.concat(v1);
jump
}
...
3:v3 =
goto
4: ...
Soundness of the Encoding Method
The set [|P |] of predicate declarations, which is obtained by encoding the program P , is sound if for every program variable x,
there exists a finite string w and an assignment I, such that v ∈
Lw,I (progx ), where v is a string value assigned to variable x and
[|P |] = {progx1 , · · · , progxn }.
T HEOREM 1 (S OUNDNESS OF THE E NCODING M ETHOD ).
∃w, I . ∀x ∈ dom(σ) . σ(x) ∈ Lw,I (progx )
where σ represents a program state (interpreted as a mapping from
program variables to values).
The above soundness criterion holds as long as for every string operation f , the abstraction [|f |] satisfies the following condition:
D EFINITION 2 (S OUND A BSTRACTION OF F UNCTION f ).
∀r, p1 , · · · , pn . r = f (p1 , · · · , pn )
⇒ ∀w, I, ψ1 , · · · , ψn . p1 ∈ Lw,I (ψ1 ) ∧ · · · ∧ pn ∈ Lw,I (ψn )
⇒ ∃w′ . r ∈ Lww′ ,I (λR. [|f |] (R, ψ1 , · · · , ψn ))
"ab"
toUpperCase(v0)
phi(1:v0,3:v3)
length(v2)
op(<)(i0,10)
op(neg)(b0)
b1, 4
concat(v2,v1)
2
where ww′ is the concatenation of finite strings w and w′ , and
Lw,I (λR.ψ(R)) is short for {s | w, I |= ∀R.“s”(R) ⇒ ψ(R)}.
3 This definition can be viewed as a concretization function in abstract interpretation, where there is no best abstraction as in the case of regular
languages [9], which is not a complete partial order.
170
[|x := n|]
→progx(p,S) ≡ posn (p, S)
[|x := f (x1 ,· · ·, xn )|] →progx(p,S) ≡ [|f |] (p, S, progx1 ,· · ·, progxn )
[|x := phi(x1 ,. . ., xn )|]→progx(p,S) ≡ progx1(p,S)∨· · ·∨progxn(p,S)
progv4 (R) ≡ ∃V1 , v0 , v3 . progv1 (V1 )
∧progv0 (v0 , V1 )∧progv3 (v3 , V1 )∧substrr(R, V1 , v0 , v3 )
4.2 Handling Branch Conditions
Figure 9: Additional abstraction of instructions for indices
PISA employs a simple form of path sensitivity, which provides
the ability to record the effects of branch conditions on a specific
variable by encoding them as M2L predicates. For example, branchcondition v.equals("a") constrains variable v. Thus, if the
test succeeds, we encode the relevant constraint as M2L-predicate
ϕv (R) = “a”(R). When encoding the true branch of a condition
as a set of predicate declarations, we use predicate prog′v (R) =
progv (R) ∧ ϕv (R), rather than progv (R), to represent the values
possibly assigned to program-variable v.
Figures 10 and 11 present the encoding for path-sensitive analysis, where the notation [|I|]b represents encoding instruction I in
basic block b. Boolean operators ∧ and ∨ are used for notational
brevity: For predicates ψ1 and ψ2 , ψ1 ∧ ψ2 [ψ1 ∨ ψ2 ] represents a predicate ψ such that ψ(R) = ψ1 (R) ∧ ψ2 (R) [ψ(R) =
ψ1 (R) ∨ ψ2 (R)]. In addition, we lift the lambda notation λR.ψ ′
to represent a predicate ψ such that ψ(R) = ψ ′ , without explicitly
declaring predicate ψ.
Path-condition [27, 15] P C(b0 , b′ ) represents a necessary condition for flow from basic block b0 to b′ . We use the notation
P C ′ (b0 , b, b′ ) to distinguish between ϕ-induced assignments, and
thus represent a necessary condition for flow from b0 to b′ through
b, which is an immediate predecessor of b′ . Conditions are formed
using Boolean program variables and logical operators.
Figure 11 describes our encoding of the effects of path conditions. Given variable v and basic blocks b and b′ , such that b is an
immediate predecessor of b′ , we define C(v, b, b′ ). M2L predicate
C(v, b, b′ ) represents a necessary condition for v to cause the transition from b to b′ . The definition uses function C ′ , which syntactically and recursively transforms the path condition into an M2L
predicate. In the figure, [|f, m|] (t, R, progv1 , . . . , progvn ) represents a predicate restricting R via a necessary condition on the m-th
parameter of Boolean method f , when f returns t. Such an abstraction should be predefined for each Boolean method, as in the case
of built-in functions. In the absence of an abstraction, we default
to true. def(v ′ ) represents an instruction, v ′ = f (v1 , · · · , vn ),
which defines program variable v ′ .
For a proof of Theorem 1, the reader is referred to the research
report [28], where the proof is done by induction on a transition
system that defines the semantics of the SSA program.
4. INDEX- AND PATH-SENSITIVITY
This section describes how to augment the core algorithm with
the index sensitivity and the path sensitivity.
4.1 Handling String Indices
An index is encoded as a position and position-set pair. For example, if string “ace” is encoded as position set {0, 2, 4} in M2L,
then index 1 into it is encoded as the pair (2, {0, 2, 4}). More
generally, we introduce the following M2L predicates to represent indices: pos0 (p, S) ≡ (p = min(S)), · · · , posn (p, S) ≡
posn−1 (p, S \ min(S)), where posn (p, S) means that position p
in position-set S represents index n into a string represented by S.
When encoding instructions, PISA accounts for indices following the rules in Figure 9, where a predicate progx takes a position
and position-set pair as its arguments. Those rules apply when the
left-hand-side variable in an instruction assumes a value representing an index into a string. Note that any numerical expression of
the form n + N can be encoded, where n is a variable and N is
a constant. However, since M2L cannot directly encode numerical
expressions of the form n + m, where m is also a variable, for such
expressions PISA over-approximate it by posany (p, S) = p ∈ S,
which represents an arbitrary index into a string represented by S.
This same encoding is also used for a cyclic variable.
With index sensitivity at its disposal, PISA can model string
operations such as indexOf and substring, where indexO
f(s1,s2) returns the first index in s1 at which s2 occurs, whereas
substring(s,n,m) extract from s the substring ranging between
indices n and m. These methods are abstracted as shown in Figure 7. The first and second parameters in the [|indexOf|] formula represent a position and a string containing it, respectively,
while V is the set of concrete values possibly assigned to s2 in
indexOf(s1,s2). Intuitively, indexOf(p, P, Q) holds if Q is a
substring of P starting at the index represented by (p, P ). By requiring p ≤ p′ , we choose the minimal index among the candidates
that satisfy indexOf(p, P, Q). In addition, due to the restriction
about the minimal index, we need a set V of concrete string values possibly assigned to s2 of indexOf(s1,s2). It should be
obtained by another analysis as in the case of [|replace|] 4 . Otherwise, [|indexOf|] involves the same problem as [|replace|].
As an example, consider the following SSA program fragment:
v0 = 0; v1 = "a<b";
v3 = indexOf(v1,v2);
[|x := v|]b → progx (R) ≡ “v”(R)
[|x := f (x1 ,· · ·, xn )|]b
→ progx (R) ≡ [|f |] (R, progx1 ∧C(x1 , b),· · ·, progxn ∧C(xn , b))
b
[|x := phi(b1 : x∨
1 ,. . ., bn : xn )|]
→ progx (R) ≡ i∈{1,···,n} progxi (R) ∧ C(xi , bi , b)
Figure 10: Abstraction for path-sensitive string analysis
C(v, b, b′ ) ≡ C ′ (true, v, P C ′ (b0 , b, b′ ))
C ′ (t, v, c) ≡

 λR . [|f, m|] (t, R, progv1 , . . . , progvn )


when c = f (v1 , . . . , vn ), and v = vm




 C ′ (t, v, def(v ′ )) when c = v ′ , where v ′ is a program variable



C ′ (true, v, c′ ) when c = ¬c′ and t = false

 ′
C (false, v, c′ ) when c = ¬c′ and t = true
C ′ (true, v, c1 ) ∨ C ′ (true, v, c2 ) when c = c1 ∨ c2 and t = true


 C ′ (true, v, c ) ∧ C ′ (true, v, c ) when c = c ∧ c and t = true

1
2
1
2



 C ′ (false, v, c1 ) ∧ C ′ (false, v, c2 ) when c = c1 ∨ c2 and t = false


′
′

 C (false, v, c1 ) ∨ C (false, v, c2 ) when c = c1 ∧ c2 and t = false

λR . true
otherwise
v2 = "<";
v4 = substring(v1,v0,v3);
We obtain the following set of predicates after expanding the definitions of [|indexOf|] and [|substring|]:
progv0 (n, S) ≡ pos0 (n, S)
progv1 (R) ≡ “a<b”(R)
progv2 (R) ≡ “<”(R)
progv3 (p, P ) ≡ progv1 (P )∧(∃P2 .“<”(P2 )∧indexOf(p, P, P2 ))
∧(min(P ) ≤ p ⇒ (∀P2 , p′ .“<”(P2 ) ∧ indexOf(p′ , P, P2 )
∧min(P ) ≤ p′ ⇒ p ≤ p′ ))
4 If the minimal index is not required, the abstraction becomes simpler so as
not to require the concrete strings, but makes the analysis too conservative
to check the existence of unsafe characters.
Figure 11: Constraint on program-variable v when the execution transitions from basic block b to basic block b′
171
void
f1(String v1) { String s1 = f3(v1); }
void
f2(String v2) { String s2 = f3(v2); }
String f3(String v3) { return v3; }
void detectSanitizers(
Set<Method> M, Set<Pattern> P, // input
Set<Pair<CGNode,Pattern>> R) { // output
CallGraph cg = callgraphOf(M);
for (Method m : M) {
Set<CGNode> N = nodes(cg, m);
for (CGNode n : N) {
Set<Instruction> I = instructionsOf(n, cg);
Set<Variable> V = returnVariablesOf(n);
for (Pattern p : P) {
boolean r = doStringAnalysis(I, V, p);
if (r) R.add(new Pair(n, p)); } } } }
Figure 12: Sample Java Program for Interprocedural Analysis
Figure 10 uses C(v, b, b′ ) to encode ϕ instructions. The variables used by the ϕ statement are each constrained by taking the
relevant immediate predecessor of b′ into account. For the other instructions, we do not need to consider the immediate predecessor.
Thus,
we simply use the notation C(x, b′ ), which is equivalent to
∨
′
′
b∈pred(b′ ) C(x, b, b ), where pred(b ) represents a set of immedi′
ate predecessors of basic block b .
With path-sensitivity, we can abstract contains, as described
in Figure 7. In [|contains,1|], S is the set of concrete strings
assigned to v2 . Note that ¬(∃P . progv2 (P ) ∧ substr(P, R))
cannot be used when c = false since the analysis is conservative.
As an example, consider the following Java method, along with its
corresponding SSA representation, where op(or) and op(neg)
represent logical disjunction and negation, respectively:
Figure 13: Outline of the Sanitizer-detection Algorithm
sanitizer-detection algorithm, and we then present the implementation of that algorithm. We then discuss two sets of evaluations: In
the first set, we investigate how many user-defined sanitizers in the
application code of 8 open-source benchmarks PISA was able to
detect. In this first set of evaluations, we used two variants of PISA,
and we compared PISA to alternative algorithms. In the second set,
we examine the overall impact of PISA on a production-level taintanalysis algorithm.
String clean(String s) { 1: v1 = "<"; v2 = ">";
b1 = contains(s,v1);
if (s.contains("<") ||
b2 = contains(s,v2);
s.contains(">")) {
z0 = op(or)(b1,b2);
s = "x";
z1 = op(neg)(z0);
}
jump z1,3;
return s;
}
2: v4 = "x";
3: v5 = phi(1:s,2:v4);
return v5;
6.1
Sanitizer Detection
Our algorithm for detecting sanitizers consists of two steps. The
first step is to find sanitizer candidates based on a syntactic check:
An input pattern ranging over method signatures is used to focus
the analysis on methods that are likely to act as sanitizers (e.g.,
methods accepting a single String argument and returning a St
ring object). Figure 13 shows the algorithm of the second step.
This phase consumes the set M of candidate methods and the set P
of unsafe string patterns, and outputs the set R of the pairs of the
form (n,p) such that the method corresponding to callgraph node
n is a sanitizer for unsafe pattern p, where each pattern is used to
categorize detected sanitizers. Note that sanitizer categorization is
essential for taint analysis, since a method is typically a sanitizer
only for certain types of attack. The procedure comprises the following steps:
1. callgraphOf(M) builds the callgraph cg rooted at the set
M of sanitizer candidates. Any callgraph-construction algorithm
can be chosen to build cg (e.g., context insensitive or context
sensitive, with various levels of context sensitivity [14, 13]).
2. For each sanitizer candidate, we obtain the set of callgraph nodes
representing it.
3. For each node n, a set I of instructions is then computed using
function instructionsOf(n), which returns the set of instructions in n and in all the nodes transitively called by n. In
addition, for each caller/callee pair, assignment relationships between actual parameters in the caller node and formal parameters
in the callee node are included in I. This allows us to perform
the interprocedural analysis described in Section 5.
4. For each unsafe pattern p, return variables of node n are extracted by returnVariablesOf(n), and then verified by
the string analysis, doStringAnalysis(I,V,p), where the
string analysis returns true iff the set of potential string values taken by the return variables V of node n never contains the
unsafe pattern p. If the string analysis returns true, then n is
reported as a sanitizer for the unsafe string pattern p.
For this method, we obtain the path conditions: P C ′ (1, 1, 2) =
b1∨b2, P C ′ (1, 1, 3) = ¬(b1∨b2), and P C ′ (1, 2, 3) = b1∨b2.
The resulting predicates are:
C(v4, 2, 3) = C ′ (true, v4, P C ′ (1, 2, 3)) = λR . true
C(s, 1, 3) = C ′ (true, s, P C ′ (1, 1, 3))
= λR . [|contains, 1|] (R, false, progs , progv1 )
∧λR . [|contains, 1|] (R, false, progs , progv2 )
= λR . ¬(∃P . “<”(P ) ∧ substr(P, R))
∧λR . ¬(∃P . “>”(P ) ∧ substr(P, R)).
5. INTERPROCEDURAL ANALYSIS
Our interprocedural version of the string analysis relies on a callgraph, where each node of the callgraph contains a set of instructions in the SSA form and these instructions are translated into
a set of M2L predicate declarations using our encoding method.
The relationships among callgraph nodes are used to obtain possible assignment relationships between caller’s program variables
and callee’s program variables (parameters and return variables).
The assignment relationships are encoded in M2L as if those are
assignment instructions.
Let us consider a context-insensitive callgraph for three Java’s
methods shown in Figure 12, where the callgraph has three nodes
n1 , n2 , and n3 for the methods f1, f2, and f3, respectively. We
obtain the four assignment relationships v3=v1, v3=v2, s1=v3,
s2=v3. These assignment relationships are then encoded in M2L
using the encoding method described in Section 3.4, where, due
to two possible assignment to v3, the first two relationships can
be encoded in the same way to encode the ϕ-instruction v3=ph
i(v1,v2).
Note that this approach simply ignores the call stack. Therefore, context-sensitivity of the callgraph affects the precision of the
string analysis.
6.2
Implementation
We implemented the algorithm described in Section 6.1 as a Java
program using the T. J. Watson Libraries for Analysis (WALA)
framework [32], which calls MONA [16] as a command-line program to check the satisfiability of M2L formulae. In addition, the
implementation embodies the following sound optimizations:
6. IMPLEMENTATION AND EVALUATION
We applied PISA to a production-level taint-analysis engine for
the purpose of automatic detection of user-defined sanitizers (sanitizers defined in application code). In this section, we describe our
172
Table 1: Statistics on benchmark applications
App. name
SBM
Blojsom
PersonalBlog
Roller
SnipSnap
Webgoat
JSPWiki
MVNForum
Total
Version
1.08
3.1
1.2.6
0.9.9
1.0-BETA-1
5.1
2.6
1.0.2
Classes
143
255
69
283
614
193
503
820
LOC
5,541
13,967
5,317
41,589
46,962
33,906
81,301
142,954
Table 2:Accuracy results of the sanitizer-detection experiment
Candidates
15
51
7
56
52
43
91
56
371
SBM
CFG
PSA/CI
PISA/CI
PISA/CS
Blojsom
CFG
PSA/CI
PISA/CI
PISA/CS
PersonalBlog CFG
PSA/CI
PISA/CI
PISA/CS
Roller
CFG
PSA/CI
PISA/CI
PISA/CS
SnipSnap
CFG
PSA/CI
PISA/CI
PISA/CS
Webgoat
CFG
PSA/CI
PISA/CI
PISA/CS
JSPWiki
CFG
PSA/CI
PISA/CI
PISA/CS
MVNForum CFG
PSA/CI
PISA/CI
PISA/CS
Total
CFG
PSA/CI
PISA/CI
PISA/CS
• String constants whose length exceeds fixed-size n are overapproximated by a disjunctive regular expression that ranges over
a partitioning of the original string, where each partition (except,
maybe, the suffix) is of size n. Our experiments used n = 5.
For example, “longstringvalue” is over-approximated by
regular expression “(longs|tring|value)+”.
• The verification of each method has a 30-second time limit, since
verifying all of the methods within a reasonable time is more
important than verifying a particular method for a long time. If
the limit is reached without having concluded the analysis, we
conservatively over-approximate the behavior of the method by
concluding that it is not a sanitizer.
• We did not define the last-position term , $, of M2L(Str), when
simulating M2L(Str) in MONA, since it was not required by our
encoding method.
• The path-condition analysis is run only if the SSA form of the
target method consists of less than 50 basic blocks. Otherwise,
the analysis is still sound because it conservatively becomes pathinsensitive.
XSS
TP FN Score
0 3 0
3 0 100
3 0 100
3 0 100
0 8 0
0 8 0
0 8 0
0 8 0
0 4 0
1 3 25
1 3 25
1 3 25
2 10 17
4 8 33
4 8 33
4 8 33
1 3 25
4 0 100
4 0 100
4 0 100
5 16 24
14 7 67
14 7 67
14 7 67
5
6
6
8
3
3
3
3
16
35
35
37
HRS
TP FN Score
0 3 0
3 0 100
3 0 100
3 0 100
1 2 33
1 2 33
1 2 33
1 2 33
0 3 0
0 3 0
0 3 0
0 3 0
2 3 40
2 3 40
2 3 40
2 3 40
1 3 25
4 0 100
4 0 100
4 0 100
7 9 44
7 9 44
7 9 44
9 7 56
5
6
6
8
4
4
4
4
20
27
27
31
LOG
TP FN Score
0 3 0
3 0 100
3 0 100
3 0 100
0 2 0
0 2 0
0 2 0
0 2 0
0 3 0
0 3 0
0 3 0
0 3 0
2 3 40
2 3 40
2 3 40
2 3 40
1 3 25
4 0 100
4 0 100
4 0 100
5 8 38
5 8 38
5 8 38
7 6 54
5
6
6
8
3
3
3
3
16
23
23
27
PATH
TP FN Score
0 3 0
3 0 100
3 0 100
3 0 100
1 4 20
1 4 20
2 3 40
2 3 40
0 3 0
0 3 0
0 3 0
0 3 0
2 3 40
2 3 40
2 3 40
2 3 40
1 7 13
4 4 50
5 3 63
5 3 63
6 9 40
6 9 40
8 7 53
10 5 67
6
6
6
8
3
3
3
3
19
25
29
33
Total
TP FN Score
0 12 0
12 0 100
12 0 100
12 0 100
2 16 11
2 16 11
3 15 17
3 15 17
0 13 0
1 12 8
1 12 8
1 12 8
8 19 30
10 17 37
10 17 37
10 17 37
4 16 20
16 4 80
17 3 85
17 3 85
23 42 35
32 33 49
34 31 52
40 25 62
21
24
24
32
13
13
13
13
71
110
114
128
puting invariants. This computation is done by iterating the grammar transduction and by using the character-set approximation [8,
23] as a widening operation.
Table 2 shows how many sanitizers were detected by PISA on the
8 benchmark applications in comparison with the CFG-based string
analyzer. If a method automatically detected by PISA is a true sanitizer, the method is counted as a true positive (TP). Otherwise,
it is a false positive (FP). If PISA fails to detect a true sanitizer,
that result is counted as a false negative (FN). The score is calculated as 100 × TP/(TP + FP + FN). Note that the false negatives
are counted only for the relatively small 6 benchmark applications,
since detecting false negatives requires manually reviewing all of
the candidates to find the true sanitizers that are not detected by
PISA—a very time-consuming and error-prone operation. In addition, Table 2 shows only true positives and false negatives, since
the sanitizer detection reported no false positives. Note that false
positives in the sanitizer detection are caused only when the string
analysis is not conservative and fails to infer unsafe strings that
arise at runtime.
Table 3 shows running time and the statistics reported by MONA.
The Abort column shows the total time spent analyzing candidates
that were aborted due to either the time limit we set in our analysis or a size limit in MONA. The number of aborted candidates is
shown in parentheses. For PISA, the Enc and Ver columns show the
time spent on translating the callgraph into a MONA program and
the running time of MONA, respectively. For the CFG-based analysis, the Enc column shows the time spent on translating the callgraph into a set of production rules, and the Ver column shows the
total time spent on inference and containment checks. The MONA
Statistics column shows summaries of the statistics reported by
MONA. The Pred and DAG columns show the average/maximum
numbers of generated predicates and DAGs (graph representations
of formulae), respectively. The State and BDD columns show the
average and maximum numbers of elements in the largest sets of
states and Binary Decision Diagram (BDD) nodes of minimized
6.3 Evaluation
We ran our experiments on top of the Sun Java Runtime Environment (JRE) V1.6.0_06 with 1 GB of maximum heap size using a
Lenovo ThinkPad T61p with a Core2 Duo T7800 2.6GHz CPU and
3 GB of RAM. Statistics on the benchmark applications we used
are shown in Table 1. The Candidates column reflects the number
of methods in application code (i.e., ignoring imported libraries and
JUnit test classes) that accept a single parameter of type String
and return a String value. This is the criterion we used for identifying sanitizer candidates. (Note that in some cases, methods that
were not intended to be used as sanitizers may be accepted by both
the criterion we just described and the ensuing analysis.)
Our study focused on the following four types of attack: XSS,
HRS, Log Forging (LOG) and Path Traversal (PATH) [2]. The
corresponding regular-expression patterns used in the specification
are “.*[<>].*” (HTML tags), “.*[\r\n].*” (strings of multiple
lines), “.*[\r\n\x08].*” (strings of multiple lines and backspace)
and “.*\.\./.*” (strings representing relative paths), respectively.
6.3.1 Sanitizer Detection
Our experiment on sanitizer detection comprised four configurations: The first two, PISA/CI and PISA/CS, both embody the
PISA algorithm, the difference between them being that the former
is based on a context-insensitive (0-CFA) callgraph, whereas the
latter relies on a context-sensitive (1-CFA) callgraph [14, 13]. Recall from Section 5 that PISA depends on its underlying callgraph
in various ways, and thus the overall accuracy of PISA derives,
in part, from the precision of its supporting callgraph. The two
remaining candidates are PSA/CI, which is a variant of PISA/CI
employing only path sensitivity (and thus lacking the abstract models for indexOf and lastIndexOf), and the CFG-based string
analysis of [12], which is based on [23]; this analysis is neither
path- nor index-sensitive, but can handle cyclic variables by com-
173
static final String PUNCTUATION_CHARS_ALLOWED
= " ()&+,-=._$";
static String cleanLink(String link){
return cleanLink(link, PUNCTUATION_CHARS_ALLOWED);
}
static String cleanLink(String link,
String allowedChars){
if (link == null) return null;
link = link.trim();
StringBuffer clean=new StringBuffer(link.length());
boolean isWord = true; boolean wasSpace = false;
for (int i = 0; i < link.length(); i++){
char ch = link.charAt(i);
if (Character.isWhitespace(ch)) {
if (wasSpace) continue;
wasSpace = true;
} else { wasSpace = false; }
if (Character.isLetterOrDigit(ch)||
allowedChars.indexOf(ch) != -1) {
if (isWord) ch = Character.toUpperCase(ch);
clean.append(ch); isWord = false;
} else { isWord = true; }
}
return clean.toString();
}
private static String getFileName(String s) {
String fileName = new File(s).getName();
if (fileName.indexOf("/") != -1) {
fileName =
fileName.substring(fileName.lastIndexOf("/"),
fileName.length());}
if (fileName.indexOf(".") != -1) {
fileName =
fileName.substring(0,fileName.indexOf("."));}
return fileName;
}
public String getCcnParameter(String name){
return getRegexParameter(name, "\\d{16}");
}
public String getPhoneParameter(String name){
return getRegexParameter(str, "[\\d\\s-]+");
}
private String getRegexParameter(String name,
String regexp) {
String param = getStringParameter(name,regexp);
if (Pattern.matches(str,regexp)) return str;
else return "";
}
Figure 14: Sanitizers detected by PISA
WED) in the link using the branch condition. Method getFileN
ame checks for the existence of the illegal characters, and extracts
the substring between “/” and “.” using the methods substrin
g, indexOf, and lastIndexOf.
In terms of efficiency, in our experiments PISA/CI was almost
4.2 times slower than the CFG-based string analyzer. According to
our observations, the reason is that PISA/CI has to deal with branch
conditions as well as integer and string values. However, when only
the verification time was considered, PISA/CI was only 1.4 times
slower than the CFG-based string analyzer. Also, the CFG-string
analyzer directly uses the automata given as the specification for
checking the inferred CFGs, but PISA/CI generates M2L predicate
declarations from the regular expressions, and MONA interprets
those predicates each time PISA/CI verifies a sanitizer candidate.
Table 3: Running Times and MONA Statistics
Time(sec)
SBM
CFG
PSA/CI
PISA/CI
PISA/CS
Blojsom CFG
PSA/CI
PISA/CI
PISA/CS
Personal-CFG
Blog
PSA/CI
PISA/CI
PISA/CS
Roller CFG
PSA/CI
PISA/CI
PISA/CS
SnipSnapCFG
PSA/CI
PISA/CI
PISA/CS
Webgoat CFG
PSA/CI
PISA/CI
PISA/CS
JSPWiki CFG
PSA/CI
PISA/CI
PISA/CS
MVN- CFG
Forum PSA/CI
PISA/CI
PISA/CS
Total
CFG
PSA/CI
PISA/CI
PISA/CS
MONA Statistics
Average
Maximum
Abort Enc Ver Total PredDAGState BDD Pred DAG State BDD
0.0(0) 0.1 0.3 0.4
0.0(0) 2.0 6.9 8.9 32 278 187 2,316 134 748 923 11,352
0.0(0) 2.0 6.7 8.7 32 278 187 2,316 134 748 923 11,352
0.0(0) 2.0 6.8 8.8 32 278 187 2,316 134 748 923 11,352
0.0(0) 1.8 7.9 9.7
30.3(1) 4.1 17.9 52.3 18 197 49 481 54 615 923 10,679
30.2(1) 4.2 18.3 52.7 18 204 49 473 71 841 922 10,647
120.3(4) 89.1 19.6 229.0 35 255 55 542 370 1,119 922 10,647
0.0(0) 0.3 0.2 0.5
0.0(0) 0.9 1.9 2.8 23 215 32 235 77 519 129 1,182
0.0(0) 0.9 1.9 2.8 23 215 32 235 77 519 129 1,182
0.0(0) 0.7 2.0 2.7 23 215 32 235 77 519 129 1,182
0.0(0) 4.1 4.5 8.6
0.0(0) 13.2 32.4 45.6 17 222 205 2,912 68 983 8,193124,926
0.0(0) 13.1 33.0 46.1 17 224 205 2,912 68 983 8,193124,926
0.0(0) 18.0 31.4 49.4 17 221 206 2,956 59 594 8,193124,926
0.0(0) 1.8 3.3 5.1
0.0(0) 9.7 20.5 30.2 25 234 129 1,411 126 679 1,908 22,571
0.0(0) 9.9 21.0 30.9 25 247 128 1,403 129 679 1,908 22,571
0.0(0) 21.2 22.6 43.8 30 262 145 1,601 241 865 1,908 22,571
0.0(0) 1.4 17.1 18.5
0.0(0) 8.5 33.7 42.2 91 420 75 814 683 1,893 633 8,002
0.0(0) 9.2 34.3 43.5 92 434 75 821 684 1,893 633 8,002
0.0(0) 8.4 32.7 41.1 72 407 73 803 596 2,401 633 8,002
0.0(0) 2.0 23.5 25.5
80.5(10) 396.0 36.7 513.2 41 271 97 1,317 493 1,3981,539 36,079
80.7(10) 394.8 36.7 512.2 41 272 97 1,318 493 1,3981,539 36,079
152.5(9) 388.3 43.8 584.6 43 286 122 1,595 487 1,8762,064 36,079
30.6(1) 1.3 92.0 123.9
44.7(1) 11.4 59.6 115.7 25 229 89 1,423 124 980 1,406 50,894
64.7(2) 11.2 23.9 99.8 25 234 70 698 124 980 552 6,305
47.2(1) 18.5 59.5 125.2 29 253 94 1,525 144 1,0581,406 50,894
30.6(1) 12.8 148.8 192.2
155.5(12)445.8209.6 810.9
175.6(13)445.3175.8 796.7
320.0(14)546.2218.41084.6
PISA/CI versus PSA/CI. In Blojsom, SnipSnap, and Web-
goat, PISA/CI detected 4 sanitizers for PATH vulnerabilities that
were not detected by PSA/CI, compared to 6 sanitizers for PATH
vulnerability that were not detected by the CFG-based string analyzer but detected by PSA/CI. We observed that index-based string
operations were used for replacing or removing unsafe substrings
for XSS and HRS, but the indices were calculated by loops and/or
numerical expressions (e.g., n + m, where n and m are variables)
that cannot be encoded in M2L.
PISA/CI versus PISA/CS. For Webgoat and JSPWiki, some
true-positive sanitizers detected by PISA/CS were not detected by
PISA/CI. Overall, PISA/CS detected 128 sanitizers in the 8 applications. The method getCcnParameter of Figure 14 is the
simplified version of ParameterParser.getCcnParamete
r of Webgoat, which was detected as a sanitizer by PISA/CS, but
which was not be detected by PISA/CI. The context-sensitive callgraph can distinguish among the callers of getRegexParamete
r. Thus, PISA/CS can determine that the return values of getR
egexParameter called by getCcnParameter matched only
the regular expression "\\d16". This allowed PISA/CS to detect
getCcnParameter as a sanitizer for HRS. In contrast, PISA/CI
could not detect getCcnParameter as a sanitizer for HRS since
getRegexParameter is called by both getCcnParameter
and getPhoneParameter and PISA/CI determined that the return values of getRegexParameter matched either "\\d16" or
"[\\d\\s]+". For the same reason, we had other true sanitizers
that were not detected by PISA/CI, but were detected by PISA/CS.
automata, respectively. Note that the number of the predicates is
almost equivalent to the number of instructions analyzed by PISA.
PISA/CI versus CFG-based String Analyzer. We summa-
rize the results for the 8 Web applications in the Total row, showing
that PISA/CI detected and categorized 114 sanitizers compared to
72 sanitizers detected by the CFG-based string analyzer.
Figure 14 shows several true sanitizers that were successfully
detected by PISA, but not detected by the CFG-based string analyzer. Here, we consider method cleanLink, defined in class
MarkupParser of JSPWiki, and method getFileName, defined in the class Course of Webgoat, mentioned above. The purpose of cleanLink is to keep only legal characters (letters, numbers, and characters specified by PUNCTUATION_CHARS_ALLO
174
final String entities[] = {"<", ">"};
final String refs[] = {"&lt;", "&gt;"};
String cleanByLoop(String s) {
for (int i = 0; i < entities.length; i++)
s = s.replaceAll(entities[i], refs[i]);
return s;
}
Table 4: Results of the taint-analysis-integration experiment
w/o pre-defined sanitizers
w/ predefined sanitizers
w/o PISA
w/ PISA
w/o PISA
w/ PISA
(Configuration A) (Configuration B) (Configuration C) (Configuration D)
XSSHRSLOGPATH XSSHRSLOGPATH XSSHRSLOGPATH XSSHRSLOGPATH
SBM
Blojsom
PersonalBlog
Roller
SnipSnap
Webgoat
JSPWiki
MVNForum
Total
String removeNonLetter(String str) {
String ret = "";
char[] cs = str.toCharArray();
for (int i = 0; i < cs.length; i++)
if (Character.isLetter(cs[i]))
ret = ret + cs[i];
return ret;
}
Figure 15: Sanitizers not detected by PISA
118
1
1
5
50
8
27
108
4 0
5 97
0 3
0 9
6 10
1 7
35 91
10 2
651
4 115 4 0
14 1 4 94
0
1 0 3
0
4 0 9
8 50 6 10
4
8 1 7
19 27 15 88
4 102 10 2
609
4 118 1 0
10 1 4 84
0
1 0 3
0
3 0 9
7 50 6 10
4
8 1 5
19 25 14 78
4 70 9 2
539
4 115 1 0
14 1 4 81
0
1 0 3
0
2 0 9
3 50 6 10
4
8 1 5
10 25 14 75
2 64 9 2
515
4
10
0
0
2
4
7
2
a team of security experts), and (c) using the most complete specification we can obtain (configuration D). The numbers in Table 4
confirm that this effect is significant: While configuration A yields
651 reports, of which 136 are ruled out as false findings by configuration D, configuration B ruled out 42 reports overall, which
represents 31% of the improvement gained by configuration D in
elimination of false reports, and 38% of the improvement gained
by configuration C. This means that thanks to PISA, the user can
still boost the accuracy of the security tool considerably and in a
fully automated fashion, without requiring any time or expertise
from the development team. In particular, PISA can account for
close to 40% of the benefit from the considerable effort invested by
a security team devising a manual specification.
Limitations and False Negatives. The false negatives, which
we found manually, were mainly caused by these limitations.
• The method cleanByLoop on Figure 15 should be a sanitizer
for XSS since it never returns a string value containing < or >.
However, neither PISA nor the CFG-based string analyzer can
detect it as a sanitizer, even though the CFG-based string analyzer can handle the loop. This is because the resulting CFG inferred by the CFG-based string analyzer contains a string value
that comes directly from the value of the parameter s. To solve
this problem, we might have to unroll the loops while propagating the constant string values. Other examples of the same problem include these methods that were not detected as sanitizers:
ParameterParser.htmlEncode, Screen.convertMe
tachars, and HtmlEncoder.encode.
• PISA’s path-sensitivity relies on constraints on individual local
variables that are directly checked by built-in Boolean functions.
Due to this limitation, method removeNonLetter in Figure
15, which is similar to Macros.removeNonAlphanumer
ic in Roller, could not be detected as a sanitizer since PISA
cannot determine that cs[i] used in the condition is the same
as cs[i] used in the then block.
• We experimented only with 0-CFA and 1-CFA. However, we
would need n-CFA (n>1) to detect some sanitizers in the benchmark applications. (E.g.: Macros.escapeHTML in Roller).
• Both PISA and the CFG-based string analyzer did not have a
complete set of abstractions for built-in string operations (E.g.:
java.text.SimpleDateFormat.format called by methods in PersonalBlog).
7.
RELATED WORK
Many string-analysis algorithms have been presented to date.
Java String Analyzer (JSA) was first introduced by Christensen, et
al. [8]. JSA approximates a string value by a regular language, allowing for statically checking errors in dynamically generated SQL
queries. According to the online manual [7], the latest version of
JSA has a form of path sensitivity through assertions, which is similar to ours. However, any experimental results of the path sensitivity has not yet been presented. Minamide [23] proposed approximating string values with a CFG and modeling built-in string
operations using transducers to check the well-formedness of dynamically generated HTML documents. Wassermann and Su [33]
extended Minamide’s algorithm to syntactically isolate unsafe substrings from safe substrings in PHP programs. Their string analyzer
can also account for regular-expression matches, but their paper
does not mention how to deal with branch conditions that consist
of string comparisons and Boolean operators as well as regularexpression matches. Yu, et al. [36] proposed another string- analysis algorithm for PHP, in which built-in string operations are modeled by the standard operations and a newly introduced replacement operation on automata. They also used the MONA’s automaton package to implement these operations, and their approach was
augmented with a backward analysis [35]. Kieżun, et al. [19] presented HAMPI, a string-constraint solver based on quantifier-free
bit-vector logic. HAMPI does not have any syntax to explicitly
mention positions. It is designed to treat fixed-size CFGs as well as
regular grammars as constraints. Hooimeijer and Weimer [17] presented a decision procedure for solving constraints on regular languages, and applied the proposed decision procedure to infer input
parameters that create SQL injection vulnerabilities. Fu, et al. [11]
proposed another constraint solver based on a variation of the word
equation, in which string-replacement operations were modeled using finite-state transducers. As a whole, none of these papers addressed the need for index-sensitivity, and only JSA has a form of
path sensitivity.
There are a few papers that combine string analysis with symbolic execution. Bjørner, et al. [5] proposed to use word equations for checking the feasibility of paths generated by a dynamic
6.3.2 Integration with Taint Analysis
To gain insight on the impact of PISA on the overall precision of
the security scanner, we integrated PISA/CS into the taint-analysis
algorithm used by the IBM Rational AppScan [1]. Table 4 lists the
results we obtained in terms of the number of vulnerable locations
(call sites of the sinks) reported by the scanner.
For this study, we weakened our criterion for identifying sanitizer candidates by lifting the requirement that the method’s input
be a single string argument. (We used this more liberal criterion to
guarantee that the candidate filter does not eliminate too many real
sanitizers due to their signature.) Consequently, PISA/CS detected
2423 methods as sanitizers. We evaluated four different configurations, each corresponding to a particular combination of whether or
not PISA is used and whether or not the set of predefined sanitizers
provided as part of the AppScan algorithm is used.
Note that the richer the sanitizer specification is, the more accurate the taint analysis becomes. The four configurations we defined thus allow us to appreciate the effect of using only an automatically generated specification (configuration B) compared to the
other three alternatives of (a) using no specification at all (configuration A), (b) using only a manual specification (as authored by
175
symbolic-execution engine Pex [29], while handling the same kind
of index sensitivity. In this work, the path constraints are checked
by the SMT solver Z3 [37], which is also used by Rex [31]. Saxena,
et al. [25] recently proposed Kudzu that is a symbolic execution
engine combined with a string constraint solver covering bit-vector
logic and word equation. However, in the core language models
used by the above two techniques, no string-replacement functions,
which is essential for making Web applications secure, were not
statically modeled. Shannon, et al. [26] proposed to treat methods like indexOf in their symbolic execution engine by modeling
convertion between symbolic strings and symbolic integers. Compared to their approach, our approach could be more precise, since
it simultaneously treats string constraints and index constraints without any conversions.
As for sanitizer detection, Balzarotti, et al. proposed to use a
string analysis for improving the accuracy of their sanitizer detection algorithm, but did not mention how to improve the string analysis algorithm itself except that they used taint propagation.
Various static taint analyses [22, 30, 21] were proposed to date.
However, those did not make any use of string analysis, and relied on specifications of sanitizers without any guarantee that the
sanitizers configured into the static taint analyzers were correct or
that the specifications themselves were complete. Our work can
contribute to finding application-specific sanitizers.
Checking satisfiability of an M2L formula could be implemented
by exploiting other solvers such as a SAT solver or a SMT solver
through constructing a bounded model of M2L(Str) [3] , or symbolic representations of automata [31].
[7] A. S. Christensen, A. Feldthaus, and A. Møller. JSA – the Java String
Analyzer. brics.dk/JSA, 2009.
[8] A. S. Christensen, A. Møller, and M. I. Schwartzbach. Precise
analysis of string expressions. In SAS, 2003.
[9] P. Cousot and R. Cousot. Formal language, grammar and
set-constraint-based program analysis by abstract interpretation. In
FPCA, 1995.
[10] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K.
Zadeck. Efficiently computing static single assignment form and the
control dependence graph. TOPLAS, 1991.
[11] X. Fu and C.-C. Li. A string constraint solver for detecting web
application vulnerability. In SEKE, 2010.
[12] E. Geay, M. Pistoia, T. Tateishi, B. Ryder, and J. Dolby. Modular
string-sensitive permission analysis with demand-driven precision. In
ICSE, 2009.
[13] D. Grove and C. Chambers. A Framework for Call Graph
Construction Algorithms. TOPLSA, 2001.
[14] D. Grove, G. DeFouw, J. Dean, and C. Chambers. Call graph
construction in object-oriented languages. In OOPSLA, 1997.
[15] C. Hammer, R. Schaade, and G. Snelting. Static path conditions for
java. In PLAS, 2008.
[16] J. G. Henriksen, J. L. Jensen, M. E. Jørgensen, N. Klarlund, R. Paige,
T. Rauhe, and A. Sandholm. MONA: Monadic second-order logic in
practice. In TACAS, 1995.
[17] P. Hooimeijer and W. Weimer. A decision procedure for subset
constraints over regular languages. In PLDI, 2009.
[18] M. Kay and R. M. Kaplan. Regular models of phonological rule
systems. Computational Linguistics, 20(3), 1994.
[19] A. Kieżun, V. Ganesh, P. J. Guo, P. Hooimeijer, and M. D. Ernst.
HAMPI: A solver for string constraints. In ISSTA, 2009.
[20] N. Klarlund and A. Møller. MONA Version 1.4 User Manual. BRICS,
2001. Notes Series NS-01-1. http://www.brics.dk/mona.
[21] B. Livshits, A. V. Nori, S. K. Rajamani, and A. Banerjee. Merline:
Specification inference for explicit information flow problems. In
PLDI, 2009.
[22] V. B. Livshits and M. S. Lam. Finding security vulnerabilities in java
applications with static analysis. In USENIX Security, 2005.
[23] Y. Minamide. Static approximation of dynamically generated web
pages. In WWW, 2005.
[24] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value
numbers and redundant computations. In POPL, 1988.
[25] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and
D. Song. A symbolic execution framework for javascript. In Security
and Privacy (Oakland), 2010.
[26] D. Shannon, I. Ghosh, S. Rajan, and S. Khurshid. Efficient symbolic
execution of strings for validating web applications. In DEFECTS,
2009.
[27] G. Snelting. Combining slicing and constraint solving for validation
of measurement software. In SAS, 1996.
[28] T. Tateishi, M. Pistoia, and O. Tripp. Path- and index-sensitive string
analysis based on monadic second-order logic. IBM Research Report
RT0930, 2011.
[29] N. Tillmann and J. D. Halleux. Pex: white box test generation for
.NET. In TAP, 2008.
[30] O. Tripp, M. Pistoia, S. Fink, M. Sridharan, and O. Weisman. TAJ:
Effective taint analysis of web applications. In PLDI, 2009.
[31] M. Veanes, P. de Halleux, and N. Tillmann. Rex: Symbolic regular
expression explorer. Microsoft Research Technical Report
MSR-TR-2009-137, 2009.
[32] T. J. Watson Libraries for Analysis, wala.sf.net/.
[33] G. Wassermann and Z. Su. Sound and precise analysis of web
applications for injection vulnerabilities. In PLDI, 2007.
[34] M. N. Wegman and F. K. Zadeck. Constant propagation with
conditional branches. TOPLAS, 1991.
[35] F. Yu, M. Alkhalaf, and T. Bultan. Generating vulnerability
signatures for string manipulating programs using automata-based
forward and backward symbolic analyses. In ASE, 2009.
[36] F. Yu, T. Bultan, M. Cova, and O. Ibarra. Symbolic string
verification: An automata-based approach. In SPIN Workshop, 2008.
[37] Z3, research.microsoft.com/projects/z3.
8. CONCLUSION
In this paper, we presented a novel string-analysis technique,
which enables unprecedented precision when modeling string operations, thanks to the combination of path and index sensitivity.
Our string analysis is conducted by encoding the program in M2L,
and relies on the satisfiability checking of an M2L formula. Our
technique is motivated by the need for effective security analysis
of Web applications, where a robust procedure for detecting and
verifying sanitizers is essential. Our evaluation of the proposed
approach shows it to compare favorably to the CFG-based string
analysis of [12] (discovering 128 vs. 71 sanitizers), and have a significant impact on its client taint analysis’ precision.
Acknowledgements
We thank Anders Møller and Yinnon Haviv for their invaluable
comments and advice on a preliminary version of this paper, Naoshi
Tabuchi for his help on our earlier work, the developers of MONA,
WALA, and IBM AppScan for providing some of the building blocks for this work, and the ISSTA reviewers for their helpful comments.
9. REFERENCES
[1] IBM Rational AppScan Source Edition. ibm.com/software/r
ational/products/appscan/source.
[2] Open Web Application Security Project (OWASP).
owasp.org/index.php/Category:Attack.
[3] A. Ayari and D. Basin. Bounded model construction for monadic
second-order logics. In CAV, 2000.
[4] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda,
C. Kruegel, and G. Vigna. Saner: Composing static and dynamic
analysis to validate sanitization in web applications. In Security and
Privacy (Oakland), 2008.
[5] N. Bjørner, N. Tillmann, and A. Voronkov. Path feasibility analysis
for string-manipulating programs. In TACAS, 2009.
[6] D. Brumley, H. Wang, S. Jha, and D. Song. Creating vulnerability
signatures using weakest preconditions. In CSF, 2007.
176