A General Approach for Efficiently Accelerating Software-based Dynamic Data Flow Tracking on Commodity Hardware

A"General"Approach"for"Efficiently"
Accelera3ng"So6ware8based"Dynamic"Data"
Flow"Tracking"on"Commodity"Hardware"
Kangkook"Jee""
Columbia"University"
Joint"work"with"
Georgios"Portokalidis1","Vasileios"Kemerlis1",""
Soumyadeep"Ghosh2,"David"August2,"Angelos"Keromy3s1"
"
1Columbia"University,"2Princeton"University"
1"
Talk"Outline"
•  Data"Flow"Tracking"[VEE"12]"
–  Popular"Subject"of"Security"Research"
–  Tagging"and"tracking"of"Interes3ng"data"
•  Taint"Flow"Algebra"(TFA)"[NDSS"12]"
–  Intermediate"Representa3on(IR)"for"DFT"
–  Compiler"op3miza3on"+"DFT"specific"op3miza3on"
•  ShadowReplica"
–  Decoupling"of"execu3on"and"monitoring"
–  Running"each"from"different"cores"
2"
Talk"Outline"
•  Data"Flow"Tracking"[VEE"12]"
–  Popular"Subject"of"Security"Research"
–  Tagging"and"tracking"of"Interes3ng"data"
•  Taint"Flow"Algebra"(TFA)"[NDSS"12]"
–  Intermediate"Representa3on(IR)"for"DFT"
–  Compiler"op3miza3on"+"DFT"specific"op3miza3on"
•  ShadowReplica"
–  Decoupling"of"execu3on"and"monitoring"
–  Running"each"from"different"cores"
3"
Data"Flow"Tracking"(DFT)"
•  A"great"security"tool"with"many"applica3ons"
–  Tag"input"data"and"track"them"
–  So6ware"exploits,"Informa3on"misuse"or"leakage"
malware"analysis"…"
•  Implementa3on"approaches"
–  Hardware"assisted:"Raksha,"RIFLE"…"
–  Source"code"based:"GIFT"…"
–  Binary"only:"TaintCheck,"Dytan,"Minemu,"Libd6"…"
"Binary"only"DFT:"Most"promising,"but"too"slow!"
4"
DFT:"Basic"Aspects"
•  DFT"is"characterized"by"three"aspects""
(1)  Data'Sources:"program"or"memory"loca3ons"where"data"
of"interest"enter"the"system"and"is"subsequently"tagged"
(2)  Data'tracking:"process"of"propaga3ng"data"tags"
according"to"the"program’s"seman3cs"
(3)  Data'Sinks:"program"or"memory"loca3ons"where"checks"
for"“tagged”"data"can"be"made"
Shadow Memory
(3)
(1)
Input
File
Network
Keyboard
Source
mov eax, [ebx]
(2)
mov [esi], eax
Data Tracking
Output
File
Network
5"
DFT"Opera3on"
Real Memory
•  Real"Memory"="Address"space"+"register"context"
6"
DFT"Opera3on"
Real Memory
Shadow Memory
•  Real"Memory"="Address"space"+"register"context"
•  Shadow"memory"to"track"metadata"update"""
7"
DFT"Opera3on"
dst[idx1] = src[idx0];
Real Memory
Shadow Memory
•  Memory"copy"statement"from"the"original"execu3on"
8"
DFT"Opera3on"
dst[idx1] = src[idx0];
Real Memory
t(dst[idx1]) = t(src[idx0]);
Shadow Memory
•  Memory"copy"statement"from"the"original"execu3on"
•  Corresponding"shadow"memory"update"
9"
DFT"Opera3on"
dst[idx1] = src[idx0];
mov reg0 ← [src+idx0]
Real Memory
t(dst[idx1]) = t(src[idx0]);
mov [dst+idx1] ← reg0
Shadow Memory
•  Original"opera3on"translated"into"machine"code""
•  It"requires"intermediate"register"repository"(reg0)"
10"
DFT"Opera3on"
dst[idx1] = src[idx0];
mov reg0 ← [src+idx0]
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
Real Memory
t(dst[idx1]) = t(src[idx0]);
mov [dst+idx1] ← reg0
Shadow Memory
•  Instruc3on"level"instrumenta3on"to"implement"
shadow"update"
11"
DFT"Opera3on"
dst[idx1] = src[idx0];
mov reg0 ← [src+idx0]
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
Real Memory
t(dst[idx1]) = t(src[idx0]);
mov [dst+idx1] ← reg0
mov reg0 ← [t(reg0)]
mov [t(dst+idx1)] ← mov reg0
Shadow Memory
•  2"original"instruc3ons"+"4"tracking"instruc3ons"
•  2"instrumenta3on"units"
12"
Why"So"Slow?"
•  Framework"cost"(virtualiza3on"cost)"
–  DBI,"Hypervisor"instrumenta3on"
•  DFT"cost"
–  Accesses"to"shadow"storage"
•  Naïve"Implementa3on"
–  No"understanding"of"global"context"
–  No"understanding"of"DFT"seman3cs"
13"
Talk"Outline"
•  Data"Flow"Tracking"
–  Popular"Subject"of"Security"Research"
–  Tagging"and"tracking"of"Interes3ng"data"
•  Taint"Flow"Algebra"(TFA)"
–  Intermediate"Representa3on(IR)"for"DFT"
–  Compiler"op3miza3on"+"DFT"specific"op3miza3on"
•  ShadowReplica"
–  Decoupling"of"execu3on"and"monitoring"
–  Running"each"from"different"cores"
14"
Taint"Flow"Algebra"
•  Applica3on"specific"analysis"
•  DFT"specific"analysis"
•  Integrated"with"libd)*
–  High"performance"DFT"tool""[VEE"2012]""
•  1.46x"~"8x"slowdown"(over"na3ve"execu3on)"
–  Designed"for"use"with"Pin*DBI*framework*"
–  Open"source"
•  hsp://www.cs.columbia.edu/~vpk/research/libd6"
15"
Op3mizing"DFT"
Original"
mov reg0 ← [src+idx0]
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
Instrumenta3on"
mov [dst+idx1] ← reg0
mov reg0 ← [t(reg0)]
mov [t(dst+idx1)] ← mov reg0
•  Each"Instrumenta3on"unit"requires"head/tail"instruc3ons"
•  t(*)*:"shadow"memory"access"cost""
16"
Op3mizing"DFT"
Original"
mov reg0 ← [src+idx0]
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
Instrumenta3on"
mov [dst+idx1] ← reg0
mov reg0 ← [t(reg0)]
mov [t(dst+idx1)] ← mov reg0
•  Re8locatable"
17"
Op3mizing"DFT"
mov reg0 ← [src+idx0]
mov [dst+idx1] ← reg0
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
mov reg0 ← [t(reg0)]
mov [t(dst+idx1)] ← mov reg0
•  Less"instrumenta3on"units"(2!1)"
18"
Op3mizing"DFT"
mov reg0 ← [src+idx0]
mov [dst+idx1] ← reg0
mov reg0 ← [t(src+idx0)]
mov [t(reg0)] ← reg0
mov reg0 ← [t(reg0)]
mov [t(dst+idx1)] ← mov reg0
•  Less"instrumenta3on"units"(2!1)"
•  Less"tracking"instruc3ons"(4!2)"
19"
Execu3on"Model"
•  3"Components"
–  "Profiler,"Analyzer,"DFT"Run3me"
•  Sta3c/offline"analysis"+"Dynamic"run3me"
–  Feedback"loop"
Optimized data
tracking
DFT
Runtime
Control flow
information
Dynamic
profiler
Analyzer
Basic blocks
Unprocessed
basic blocks
Static
Profiler
20"
Analyzer"
•  Taint"Flow"Algebra"
–  Represent"binary"analysis"result"
–  IR"tailored"to"capture"DFT"seman3cs"
•  Compiler"op3miza3on"to"TFA"
–  Inner"(intra)"basic"block:"
"Dead"code"elimina3on,"Algebraic"simplifica3on,"…"
–  Outer"(inter)"basic"block:""
"Data"flow"analysis"
•  DFT"specific"considera3ons"
–  Valid"loca3on"for"each"instrumenta3on"unit"
–  Number"of"instrumenta3on"units"
21"
TFA"Op3miza3on"
1:#mov#ecx,#esi
2:#movzxb#eax,#al
3:#shl#ecx,#0x5
4:#add#edx,0x1
5:#lea#esi,#ptr#[ecx+esi]
6:#lea#esi,#ptr#[eax+esi]
7:#movzxb#eax,#ptr#[edx+esi]######
8:#testb#al,#al
9:#jnzb#0xb7890200
(a) x86 instruction
•  Per"basic"block"analysis"
•  Gray"instruc3ons:"non8tracking"instruc3ons"
22"
TFA"Op3miza3on"
1:#mov#ecx,#esi
2:#movzxb#eax,#al
3:#shl#ecx,#0x5
4:#add#edx,0x1
5:#lea#esi,#ptr#[ecx+esi]
6:#lea#esi,#ptr#[eax+esi]
7:#movzxb#eax,#ptr#[edx+esi]######
8:#testb#al,#al
9:#jnzb#0xb7890200
(a) x86 instruction
1:#ecx1#:=#esi0
2:#eax1#:=#0x1#&#eax0
3:#
4:#
5:#esi1#:=#ecx1#|#esi0
6:#esi2#:=#eax1#|#esi1
7:#eax2#:=#0x1#&#[edx0+esi2]
8:#
9:#
(b) TFA transformation
•  Translated"into"TFA"
•  Input"operands,"output"operands"
23"
TFA"Op3miza3on"
1:#mov#ecx,#esi
2:#movzxb#eax,#al
3:#shl#ecx,#0x5
4:#add#edx,0x1
5:#lea#esi,#ptr#[ecx+esi]
6:#lea#esi,#ptr#[eax+esi]
7:#movzxb#eax,#ptr#[edx+esi]######
8:#testb#al,#al
9:#jnzb#0xb7890200
(a) x86 instruction
1:#ecx1#:=#esi0
2:#eax1#:=#0x1#&#eax0
3:#
4:#
5:#esi1#:=#ecx1#|#esi0
6:#esi2#:=#eax1#|#esi1
7:#eax2#:=#0x1#&#[edx0+esi2]
8:#
9:#
1:#ecx1#:=#esi0
2:
3:#
4:#
5:#
6:#esi2#:=#0x1#&#eax0#|esi0
7:#eax2#:=#0x1#&#[edx0+esi2]
8:#
9:#
(b) TFA transformation
(c) TFA optimization
•  Output"operands"are"expressed"in"terms"of"input"
operands"
•  Data"flow"analysis"to"remove"irrelevant"outputs"
24"
TFA"Op3miza3on"
1:#mov#ecx,#esi
2:#movzxb#eax,#al
3:#shl#ecx,#0x5
4:#add#edx,0x1
5:#lea#esi,#ptr#[ecx+esi]
6:#lea#esi,#ptr#[eax+esi]
7:#movzxb#eax,#ptr#[edx+esi]######
8:#testb#al,#al
9:#jnzb#0xb7890200
|
&
eax0
esi2
eax1
0x1
&
&
esi0
esi1
[edx0+esi2]
eax2
0x1
ecx1
(a) x86 instruction
DAG Representation
•  DAG"Representa3on"
•  Express"root"nodes"in"terms"of"leaf"nodes"
25"
DFT"Run3me"
•  Generate/Inject"op3mized"tracking"code"to"
the"baseline"DFT"plaworm"
–  Translate"op3mized"TFA""
•  Our"prototype"extends"libd6""
•  Code"genera3on"of"libd6/PIN8aware"C"code"
–  A"func3on"per"each"instrumenta3on"unit"
–  e.g.,"Firefox:"50K"customized"func3ons""
26"
Evalua3on"
•  Op3miza3on"schemes"
–  Code"reduc3on:"Simple"dead"code"elimina3ons"
•  Inner,"Outer"
–  Code"genera3on:"Op3mized"tracking"codes"
–  TFA"Scaser,"TFA"Aggrega3on"
Category'
CFG'
Considera9on'
TFA'
Op9miza9on'
Aggrega9on'
Code"reduc3on" Inner"
No"
No"
No"
Outer"
Yes"
No"
No"
Scaser"
Yes"
Yes"
No"
Aggrega3on"
Yes"
Yes"
Yes"
Code"
genera3on"
Op9miza9on'
schemes'
27"
Evalua3on:"SPEC"CPU2000"
12
libdft
Inner
Outer
TFA scatter
TFA aggr
10.5
Slowdown (Normalized)
9
7.5
6
4.5
3
1.5
0
crafty
eon
gap
gcc
mcf
parserperlbmk twolf
vortex
vpr average
•  CPU"intensive"workloads"
•  TFA’s"speedup"over"libd6:""on"average"1.90x"(the"largest"2.23x)"
•  ~3x"slowdown"over"the"na3ve"execu3on"
"
28"
Evalua3on:"Server"applica3ons"
(a) MySQL
libdft
Inner
Outer
TFA scatter
TFA aggr
7
24
6
5
4
3
20
16
12
8
2
1
libdft
Inner
Outer
TFA scatter
TFA aggr
28
Slowdown (normalized)
Slowdown (normalized)
(b) PHP
create
alter
insert
ATIS
Test suite
4
casing
md5
sha1
average
PHPBench Benchmark
•  Mysql’s"own"benchmark"suite"(sql8bench)"and"PHP"micro"benchmark"suite"
(PHPBench)"
–  Plosed"representa3ve"subsets"
29"
Evalua3on:"Client"Applica3ons"
(a) Web site rendering
11
libdft
Inner
Outer
TFA scatter
TFA aggr
10
8
7
6
5
4
libdft
Inner
Outer
TFA scatter
TFA aggr
18
Slowdown (normalized)
9
Slowdown (normalized)
(b) Javascript
15
12
9
3
6
2
1
Gmail
NDSS
Youtube Facebook
Web site
• 
firefox
chrome
Browser
Rendering"measurement"for"Alexa’s"Top"500"sites"and"NDSS"2012"site"
–  For"Firefox"web8browser"
• 
Dromaeo"(hsp://www.dromaeo.com)"Javascript"benchmark"suite"
–  For"Firefox"and"Google"Chrome"web8browser"
30"
Talk"Outline"
•  Data"Flow"Tracking"[VEE"12]"
–  Popular"Subject"of"Security"Research"
–  Tagging"and"tracking"of"Interes3ng"data"
•  Taint"Flow"Algebra"(TFA)"[NDSS"12]"
–  Intermediate"Representa3on(IR)"for"DFT"
–  Compiler"op3miza3on"+"DFT"specific"op3miza3on"
•  ShadowReplica"
–  Decoupling"of"execu3on"and"monitoring"
–  Running"each"from"different"cores"
31"
ShadowReplica:"Parallelized"DFT"
•  Basic"idea"
–  Offload"tracking"overhead"to"
the"separate"core"
Application
•  Challenge"
–  Communica3on"cost"is"too"
high"
–  Synchroniza3on"
•  Approach"
–  Minimize"the"overhead"to"the"
primary"
–  Maximize"the"u3liza3on"of"the"
secondary"
Collect
Events
Ring"buffer"
Enqueue
Dequeue
Analyze
32"
ShadowReplica:"Execu3on"Model"
•  The"block"0x1234""
The"Primary"
Block"ID:"
0x1234"
–  Executed"from"the"primary""
–  Monitored"from"the"
secondary"
Application
Collect
Events
Ring"buffer"
Enqueue
Dequeue
Analyze
The"Secondary"
=="Block"ID:"0x1234=="
ecx1":="esi0"
esi1":="[ebp0"–"0x4"]"
edx1":="[ebp0"–"0x14]"
esi2":="0x1"&"eax0"|"esi1"
eax2":="0x1"&"[edx1"+"esi2]"
33"
ShadowReplica:"Run3me"Informa3on"
The"Primary"
Block"ID:"
0x1234,""
[ebp0"8"0x4],
[ebp0"–"0x14],""
[edx1"+"esi2]"
•  The"Primary"enqueues"
–  Effec3ve"Memory"
Addresses"
–  Block"Iden3fiers"
Application
Collect
Events
Ring"buffer"
Enqueue
Dequeue
Analyze
The"Secondary"
=="Block"ID:"0x1234"=="
ecx1":="esi0"
esi1":="[ebp0'–'0x4']'
edx1":="[ebp0'–'0x14]'
esi2":="0x1"&"eax0"|"esi1"
eax2":="0x1"&"[edx1'+'esi2]'
34"
ShadowReplica:"Run3me"Informa3on"
The"Primary"
Block"ID:"
0x1234,""
[ebp0"8"0x4],
[ebp0"–"0x14],""
[edx1"+"esi2]"
•  The"Primary"enqueues"
–  Effec3ve"Memory"
Addresses"
–  Block"Iden3fiers"
Application
Collect
Events
Ring"buffer"
Enqueue
Dequeue
Analyze
The"Secondary"
=="Block"ID:"0x1234"=="
ecx1":="esi0"
esi1":="[ebp0'–'0x4']'
edx1":="[ebp0'–'0x14]'
esi2":="0x1"&"eax0"|"esi1"
eax2":="0x1"&"[edx1'+'esi2]'
35"
Op3mizing"Effec3ve"Addresses"
•  Intra"Block"Op3miza3on"
–  Linear"Lower"Bound"
•  3"Register"Variables:""
eax0,"eax1,"ebx0"
•  4"Memory"Operands:""
One"redundancy"
Block"ID:"0x1234"
![eax0!+!1!×!ebx0]
![eax1!+!2!×!ebx0]
![ebx0!+!4!×!eax1]
![eax0!+!2!×!ebx0]!
36"
Op3mizing"the"Primary:""
Effec3ve"Addresses"
Block"ID:"0x1234"
•  Intra"Block"Op3miza3on"
![eax0!+!1!×!ebx0]
![eax1!+!2!×!ebx0]
![ebx0!+!4!×!eax1]
![eax0!+!2!×!ebx0]!
–  Linear"Lower"Bound"
•  3"Register"Variables:""
eax0,"eax1,"ebx0"
•  4"Memory"Operands:""
One"redundancy"
•  Inter"Block"Op3miza3on"
–  Data"Flow"Analysis"
–  Use8Def"equa3on"based"on"
Inferablity*
BB1
[eax^ + ebx^]
[ebx^]
[eax^]
BB2
[eax0 + ebx0]
*"‘^’"indicates"top8most"version"for"the"register"variable"in"the"block"
BB0
37"
Op3mizing"the"Primary":"
Block"Iden3fiers"
•  Minimize"Block"ID"trace"
leveraging"
Block 0
10000
Block 1
10000
Block 2
9999
Block 3
1
Block 4
–  Control"Flow"Graph"(CFG)"""
–  Execu3on"count"
•  Only"Block"4"require"
instrumenta3on"
–  1"instrumenta3on"&"1"
execu3on"
–  Rest"can"be"restored"
from"the"secondary"
38"
Op3mizing"Block"Iden3fiers"
•  Minimize"Block"ID"trace"
leveraging"
Block 0
10000
Block 1
10000
Block 2
9999
Block 3
1
Block 4
–  Control"Flow"Graph"(CFG)"""
–  Execu3on"count"
•  Only"Block"4"require"
instrumenta3on"
–  1"instrumenta3on"&"1"
execu3on"
–  Rest"can"be"restored"
from"the"secondary"
39"
Preliminary"Evalua3on:""
Bzip2"Compression"
7.73x
7
Slowdown
6
5
4.11x
4
3.45x
2.5x
3
2.2x
2
1.27x
1
libdft
TFA
NO_OPT
INTRA
INTER
•  The"primary"performs"Bzip2"
compression"against"a"Linux"
kernel"
•  The"cost"of"dumping"EAes"
and"Block"trace"from"the"
primary"
•  Instrumenta3on"for"the"
primary"is"implemented""
using"PIN"DBI"
NULL_PIN
Implementations
•  NO_OPT:"All"EAes"and"BB"traces"
•  INTRA:"Intra"Block"op3miza3on"
•  INTER:"Intra"+"Inter"Block"op3miza3on"
40"
Candidate"Analyses"for"ShadowReplica"
Analyses"
Shadow"Memory" Data"Dependency"
Checking"
Data"Flow"Tracking"
✓"
✓"
✓"
MemCheck"
✓"
✓"
✓"
LockCheck"
✓"
✗"
✓"
Method"Coun3ng"
✗"
✗"
✗"
Call"Graph"Profiling"
✗'
✗'
✗'
Path"Profiling"
✗"
✗"
✗"
Cache"Simula3on"
✓"
✗"
✗"
•  Each"analysis"have"different"performance"implica3ons"
–  Instrumenta3on"frequencies,"size"of"analysis"rou3nes""
•  Data"dependency"across"mul3ple"updates"inhibits"the"paralleliza3on"of"
analysis"
•  Checking"opera3ons"requires"synchroniza3on"between"an"applica3on"and"an"
41"
analysis"
Conclusion"
•  Current"binary8only"DFT"implementa3ons"are"sub"
op3mal"
–  No"considera3on"for"DFT"seman3cs"
–  No"considera3on"for"global"context"
•  Proposed"a"novel"approaches"scheme"that"
–  Combines"sta3c"and"dynamic"analysis"
–  Segregates"execu3on"and"tracking"logic"
•  Huge"Speedups"for"real8world"applica3ons"
–  TFA:"~2x"
–  Shadow"Replica:"Constant"upper"bound"to"the"
overhead"
42"