
Syntactic Difference Based Approach
for NTCIR-9 RITE Task
Yuta Tsuboi
Hiroshi Kanayama
Masaki Ohno
IBM Research - Tokyo, IBM Japan, Ltd.
1623-14 Shimotsuruma, Yamato-shi
Kanagawa, Japan
{yutat,hkana,moono}@jp.ibm.com
Yuya Unno
Preferred Infrastructure
2-40-1-4F Hongo, Bunkyo-ku
Tokyo, Japan
[email protected]
ABSTRACT
This paper describes the IBM team's approach for the textual entailment recognition task (RITE) in NTCIR-9 [10] with experimental results for four Japanese subtasks: BC, MC, EXAM, and RITE4QA. To tackle the data set with complicated syntactic and semantic phenomena, the authors used a classification method to predict entailment relations between two different texts. These features were used for classification: (1) Tree edit distance and operations, (2) Word overlap ratios and word pairs, (3) Sentiment polarity matching, (4) Character overlap ratios, (5) Head comparisons, (6) Predicate-argument structure matching, and (7) Temporal expression matching. Feature (1) reflects the operations in the edit distance computation between the text and the hypothesis, which can capture the syntactic differences between two sentences. In the RITE task, Feature (1) is effective for the MC subtask and Feature (7) is effective for the EXAM subtask.

Keywords
NTCIR, RITE, entailment, machine learning, syntax, tree edit distance

Team Name: IBM
Subtasks: Japanese BC, MC, EXAM, and RITE4QA
External Resources: Bunrui-goi-hyo and Shibaki's ontology

1. INTRODUCTION
Textual entailment is a very complex natural language phenomenon. Although supervised classifiers have been used for this task, textual entailment recognition requires detecting complex syntactic and semantic relations between two different texts. To exploit these relations, we train classifiers in the pair feature space [11], in which a text pair is represented. In Section 2, we explain our supervised machine learning approach. Since the number of examples is usually limited in this task, we propose the conversion of the labeled data between different entailment tasks. In Section 3, we propose to use the operations in the tree edit distance computation between two dependency parse trees as pair features to exploit the syntactic differences of a sentence pair. We also employed several other kinds of pair features, from lexical similarity to the matching of temporal expressions. We explain these pair features in Section 4. We experiment with our approach on the NTCIR-9 RITE task in Section 5, and we discuss the experimental results in Section 6. We show that the use of operations based on edit distance was effective for the MC subtask and that the temporal expression matching worked well for the EXAM subtask. In Section 7, we conclude our work and discuss future directions.

2. SUPERVISED MACHINE LEARNING
Since the BC, EXAM, and RITE4QA subtasks are binary classification problems (true or false entailment) and the MC subtask is a multi-class classification problem, a supervised machine learning approach is employed in these tasks. A pair of two different sentences (denoted as S and T¹) is represented in a pair feature space [11] and a logistic regression (LR) model is trained using labeled examples.

¹ S and T correspond to T1 and T2 in the RITE data set, respectively.

Let x ∈ X be the feature representation of a pair of S and T, y ∈ Y be an entailment label of a label set Y, and ϕ(x, y) : |X| × |Y| be the cartesian product of x and a label assignment vector. The LR model represents a conditional probability P(y|x) in a log-linear form:

    Pθ(y|x) = (1/Z) exp(θ⊤ϕ(x, y)),    (1)

where θ is the parameter vector of the LR model and u⊤v denotes the inner product of the vectors u and v. Note that the denominator is the partition function:

    Z = Σ_{y∈Y} exp(θ⊤ϕ(x, y)).

Using training data E ≡ {(x, y)}, the parameter θ can be estimated by the maximization of the regularized log-likelihood:
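As a concrete illustration of Equation (1) and its partition function Z, the conditional probability can be computed as below. This is a minimal numpy sketch with hypothetical feature values, not the actual RITE feature map ϕ(x, y):

```python
import numpy as np

def log_linear_probs(theta, phi_xy):
    """P(y|x) for a log-linear model: one joint feature vector phi(x, y) per label.

    theta:  parameter vector, shape (d,)
    phi_xy: matrix whose rows are phi(x, y) for each y in Y, shape (|Y|, d)
    """
    scores = phi_xy @ theta      # theta^T phi(x, y) for each candidate label y
    scores -= scores.max()       # stabilize the exponentials
    exps = np.exp(scores)
    return exps / exps.sum()     # divide by the partition function Z

# Toy example: 2 labels, 3-dimensional joint features (all values hypothetical).
theta = np.array([1.0, -0.5, 2.0])
phi = np.array([[1.0, 0.0, 1.0],   # phi(x, y=Y)
                [0.0, 1.0, 0.0]])  # phi(x, y=N)
p = log_linear_probs(theta, phi)
```

Subtracting the maximum score before exponentiating does not change the resulting distribution but avoids overflow for large θ⊤ϕ(x, y).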
    Σ_{(x,y)∈E} ln Pθ(y|x) − ||θ||² / (2σ²),    (2)

where the final term is a Gaussian prior on θ with mean 0 and variance σ². In the experiments, θ was optimized by the Newton-CG method [5], and the hyper-parameter σ was determined by grid search based on 5-fold cross-validation (CV).

Since the amount of training data is limited in the RITE task, we convert the training data of the MC subtask into additional training data for the BC subtask, and vice versa. For this conversion, we assumed the interchangeability of the labels between the BC and MC subtasks. By the data conversion from the MC subtask to the BC subtask, two examples for the BC subtask are generated from one labeled example of the MC subtask, since MC labels convey richer entailment information than BC labels (see Table 1(a)). We denote the combined training data of the BC data and the examples generated from the MC data as BC+MC' data. By the data conversion from the BC subtask to the MC subtask, one example with ambiguous labels for the MC subtask is generated from one labeled example of the BC subtask (Table 1(b)). We denote the combined training data of the MC data and the examples generated from the BC data as MC+BC' data.

Table 1: Data conversion. The left columns denote the original relations between S and T, and the right columns denote the converted relations. The relation symbols Y and N denote true and false entailment; F, R, and B denote forward, reverse, and bidirectional entailment; C and I denote contradiction and independence, respectively.

(a) From MC data to BC' data
  MC relation      BC relations
  F  (S → T)       S → T: Y,  T → S: N
  R  (S → T)       S → T: N,  T → S: Y
  B  (S → T)       S → T: Y,  T → S: Y
  C  (S → T)       S → T: N,  T → S: N
  I  (S → T)       S → T: N,  T → S: N

(b) From BC data to MC' data
  BC relation      MC relations
  Y  (S → T)       F, B
  N  (S → T)       R, C, I

To deal with the label ambiguities in the BC' data, we extend the LR objective function (2). Let L ⊆ Y be a set of label candidates. Using the training data E′ ≡ {(x, L)} with ambiguous labels, we maximize the regularized marginal log-likelihood:

    Σ_{(x,L)∈E′} ln Σ_{y∈L} Pθ(y|x) − ||θ||² / (2σ²).

Although this is a non-convex optimization problem, we can still use the Newton-CG method to find a local optimum. Note that, for the MC+BC' data, we use a subset of the MC data as the validation set of CV.

In the sequel, we describe the feature representation of a pair of S and T. Let f(S, T) be a map function from a pair of S and T to a feature space x ∈ X.

3. SYNTACTIC DIFFERENCE
3.1 Edit distance
In this work, the syntactic differences between two sentences, S and T, are represented by the edit operations used in the tree edit distance calculation on their dependency trees. The syntactic tree is constructed by the Japanese syntactic parser [4], which uses both a hand-crafted grammar and a statistical model created from a tagged corpus. The parser produces dependency structures between bunsetsus, so that S and T are represented as trees consisting of bunsetsus as nodes, s ∈ s and t ∈ t, respectively.

We first explain the notation here. Let s = (s1, · · · , s|s|) be the sequence of bunsetsu phrases of length |s| in S and t be that of T, where s and t consist of the morpheme sequences s = (ms1, · · · , ms|s|) and t = (mt1, · · · , mt|t|), respectively.

The tree edit distance between two trees S and T is calculated based on edit operations, each of which is one of insertion, deletion, or substitution, in a similar way to the edit operations for strings. We define a cost function for an edit operation, γ(s, t), on substituting s with t. The costs for the deletion and insertion operations can be denoted as special cases: γ(s, ϵ) for the cost of deleting a node s and γ(ϵ, t) for the cost of inserting a node t. An edit distance mapping M ⊆ s × t represents a set of edit operations. Let D ⊆ s and I ⊆ t be the sets of nodes not appearing in M due to their existence in only one tree. Here the edit distance δ(s, t) is defined as the minimum cost mapping [1]:

    δ(s, t) = min_M [ Σ_{(s,t)∈M} γ(s, t) + Σ_{s∈D} γ(s, ϵ) + Σ_{t∈I} γ(ϵ, t) ].

Zhang and Shasha's algorithm [12] can calculate the edit distance δ(s, t) between s and t in O(|s|²|t|²) time and space. The edit distance computation between two syntactic trees corresponds to the word alignment between the two sentences, because the substitution from s to t can be regarded as the alignment of two nodes between s and t. Such operations on syntactic trees are advantageous compared to the word-to-word alignment between sentences as sequences of bunsetsus. For example, suppose the sentences S and T are 彼が 先生を 愛する (`He loves a teacher') and 彼は 先生だ (`He is a teacher'), respectively. When the edit distance is calculated on the sequences of bunsetsus, the edit operations with the lowest cost should be (1) substitution of 彼は with 彼が, (2) substitution of 先生を (`teacher-ACC') with 先生だ (`be a teacher'), and (3) deletion of 愛する (`love'), under a rule that the replacement of functional words can be done with a small cost; then each pair of the two common content words 彼 (`he') and 先生 (`teacher') is connected in the two sentences. However, the semantic roles of 先生 (`teacher') are different in these two sentences, so this alignment is not desirable to capture the difference between the two sentences.

[Figure 1: Edit distance minimization on trees. The figure shows the dependency trees of S, 彼が (`He-NOM') and 先生を (`teacher-ACC') depending on 愛する (`love'), and of T, 彼は (`He-TOPIC') depending on 先生だ (`be teacher').]
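The conversion rules of Table 1 can be sketched as lookup tables. This is a minimal sketch under the table's label definitions; identifiers such as `mc_to_bc_examples` are our own, not from the paper:

```python
# Label conversion between the MC and BC subtasks (Table 1).
# MC labels: F (forward), R (reverse), B (bidirectional), C (contradiction),
# I (independence).  BC labels: Y / N.

# Table 1(a): one MC example yields two BC' examples, one for each
# pair order, (S, T) and (T, S).
MC_TO_BC = {
    "F": ("Y", "N"),
    "R": ("N", "Y"),
    "B": ("Y", "Y"),
    "C": ("N", "N"),
    "I": ("N", "N"),
}

# Table 1(b): one BC example yields one MC' example whose label is an
# ambiguous candidate set L, used by the marginal log-likelihood objective.
BC_TO_MC = {
    "Y": {"F", "B"},
    "N": {"R", "C", "I"},
}

def mc_to_bc_examples(s, t, mc_label):
    """Generate the two BC' examples ((pair, label), ...) from one MC example."""
    st_label, ts_label = MC_TO_BC[mc_label]
    return [((s, t), st_label), ((t, s), ts_label)]
```

Note that C and I both map to (N, N), so the BC' labels alone cannot distinguish contradiction from independence; this is why the reverse direction produces a candidate set rather than a single label.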
In contrast, when the edit distance is calculated on the dependency trees shown in Figure 1, the optimal operations are (1) substitution of 彼が (`he-NOM') with 彼は (`he-TOPIC'), (2) deletion of 先生を, (3) deletion of 愛する, and (4) insertion of 先生だ, since the word 先生 appears in different positions in S and T, and thus the tree of T cannot be generated with a substitution operation of the nodes with 先生. This calculation successfully avoids the connection between the two nodes with the word 先生, which behaves differently in the two sentences.

3.2 Cost functions
Here we consider four types of cost functions: WO, BGH, HDM, and Ontology. These functions calculate the costs of substitution operations differently, but the costs for deletion and insertion are set to one, i.e. γ(s, ϵ) = 1 and γ(ϵ, t) = 1, in all cases. The feature representation based on the optimal edit operations is shown in Section 4.

Jaccard distance metrics using word overlap (WO).
One of the cost functions is defined as the Jaccard distance between the morpheme sets of two bunsetsus s and t:

    γ(s, t) = 1 − |s ∩ t| / |s ∪ t|.

Semantic distance metrics using Bunrui-goi-hyo (BGH).
Another method to calculate the cost is based on the semantic similarity between a pair of bunsetsus. Here we use the Japanese thesaurus, the extended version of Bunrui-goi-hyo [8], which assigns semantic codes to around 230,000 words. We focus on the head content words when we calculate the semantic similarity between a pair of bunsetsus. A head content word is defined as the rightmost content word in a bunsetsu, and a content word is either a noun, a verb, an adjective, an adverb, an interjection, or their variants. Given two bunsetsus s and t, we compare the codes of their head content words. Let cs and ct be the Bunrui-goi-hyo codes and common(cs, ct) be the common depth in the thesaurus tree, e.g. common(cs, ct) = 3 when cs = `3630' and ct = `36352'. We define a cost function as follows, giving a smaller cost for more similar bunsetsus:

    γ(s, t) = 1 / (1 + common(cs, ct)).

Heuristic distance metrics (HDM).
Another method to measure the tree edit distance is a combination of the BGH metrics and other conditions, comparing head content words and parts-of-speech. We call this method HDM (Heuristic Distance Metrics), which is calculated as follows:

    γ(s, t) = { 0                              (same word)
              { 0.7 − 0.05 · common(cs, ct)    (same POS)
              { 1.0                            (otherwise)

where `same word' means that the canonical forms of the head content words of s and t are the same, `same POS' means that the parts-of-speech of the head content words are the same, and common() is as in the BGH part.

Semantic distance metrics using an ontology (Ontology).
We also attempt to measure semantic similarity using another resource and combine it with HDM to define another cost function. To measure the semantic similarity, we use an ontology automatically generated from Wikipedia with Shibaki's method [9]. It defines is-a relations among about 730,000 words. It has up-to-date knowledge since it was constructed from Wikipedia, which is actively updated. Given a pair of bunsetsus s and t, we focus on each head content word and calculate their semantic similarity with the shortest path length in the ontology. Let spl(s, t) be a function which returns the shortest path length in the ontology between the head content words of the two input bunsetsus; then the following condition is newly inserted after the first condition in the cost function of the HDM method defined above. Note that it is valid only when the head content words of both bunsetsus are found in the ontology:

    γ(s, t) = 0.7 − 0.5 · 1/spl(s, t)    (a path is found)

4. PAIR FEATURES
In this section, we describe the representation of the edit operations and the other features used in the RITE task. The names in parentheses denote the IDs of the feature sets which are referred to in the experimental results.

Edit distance and operations (EDO).
The edit distance and operations of Section 3 are represented as elements of a feature vector. We use the normalized value of the edit distance, δ(s, t)/max(|s|, |t|), ranging from 0 to 1. Although the value of the tree edit distance is used as a feature in previous work [6], we also use the edit operations themselves as pair features. For each bunsetsu in D and I, we use the base form and part-of-speech (POS) tag of each morpheme, and those of the last morpheme, as a representation of the bunsetsu. For each pair of bunsetsus in M, we use 1) the sequence pair of the base forms, 2) the sequence pair of POS tags, 3) the pair of the base forms of the last morphemes, and 4) the pair of the POS tags of the last morphemes. Note that the granularity of the POS tags described above is coarse, such as noun or verb. We also used finer POS tags, which are denoted as POS fine in the experimental results.

The value of an edit operation feature is either the number of the edit operations, denoted as Count, or a binary value, denoted as Binary. To make sure its value ranges from 0 to 1, Count is divided by max(|s|, |t|). Since the edit operation features vary depending on the cost functions defined in Section 3, we also employed combinations of multiple feature sets based on different cost functions to represent a pair of s and t.
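The substitution cost functions above can be sketched as follows. This is a minimal sketch in which common() is assumed to be the common code-prefix depth of the two thesaurus codes, consistent with the `3630'/`36352' example; the deletion and insertion costs stay fixed at one as stated:

```python
def cost_wo(s_morphs, t_morphs):
    """WO: Jaccard distance between the morpheme sets of two bunsetsus."""
    s, t = set(s_morphs), set(t_morphs)
    return 1.0 - len(s & t) / len(s | t)

def common_depth(cs, ct):
    """BGH common(): length of the common prefix of two thesaurus codes."""
    depth = 0
    for a, b in zip(cs, ct):
        if a != b:
            break
        depth += 1
    return depth

def cost_bgh(cs, ct):
    """BGH: smaller cost for a deeper common code prefix."""
    return 1.0 / (1.0 + common_depth(cs, ct))

def cost_hdm(head_s, head_t, pos_s, pos_t, cs, ct):
    """HDM: 0 for the same head word, a BGH-discounted cost for the same POS."""
    if head_s == head_t:                 # same canonical head content word
        return 0.0
    if pos_s == pos_t:                   # same POS of the head content words
        return 0.7 - 0.05 * common_depth(cs, ct)
    return 1.0                           # otherwise
```

With cs = '3630' and ct = '36352' the common depth is 3, so the BGH cost is 1/(1+3) = 0.25, matching the paper's example of common(cs, ct) = 3.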
Overlap ratios of words and word pairs (Word).
Let mS and mT be the sets of content words in S and T, respectively. The value of the word overlap ratio feature is defined as |mS ∩ mT| / |mT|. The word pair features are all the combinations (ms, mt), ms ∈ mS, mt ∈ mT, of the content words. We use only the word pair features appearing more than once in the training data.
Table 2: Examples of the `Sentiment' feature set.

  S: PET(ポジトロン断層撮影)検査は、肺がん、大腸がん、食道がんなど、ほとんどのがんの診療に有効とされている。
     `The PET (positron emission tomography) is believed to be effective for the care of most types of cancers such as ...'
  T: PETはがんの診断に役立っている。
     `PET helps the care of cancers.'
  features: fpol = (+, +), fsame = 1

  S: 失われた10年は立ち遅れた反省や経験が生かされ、無駄でなかった。
     `The Lost Decade was not wasted because ...'
  T: 失われた10年は無駄だった。
     `We learned nothing from the Lost Decade.'
  features: fpol = (+, −), fdiff = 1, fopp = 1

  S: 日本で臓器移植法が施行されて7年以上になる。
     `The organ transplantation law has been effective for 7 years in Japan.'
  T: 日本で臓器移植法は施行された。
     `The organ transplantation law became effective in Japan.'
  features: fpol = (+, 0), fdiff = 1

Table 3: Examples of sentence pairs that activate the feature fPAS.

  S: スーザン・トレスさんは極めて悪性度の高いがんの一種メラノーマが脳に広がり、脳死になった。
     `Ms. Susan Torres became brain dead due to melanoma ...'
  T: スーザン・トレスさんは脳死になった。
     `Ms. Susan Torres became brain dead.'

  S: 山田洋次監督は男泣きの場面を作るのがうまい。
     `Director Yoji Yamada is good at making scenes of men's weeping.'
  T: 山田洋次は映画監督です。
     `Yoji Yamada is a film director.'
Polarity matching with sentiment detection (Sentiment).
We also introduced a feature to confirm the semantic orientation of the two sentences. Intuitively, if S entails T and S has a positive or negative polarity, T is likely to have the same polarity. We used an existing sentiment detection engine [3], which detects positive or negative clauses in a sentence with high precision. The engine handles not only subjective utterances but also facts that suggest positive or negative attributes. In most cases no polarity is detected, and thus this feature is expected to mainly work as a negative clue for the entailment when S and T have opposite polarities. This feature set consists of the following features:

  fpol:  The combination of the polarities of S and T
  fsame: Whether S and T have the same polarity (e.g. positive vs. positive)
  fdiff: Whether S and T have different polarities (e.g. positive vs. neutral)
  fopp:  Whether S and T have opposite polarities (e.g. positive vs. negative)

Table 2 shows examples of the assignment of `Sentiment' features for some sentence pairs.

Overlap ratios of characters (Char).
Let cS and cT be the sets of characters in S and T, respectively. The value of the character overlap ratio feature is defined as |cS ∩ cT| / |cT|.

Head comparisons (Head).
The head bunsetsu of a sentence, the rightmost bunsetsu in the Japanese language, usually conveys the main concept of the sentence, and therefore we use a feature to examine whether the head bunsetsus of the two sentences are the same. When two bunsetsus are compared, both words are converted to their canonical forms, and differences in some symbols such as commas are ignored.

Fulfillment tests with predicate-argument structures (PAS).
The syntactic tree is converted to a set of predicate-argument structures to examine whether T covers all the information in S. Intuitively, if S entails T, T is likely to be covered. A predicate-argument structure used here is either of these two types:

  predicate type: a bunsetsu led by a verb or an adjective as a predicate, and zero or more postpositional phrases with a case marker as arguments
  modification type: a modifier bunsetsu and a modifiee bunsetsu, such as adverbial modification

For example, the sentence (3) is converted to the following set of predicate-argument structures. (P1) is a predicate type, and (P2) and (P3) are examples of the modification type.

(3) 彼は 大きな 駅へ ゆっくり 行った。 (`He went to a big station slowly.')

(P1) 行く (彼, 駅) (`go (he, station)')
(P2) 大きな ⟨駅⟩ (`big ⟨station⟩')
(P3) ゆっくり ⟨行く⟩ (`slowly ⟨go⟩')

We use a feature fPAS for the fulfillment test. fPAS = 1 only if all of the predicate-argument structures in T are subsumed by one of those in S, and otherwise fPAS = 0. "p1 subsumes p2" means that p1 and p2 have the same predicate and all of the arguments of p2 also appear in p1. For instance, a predicate-argument structure 行く (駅) is subsumed by (P1) in the above example. Table 3 exemplifies sentence pairs in which fPAS = 1. This feature is expected to be a strong clue for the entailment.

To improve the coverage of this subsumption, the introduction of hyponym and hypernym relations using WordNet was attempted; however, few pairs of nouns in S and T matched the relations, so we gave up using WordNet for this feature.
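The fulfillment test can be sketched as follows, assuming a simplified representation of a predicate-argument structure as a (predicate, argument-set) pair; the structures for sentence (3) follow the example above, and the modification type fits the same shape with a single "argument":

```python
def subsumes(p1, p2):
    """p1 subsumes p2: same predicate, and every argument of p2 appears in p1."""
    pred1, args1 = p1
    pred2, args2 = p2
    return pred1 == pred2 and args2 <= args1

def f_pas(pas_s, pas_t):
    """fPAS = 1 only if every structure in T is subsumed by one in S."""
    return int(all(any(subsumes(p1, p2) for p1 in pas_s) for p2 in pas_t))

# The example from the text: S = 'He went to a big station slowly.'
pas_s = [("行く", frozenset({"彼", "駅"})),   # (P1) go(he, station)
         ("大きな", frozenset({"駅"})),        # (P2) big <station>
         ("ゆっくり", frozenset({"行く"}))]    # (P3) slowly <go>
pas_t = [("行く", frozenset({"駅"}))]          # go(station), subsumed by (P1)
```

Here `f_pas(pas_s, pas_t)` is 1 because 行く (駅) is subsumed by (P1), while swapping the arguments gives 0: the richer structure of S is not covered by T.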
Matching of temporal expressions (Temporal).
As shown by the success of the GeoTime task in the previous NTCIR [2], temporal information has an important role in information retrieval, and many temporal expressions actually appear in the development set of the data, especially in the EXAM subtask. We therefore added a feature set to compare the temporal expressions in the sentence pairs, with two features:

  fmatch:   Temporal expressions appear in both S and T, and at least one pair has an overlap
  funmatch: Temporal expressions appear in both S and T, and none of the pairs has an overlap

where the overlap is determined based on ranges of years, by using the following patterns:

  Year: N 年 (`the year of N') is converted to the year range [N, N]. The Japanese calendar scheme is also covered; for example, 昭和50年 is converted to [1975, 1975].
  Decade: N 年代 (`the decade from N') is converted to the year range [N, N + 9]. The suffixes 前半 (`the first half') and 後半 (`the latter half') are also considered; for example, 1920年代前半 is converted to the year range [1920, 1924].
  Century: N 世紀 (`the Nth century') is converted to the year range [100(N − 1) + 1, 100N]. The suffixes 前半, 後半 and some other variations such as 初頭 (`beginning') reduce the width of the year range.

For example, when sentence S includes the temporal expression 1620年代 (`1620s') and T has 17世紀前半 (`the first half of the 17th century'), their ranges are [1620, 1629] and [1601, 1650], respectively, and thus the feature fmatch is set to true since the year ranges overlap.

Table 5: The confusion matrices in the BC, EXAM, and RITE4QA tasks.

(a) BC
               Correct
  System     Y     N
    Y      137   125
    N      113   125

(b) EXAM
               Correct
  System     Y     N
    Y      103    45
    N       78   216

(c) RITE4QA
               Correct
  System     Y     N
    Y       65   521
    N       41   337

Table 6: The confusion matrix in the MC task.

               Correct
  System     F    R    B    C    I
    F       87    6   21   30   40
    R        7   89   20    9   16
    B        7   12   30   16    6
    C        4    0    1    5    4
    I        5    3    3    5   14
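The year-range patterns can be sketched as follows. This is a simplified sketch covering only the patterns listed above; the Japanese-calendar handling is limited to the Showa era as an illustration (Showa N corresponds to the year 1925 + N), and 初頭-style variations are omitted:

```python
import re

def year_range(expr):
    """Convert a temporal expression to an inclusive year range (lo, hi)."""
    m = re.fullmatch(r"昭和(\d+)年", expr)            # Showa era: N -> 1925 + N
    if m:
        y = 1925 + int(m.group(1))
        return (y, y)
    m = re.fullmatch(r"(\d+)年代(前半|後半)?", expr)   # decade, optional half
    if m:
        lo = int(m.group(1))
        if m.group(2) == "前半":
            return (lo, lo + 4)
        if m.group(2) == "後半":
            return (lo + 5, lo + 9)
        return (lo, lo + 9)
    m = re.fullmatch(r"(\d+)世紀(前半|後半)?", expr)   # century, optional half
    if m:
        n = int(m.group(1))
        lo, hi = 100 * (n - 1) + 1, 100 * n
        if m.group(2) == "前半":
            hi = lo + 49
        elif m.group(2) == "後半":
            lo = hi - 49
        return (lo, hi)
    m = re.fullmatch(r"(\d+)年", expr)                # plain Gregorian year
    if m:
        y = int(m.group(1))
        return (y, y)
    return None

def overlaps(r1, r2):
    """fmatch-style test: two inclusive year ranges intersect."""
    lo1, hi1 = r1
    lo2, hi2 = r2
    return lo1 <= hi2 and lo2 <= hi1
```

On the example from the text, `year_range("1620年代")` gives (1620, 1629) and `year_range("17世紀前半")` gives (1601, 1650), so `overlaps` returns true and fmatch fires.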
5. EXPERIMENTAL RESULTS
We employ several combinations of the cost functions for the edit distance described in Section 3 and the pair feature sets described in Section 4. We basically select the feature sets for the formal-run submissions by the average accuracies on 5-fold cross-validation (CV) using the training data.

Table 4 shows the average accuracies on CV and the formal-run results, and Tables 5 and 6 show the confusion matrices of the systems using the feature set with the best CV accuracy. Although we evaluated 280 feature sets for all the subtasks, because of the limitation of space we only report the results of the submitted systems for the formal run, all of the feature sets using the HDM cost function, and the best feature sets at any performance measure in Table 4, and we report the confusion matrices of the feature set which marks the best CV accuracy in each subtask. In Table 4, the submitted system outputs are denoted by superscripts on the accuracy values. Since we found and fixed bugs after the formal run for the BC and MC subtasks, we report the revised scores obtained after the bug fix. The top column names denote the test data, the second column names denote the training data for the LR models, and the third column names denote the performance measures, where CV stands for the average accuracy (%) on 5-fold cross-validation and AC stands for the accuracy (%) on the formal-run test data.

5.1 BC subtask
In terms of the formal-run accuracy, the LR models trained on the BC data using the edit distance and operations (EDO) based on the BGH cost function together with most of the other pair features achieve relatively high accuracy, and the best system achieves 56.0% using binary-valued edit operation features. The LR models trained on the BC data perform better than the LR models trained on the BC+MC' data. Compared with the best system on the formal-run data, the best feature sets of CV on the training data show slightly worse performance (52.4% for the BC data and 51.6% for the BC+MC' data).

5.2 MC subtask
The LR models using the EDO features show much better performance than the LR model using all the features except EDO. For the LR models trained on the MC data, the formal-run accuracy averaged over all of the feature sets that include EDO is 47.1%, compared with 35.9% for the systems without EDO. The best LR model achieves 51.6% accuracy on the formal-run test data. Again, the LR models trained using the MC data perform better than the LR models trained using the MC+BC' data.

5.3 EXAM subtask
Most of the LR models using the matching of temporal expressions (Temporal) perform better than those not using Temporal. The best LR models achieve 72.6% accuracy on the formal-run test data.

5.4 RITE4QA subtask
The LR model trained on the BC+MC' data using all the pair features except EDO shows the best performance: Accuracy = 63.5%, TOP1 = 18.1%, and MMR5 = 29.0%. We observed an inverse or weak correlation between the CV accuracy on the BC or BC+MC' training data and the accuracy or QA performance measures on the RITE4QA formal-run data.

6. DISCUSSION
The edit distance and operations (EDO) are significantly effective features in the MC subtask. To put it more precisely, the classification of forward (F), reverse (R), and bidirectional (B) entailment using those features is more accurate than that without the edit distance and operations. Table 7 shows the confusion matrices in the MC subtask with and without the edit distance and operations. As an example, Table 7(a) shows the result with the edit distance and operations derived by the HDM function. Compared with Table 7(b), the numbers of correct predictions for F, R, and B, the diagonal numbers of the matrices, are larger in Table 7(a).
Table 4: Accuracies and QA performance scores on the subtasks. CV stands for the average accuracy on 5-fold cross-validation using training data, and AC stands for the accuracy on formal-run data. A superscript on an accuracy value indicates the result of a submitted system output:
a) IBM-JA-BC-01 (51.6), b) IBM-JA-BC-02 (52.6), c) IBM-JA-BC-03 (50.0),
d) IBM-JA-MC-01 (45.5), e) IBM-JA-MC-02 (51.1), f) IBM-JA-MC-03 (44.8),
g) IBM-JA-EXAM-01, h) IBM-JA-EXAM-02, i) IBM-JA-EXAM-03,
j) IBM-JA-RITE4QA-01, k) IBM-JA-RITE4QA-02, and l) IBM-JA-RITE4QA-03.
Values in parentheses are the submitted formal-run results of the BC/MC subtasks before the bug fix. Note that the accuracy of IBM-JA-BC-02 with the bug is the same as that without the bug.
[Table 4 body omitted: per-feature-set CV and AC values (and TOP1/MMR5 for RITE4QA) for the test data BC (trained on BC and BC+MC'), MC (trained on MC and MC+BC'), EXAM, and RITE4QA (trained on BC and BC+MC'); rows cover each cost function (None, HDM, WO, BGH, Ontology, HDM & WO, HDM & BGH, HDM & BGH & WO, HDM & Ontology & WO, HDM & BGH & Ontology & WO) with the feature sets EDO Count / EDO Binary, optionally with POS fine, incrementally adding +Word, +Sentiment, +Char, +Head, +PAS, and +Temporal, plus the all-features-except-EDO baseline.]
Table 7: The confusion matrices in the MC task with and without the edit distance and operations.

(a) With edit distance and operations
               Correct
  System     F    R    B    C    I
    F       81    5   26   29   40
    R        3   88   18   11   18
    B        9   13   28   14    5
    C       11    0    2    5    5
    I        6    4    1    6   12

(b) Without edit distance and operations
               Correct
  System     F    R    B    C    I
    F       59   28   47   23   25
    R       17   60    6   12   26
    B       16    9   16   13    7
    C       12    3    5   10    9
    I        6   10    1    7   13

Table 8: The average difference in the number of bunsetsus and the average edit distance of each cost function on the test data of each subtask.

              BC     MC    EXAM   RITE4QA
  |s| − |t|   4.68   0.44   6.17    9.11
  HDM         7.54   6.43  11.64   12.47
  BGH         8.52   7.47  12.59   13.58
  WO          8.32   7.26  12.82   13.49
  Ontology    7.53   6.42  11.64   12.47

Table 9: Examples of sentence pairs containing phrases with similar meanings.

  S: バジル、セージ、タイムなどフレッシュハーブの消費量はウナギのぼりである。
     `The consumption of fresh herbs such as basil, sage, and thyme is rapidly increasing.'
  T: ハーブの消費が増えている。
     `The consumption of herbs is increasing.'

  S: 今回の事件では、男子学生が「ハロウィーンの夜にやってやる」と知人に漏らすなど、決行日を選んだ上で、事件を起こしていたことが明らかになっている。
     `In this incident, it is clear that the boy student chose the day to act in advance and said to his acquaintance, "I am going to act at the Halloween night".'
  T: 男子学生は決行日を10月31日に選んだ。
     `The boy student decided to act on October 31.'

Table 10: The confusion matrices in the EXAM subtask when either Temporal feature value is true.

(a) fmatch = 1
               Correct
  System     Y    N
    Y       21    4
    N        3    4

(b) funmatch = 1
               Correct
  System     Y    N
    Y        0    0
    N        1   29
However, as described in Section 5, the EDO features are not effective in the other subtasks. One reason for their effectiveness in the MC subtask could be the simplicity of the syntactic differences observed in the MC data. As an index of this simplicity, the difference in the number of bunsetsus between S and T on the test data, denoted as |s| − |t|, is shown in the second row of Table 8. Note that these numbers are similar to those on the training data. The value of |s| − |t| for the MC data is nearly zero (0.44) and much smaller than that for the other data. Since our cost functions for insertion and deletion are simply set to one and the cost functions for substitution are less than or equal to one, our cost design prefers substitutions and tends to overlook possibly important insertions and deletions. However, if the number of nodes and the tree structures of S and T are similar, an edit distance computation with few insertions and deletions can be a natural alignment for exploiting the syntactic relations of the two sentences. On the other hand, the syntactic differences in the other data may be much more complicated than those in the MC data. This complexity of syntactic differences is also a possible explanation of why the LR models trained on the converted data, BC' and MC', do not perform well on the different subtasks.

We compared the two resources used for the semantic-similarity-based cost function: the ontology constructed from Wikipedia and Bunrui-goi-hyo. Although the ontology has a larger vocabulary than Bunrui-goi-hyo, there was little difference in their contribution to the final results. One of the reasons is mismatches between the words in the resources and the result of the morphological analysis. For example, the ontology contained グスタフ・シュトレーゼマン (`Gustav Stresemann'), but our system split it into three morphemes: グスタフ (`Gustav'), ・, and シュトレーゼマン (`Stresemann'). To increase the coverage of similarity detection, it is important to use the same unit for resource construction and morphological analysis, or to adjust the result of the morphological analysis to the resource unit.

Our system with the two resources did not recognize some phrase relations with similar meanings. Table 9 exemplifies two sentence pairs which have phrases with similar meanings. In the former pair, the key to analyzing their relation is to recognize that ウナギのぼりである (`be rapidly increasing') is similar in meaning to 増えている (`be increasing'). The other pair suggests the need for the knowledge that Halloween is October 31. It is important to arrange resources constructed from various perspectives for the recognition of phrase-level correspondences.
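The cost design discussed above (insertion and deletion cost 1, substitution cost at most 1) can be illustrated with a simplified sequence-level sketch. The actual system computes tree edit distance over dependency trees [12], and the character-overlap substitution cost below is only a hypothetical stand-in for the WO, BGH, HDM, and Ontology cost functions:

```python
def substitution_cost(a: str, b: str) -> float:
    """Hypothetical cost in [0, 1]: cheaper for more similar bunsetsus."""
    if a == b:
        return 0.0
    overlap = len(set(a) & set(b))
    return 1.0 - overlap / max(len(set(a)), len(set(b)))

def edit_distance(s: list, t: list) -> float:
    """Edit distance between two bunsetsu sequences.
    Insertion/deletion cost 1; substitution cost <= 1, so the
    resulting alignment prefers substitutions, as discussed above."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
    return d[n][m]
```

When S and T have the same length, the cheapest alignment under this design uses only substitutions, which is the behavior the MC discussion attributes to the system.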
One remark is that, although we designed four different cost functions, WO, BGH, HDM, and Ontology, the differences among the resulting edit distances are small, as shown in Table 8. We also could not observe the superiority of any cost function in the final entailment classifications.

The feature set `Temporal' worked very well in the EXAM subtask, due to the frequent appearance of expressions of the year, decade, or century in the EXAM data. Table 10(a) and Table 10(b) show the relations between the system outputs and the correct tags when the temporal expressions matched (fmatch = 1) and when they did not overlap (funmatch = 1), respectively. These results show that a mismatch of the temporal expressions is a strong hint for the non-entailment relation. This is because T in the dataset with the `N' label often includes a statement with a wrong year, for example, "The Constitution of the U.S. was established in the early 19th century," while the corresponding S includes the correct year (1788 in this case).
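The Temporal feature pair can be sketched as follows. This is a simplified illustration that only handles 4-digit year expressions; the feature names follow the paper, but the extraction rule is an assumption made for this sketch:

```python
import re

# Hypothetical simplification: treat 4-digit years (optionally
# followed by 年) as the temporal expressions.
YEAR = re.compile(r"([12][0-9]{3})年?")

def temporal_features(t: str, h: str):
    """f_match = 1 if T and H share a temporal expression;
    f_unmatch = 1 if both contain one but none overlaps."""
    years_t = set(YEAR.findall(t))
    years_h = set(YEAR.findall(h))
    f_match = int(bool(years_t & years_h))
    f_unmatch = int(bool(years_t) and bool(years_h)
                    and not (years_t & years_h))
    return f_match, f_unmatch
```

In the U.S. Constitution example above, the pair (1788 vs. a 19th-century claim) would set f_unmatch, the strong hint for `N'.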
Table 11: Pearson correlation coefficient between the average accuracy on 5-fold CV using the training data and the performance measures using formal-run data, based on 280 different feature sets.

Test Data   Training Data   Perf. Measure   Correlation
BC          BC              AC                0.43
BC          BC+MC'          AC                0.30
MC          MC              AC                0.57
MC          MC+BC'          AC                0.49
EXAM        EXAM            AC                0.74
RITE4QA     BC              AC               -0.34
RITE4QA     BC              TOP1             -0.42
RITE4QA     BC              MMR5             -0.44
RITE4QA     BC+MC'          AC                0.23
RITE4QA     BC+MC'          TOP1             -0.20
RITE4QA     BC+MC'          MMR5             -0.31
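The correlation values in Table 11 are standard Pearson coefficients between two score lists with one entry per feature set (CV accuracy vs. formal-run measure). A minimal computation sketch, assuming neither list is constant:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length,
    non-constant score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```

A value near zero or below zero, as in the RITE4QA rows, means that feature sets that look good on cross-validation give no reliable signal about formal-run quality.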
Another remark is that even in the cases of fmatch = 1, more than half of them are correctly predicted as `N'. This suggests the advantage of a learning method that handles this kind of strong tendency as a preference, instead of as a constraint imposed by a hand-tailored rule.

In our feature sets, there is only a weak relationship between the accuracies on the training data and the formal-run performance in the BC and RITE4QA subtasks. Although feature sets that perform well on CV also show good formal-run performance on the MC and EXAM subtasks, the best feature sets on CV and on the formal run differ on the BC and RITE4QA subtasks. Table 11 shows the correlation between the average accuracy on 5-fold CV using the training data and the performance measures on the formal-run data, based on 280 different feature sets. For example, the last column shows that there is a negative correlation between the average accuracy using the BC+MC' training data and the MMR5 score on the RITE4QA test data. The correlation values suggest that our feature representation of the training data in the BC and RITE4QA subtasks is not good enough to induce general classification rules or, at least, the classification rules for the specific test data.

7. CONCLUSION AND FUTURE WORK
In this paper, we proposed a classification approach for NTCIR-9 RITE. In the experiments, we showed that tree edit distance and operations were effective features for the MC subtask and that temporal expression matching was an effective feature for the EXAM subtask.

One drawback of the tree edit distance approach is the difficulty of designing the cost functions. Although we implemented four different cost functions, the differences among these edit distances were small, and the edit-distance-based features were effective only on the MC subtask. Supervised learning of the edit cost function is one of the interesting research directions [7].

In the EXAM subtask, the Temporal feature set worked to increase the accuracy, and the combined use of temporal and geographic expressions is an interesting research line. However, this is still far from a complete understanding of the question and of world knowledge. To make the system more general and robust, we need to seek ways to handle the semantics of whole sentences, and that of pairs of whole sentences.

8. REFERENCES
[1] T. Akutsu. Tree edit distance problems: Algorithms and applications to bioinformatics. IEICE Transactions on Information and Systems, E93-D(2):208-218, 2010.
[2] F. Gey, R. Larson, N. Kando, J. Machado, and T. Sakai. NTCIR-GeoTime overview: Evaluating geographic and temporal search. In Proceedings of the 8th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, 2010.
[3] H. Kanayama, T. Nasukawa, and H. Watanabe. Deeper sentiment analysis using machine translation technology. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), pages 494-500, 2004.
[4] H. Kanayama, K. Torisawa, Y. Mitsuishi, and J. Tsujii. A hybrid Japanese parser with hand-crafted grammar and statistics. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 411-417, 2000.
[5] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627-650, 2008.
[6] Y. Mehdad and B. Magnini. Tree edit distance for recognizing textual entailment: Estimating the cost of insertion. In PASCAL RTE-2 Challenge, 2006.
[7] Y. Mehdad and B. Magnini. Optimizing textual entailment recognition using particle swarm optimization. In Proceedings of the 2009 Workshop on Applied Textual Inference, pages 36-43, 2009.
[8] National Language Research Institute. Bunrui-goi-hyo (revised and enlarged edition), 1996. (In Japanese).
[9] Y. Shibaki. Constructing large-scale general ontology from Wikipedia. Master's thesis, Nagaoka University of Technology, Japan, 2011.
[10] H. Shima, H. Kanayama, C. Lee, C. Lin, T. Mitamura, Y. Miyao, S. Shi, and K. Takeda. Overview of NTCIR-9 RITE: Recognizing Inference in TExt. In NTCIR-9 Proceedings, 2011.
[11] F. M. Zanzotto, M. Pennacchiotti, and A. Moschitti. A machine learning approach to textual entailment recognition. Natural Language Engineering, 15(4):551-582, 2009.
[12] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245-1262, 1989.