Syntactic Difference Based Approach for NTCIR-9 RITE Task

Yuta Tsuboi, Hiroshi Kanayama, Masaki Ohno
IBM Research - Tokyo, IBM Japan, Ltd.
1623-14 Shimotsuruma, Yamato-shi, Kanagawa, Japan
{yutat,hkana,moono}@jp.ibm.com

Yuya Unno
Preferred Infrastructure
2-40-1-4F Hongo, Bunkyo-ku, Tokyo, Japan
[email protected]

ABSTRACT

This paper describes the IBM team's approach to the textual entailment recognition task (RITE) in NTCIR-9 [10], with experimental results for four Japanese subtasks: BC, MC, EXAM, and RITE4QA. To tackle a data set with complicated syntactic and semantic phenomena, the authors used a classification method to predict entailment relations between two different texts. These features were used for classification: (1) tree edit distance and operations, (2) word overlap ratios and word pairs, (3) sentiment polarity matching, (4) character overlap ratios, (5) head comparisons, (6) predicate-argument structure matching, and (7) temporal expression matching. Feature (1) reflects the operations in the edit distance computation between the text and the hypothesis, which can capture the syntactic differences between two sentences. In the RITE task, Feature (1) is effective for the MC subtask and Feature (7) is effective for the EXAM subtask.

Keywords
NTCIR, RITE, entailment, machine learning, syntax, tree edit distance

Team Name: IBM
Subtasks: Japanese BC, MC, EXAM, and RITE4QA
External Resources: Bunrui-goi-hyo and Shibaki's ontology

1. INTRODUCTION

Textual entailment is a very complex natural language phenomenon. Although supervised classifiers have been used for this task, textual entailment recognition requires detecting complex syntactic and semantic relations between two different texts. To exploit these relations, we train classifiers in the pair feature space [11], in which a text pair is represented.

In Section 2, we explain our supervised machine learning approach. Since the number of examples is usually limited in the task, we propose the conversion of the labeled data between different entailment tasks. In Section 3, we propose to use the operations in the tree edit distance computation between two dependency parse trees as pair features to exploit the syntactic differences of a sentence pair. We also employed several other kinds of pair features, ranging from a lexical similarity to the matching of temporal expressions; we explain these pair features in Section 4. We report experiments with our approach on the NTCIR-9 RITE task in Section 5 and discuss the experimental results in Section 6. We show that the use of operations based on edit distance was effective for the MC subtask and that the temporal expression matching worked well for the EXAM subtask. In Section 7, we conclude our work and discuss future directions.

2. SUPERVISED MACHINE LEARNING

Since the BC, EXAM, and RITE4QA subtasks are binary classification problems (true or false entailment) and the MC subtask is a multi-class classification problem, a supervised machine learning approach is employed in these tasks. A pair of two different sentences (denoted as S and T [1]) is represented in a pair feature space [11], and a logistic regression (LR) model is trained using labeled examples. Let x ∈ X be the feature representation of a pair of S and T, y ∈ Y be an entailment label of a label set Y, and ϕ(x, y), of dimension |X| × |Y|, be the cartesian product of x and a label assignment vector. The LR model represents a conditional probability P(y|x) in a log-linear form:

    P_θ(y|x) = (1/Z) exp(θ^⊤ ϕ(x, y)),    (1)

where θ is the parameter vector of the LR model and u^⊤ v denotes the inner product of the vectors u and v. Note that the denominator Z is the partition function:

    Z = Σ_{y∈Y} exp(θ^⊤ ϕ(x, y)).

[1] S and T correspond to T1 and T2 in the RITE data set, respectively.
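As a minimal sketch of the log-linear model in Eq. (1) (not the authors' implementation; the function names and the dense block encoding of ϕ(x, y) are illustrative assumptions):

```python
import math

def pair_feature_map(x, y, num_labels):
    """phi(x, y): the pair-feature vector x placed in the block for label y."""
    phi = [0.0] * (len(x) * num_labels)
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def label_probabilities(theta, x, num_labels):
    """P(y|x) = exp(theta . phi(x, y)) / Z, as in Eq. (1)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    scores = [dot(theta, pair_feature_map(x, y, num_labels))
              for y in range(num_labels)]
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                                # partition function Z
    return [e / z for e in exps]
```

With θ = 0 this returns the uniform distribution over labels, and the probabilities always sum to one, matching the normalization by Z.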
Using training data E ≡ {(x, y)}, the parameter θ can be estimated by the maximization of the regularized log-likelihood:

    Σ_{(x,y)∈E} ln P_θ(y|x) − ||θ||² / (2σ²),    (2)

where the final term is a Gaussian prior on θ with mean 0 and variance σ². In the experiments, θ was optimized by Newton-CG methods [5], and the hyper-parameter σ was determined by grid search based on 5-fold cross-validation (CV).

Since the amount of training data is limited in the RITE task, we convert the training data for the MC subtask into additional training data for the BC subtask, and vice versa. For this conversion, we assumed the interchangeability of the labels between the BC and MC subtasks.

Table 1: Data conversion. The left columns denote the original relations between S and T, and the right columns denote the converted relations. The relation symbols Y and N denote true and false entailment; F, R, and B denote forward, reverse, and bidirectional entailment; C and I denote contradiction and independence, respectively.

(a) From MC data to BC' data
  MC relation | S → T | T → S
  F           |   Y   |   N
  R           |   N   |   Y
  B           |   Y   |   Y
  C           |   N   |   N
  I           |   N   |   N

(b) From BC data to MC' data
  BC relation (S → T) | MC relation
  Y                   | F, B
  N                   | R, C, I

Figure 1: Edit distance minimization on trees.
  (S)  愛する `love'                    (T)  先生だ `be teacher'
       ├─ 彼が `He-NOM'                      └─ 彼は `He-TOPIC'
       └─ 先生を `teacher-ACC'

By the data conversion from the MC subtask to the BC subtask, two examples for the BC subtask are generated from one labeled example of the MC subtask, since MC labels convey richer entailment information than BC labels (see Table 1(a)). We denote the combined training data of the BC data and the examples generated from the MC data as BC+MC' data. By the data conversion from the BC subtask to the MC subtask, one example with ambiguous labels for the MC subtask is generated from one labeled example of the BC subtask (Table 1(b)). We denote the combined training data of the MC data and the examples generated from the BC data as MC+BC' data.
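The label conversion of Table 1 can be sketched as follows (hypothetical function names; the 'S->T'/'T->S' direction tags are illustrative):

```python
def mc_to_bc(mc_label):
    """Table 1(a): one MC-labeled pair yields two BC' examples,
    one for each direction (S -> T and T -> S)."""
    table = {
        'F': ('Y', 'N'),   # forward entailment
        'R': ('N', 'Y'),   # reverse entailment
        'B': ('Y', 'Y'),   # bidirectional entailment
        'C': ('N', 'N'),   # contradiction
        'I': ('N', 'N'),   # independence
    }
    s_to_t, t_to_s = table[mc_label]
    return [('S->T', s_to_t), ('T->S', t_to_s)]

def bc_to_mc(bc_label):
    """Table 1(b): a BC label maps to an ambiguous MC' label set."""
    return {'Y': {'F', 'B'}, 'N': {'R', 'C', 'I'}}[bc_label]
```

The ambiguous sets returned by `bc_to_mc` are what the marginal log-likelihood objective below is designed to train on.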
To deal with the label ambiguities in the BC' data, we extend the LR objective function (2). Let L ⊆ Y be a set of label candidates. Using training data E′ ≡ {(x, L)} with ambiguous labels, we maximize the regularized marginal log-likelihood:

    Σ_{(x,L)∈E′} ln Σ_{y∈L} P_θ(y|x) − ||θ||² / (2σ²).

Although this is a non-convex optimization problem, we can still use Newton-CG methods to find a local optimum. Note that, for the MC+BC' data, we use a subset of the MC data as the validation set of CV.

3. SYNTACTIC DIFFERENCE

3.1 Edit distance

In this work, the syntactic differences between two sentences, S and T, are represented by the edit operations used in the tree edit distance calculation on their dependency trees. The syntactic tree is constructed by a Japanese syntactic parser [4] which uses both a hand-crafted grammar and a statistical model created from a tagged corpus. The parser produces dependency structures between bunsetsus, so that S and T are represented as trees consisting of bunsetsus as nodes.

We first explain the notation here. Let f(S, T) be a map function from a pair of S and T to a feature space x ∈ X. Let s = (s_1, ..., s_|s|) be the sequence of bunsetsus of length |s| in S and t be that of T, where s ∈ s and t ∈ t consist of the morpheme sequences s = (m^s_1, ..., m^s_|s|) and t = (m^t_1, ..., m^t_|t|), respectively.

The tree edit distance between the two trees s and t is calculated based on edit operations, each of which is an insertion, a deletion, or a substitution, in a similar way to edit operations for strings. We define a cost function γ(s, t) for the edit operation substituting s with t. Costs for deletion and insertion operations can be denoted as special cases: γ(s, ε) for the cost of deleting a node s and γ(ε, t) for the cost of inserting a node t. A set of edit operations M ⊆ s × t represents an edit distance mapping. Let D ⊆ s and I ⊆ t be the sets of nodes not appearing in M due to their existence in only one tree. The edit distance δ(s, t) is defined as the minimum cost mapping [1]:

    δ(s, t) = min_M [ Σ_{(s,t)∈M} γ(s, t) + Σ_{s∈D} γ(s, ε) + Σ_{t∈I} γ(ε, t) ].

Zhang and Shasha's algorithm [12] can calculate the edit distance δ(s, t) between s and t in O(|s|² |t|²) time and space.
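Zhang and Shasha's tree algorithm is more involved, but the same minimum-cost-mapping idea can be illustrated with the sequence analogue of δ(s, t): a standard dynamic program with unit insertion/deletion costs and a pluggable substitution cost γ (a sketch, not the authors' code):

```python
def edit_distance(s, t, gamma):
    """Minimum-cost alignment of sequences s and t, where gamma(a, b)
    is the substitution cost and each deletion/insertion costs 1."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete s[0..i-1]
    for j in range(1, m + 1):
        d[0][j] = j                      # insert t[0..j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + gamma(s[i - 1], t[j - 1]))
    return d[n][m]
```

With a 0/1 substitution cost this reduces to the classic Levenshtein distance; the cost functions of Section 3.2 plug in as `gamma`.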
The edit distance computation between two syntactic trees corresponds to a word alignment between the two sentences, because the substitution from s to t can be regarded as the alignment of the two nodes s and t. Such operations on syntactic trees are advantageous compared to the word-to-word alignment between sentences as sequences of bunsetsus. For example, suppose the sentences S and T are 彼が 先生を 愛する (`He loves a teacher') and 彼は 先生だ (`He is a teacher'), respectively. When the edit distance is calculated on the sequences of bunsetsus, the edit operations with the lowest cost should be (1) substitution of 彼が with 彼は, (2) substitution of 先生を (`teacher-ACC') with 先生だ (`be a teacher'), and (3) deletion of 愛する (`love'), under a rule that the replacement of functional words can be done with a small cost; each pair of the two common content words 彼 (`he') and 先生 (`teacher') is then connected in the two sentences. However, the semantic roles of 先生 (`teacher') are different in these two sentences, so this alignment is not desirable for capturing the difference between the two sentences.

In contrast, when the edit distance is calculated on the dependency trees shown in Figure 1, the optimal operations are (1) substitution of 彼が (`he-NOM') with 彼は (`he-TOPIC'), (2) deletion of 先生を, (3) deletion of 愛する, and (4) insertion of 先生だ, since the word 先生 appears in different positions in S and T, and thus the tree T cannot be generated with a substitution operation of the nodes containing 先生. This calculation successfully avoids the connection between the two nodes containing the word 先生, which behaves differently in the two sentences.

3.2 Cost functions

Here we consider four types of cost functions: WO, BGH, HDM, and Ontology. These functions calculate the costs of substitution operations differently, but the costs for deletion and insertion are set to one, i.e. γ(s, ε) = 1 and γ(ε, t) = 1 in all cases. The feature representation based on the optimal edit operations will be shown in Section 4.

Jaccard distance metrics using word overlap (WO). One of the cost functions is defined as the Jaccard distance between the morpheme sets of two bunsetsus s and t:

    γ(s, t) = 1 − |s ∩ t| / |s ∪ t|.

Semantic distance metrics using Bunrui-goi-hyo (BGH). Another method to calculate the cost is based on the semantic similarity between a pair of bunsetsus. Here we use the Japanese thesaurus, the extended version of Bunrui-goi-hyo [8], which assigns semantic codes to around 230,000 words. We focus on the head content words when we calculate the semantic similarity between a pair of bunsetsus. A head content word is defined as the rightmost content word in a bunsetsu, and a content word is a noun, a verb, an adjective, an adverb, an interjection, or one of their variants. Given two bunsetsus s and t, let c_s and c_t be their Bunrui-goi-hyo codes and common(c_1, c_2) be the common depth in the thesaurus tree, e.g. common(c_s, c_t) = 3 when c_s = '3630' and c_t = '36352'. We define a cost function as follows, giving a smaller cost to more similar bunsetsus:

    γ(s, t) = 1 / (1 + common(c_s, c_t)).

Heuristic distance metrics (HDM). Another way to measure the tree edit distance is a combination of the BGH metrics and other conditions, comparing head content words and parts-of-speech. We call this method HDM (Heuristic Distance Metrics), which is calculated as follows:

    γ(s, t) = 0                                  (same word)
    γ(s, t) = 0.7 − 0.05 · common(c_s, c_t)      (same POS)
    γ(s, t) = 1.0                                (otherwise)

where `same word' means that the canonical forms of the head content words of s and t are the same, `same POS' means that their parts-of-speech are the same, and common() is as defined in the BGH part.

Semantic distance metrics using an ontology (Ontology). We also attempt to measure semantic similarity using another resource and combine it with HDM to define another cost function. To measure the semantic similarity, we use an ontology automatically generated from Wikipedia with Shibaki's method [9]. It defines is-a relations among about 730,000 words, and it has up-to-date knowledge since it was constructed from Wikipedia, which is actively updated. Given a pair of bunsetsus s and t, we focus on each head content word and calculate their semantic similarity by the shortest path length in the ontology. Let spl(s, t) be a function which returns the shortest path length in the ontology between the head content words of the two input bunsetsus; then the following condition is newly inserted after the first condition in the cost function of the HDM method defined above. Note that it is valid only when the head content words of both bunsetsus are found in the ontology:

    γ(s, t) = 0.7 − 0.5 · (1 / spl(s, t))        (a path is found)

4. PAIR FEATURES

In this section, we describe the representation of the edit operations and the other features used in the RITE task. The names in parentheses denote the IDs of the feature sets which are referred to in the experimental results.

Edit distance and operations (EDO). The edit distance and operations of Section 3 are represented as elements of a feature vector. We use the normalized value of the edit distance, δ(s, t)/max(|s|, |t|), ranging from 0 to 1. Although the value of the tree edit distance is used as a feature in previous work [6], we also use the edit operations as pair features. For each bunsetsu in both D and I, we use the base form and part-of-speech (POS) tag of each morpheme and those of the last morpheme as a representation of the bunsetsu. For each pair of bunsetsus in M, we use 1) the sequence pair of the base forms, 2) the sequence pair of the POS tags, 3) the pair of the base forms of the last morphemes, and 4) the pair of the POS tags of the head content words. Note that the granularity of the POS tags described above is coarse, such as noun or verb. We also used more fine-grained POS tags, which are denoted as POS fine in the experimental results.

The value of an edit operation feature is either the number of the edit operations, denoted as Count, or a binary value, denoted as Binary. To make sure its value ranges from 0 to 1, Count is divided by max(|s|, |t|). Since the edit operation features vary depending on the cost functions defined in Section 3, we also employed combinations of multiple feature sets based on different cost functions to represent a pair of S and T.

Overlap ratios of words and word pairs (Word). Let m_S and m_T be the sets of content words in S and T, respectively. The value of the word overlap ratio feature is defined as |m_S ∩ m_T| / |m_T|. The word pair features are all the combinations {(m_s, m_t) | m_s ∈ m_S, m_t ∈ m_T} of the content words. We use only the word pair features appearing more than once in the training data.
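As a concrete illustration of the substitution costs of Section 3.2, the WO and HDM cost functions might be sketched as follows (assumed input shapes: morpheme lists for WO; (canonical form, POS) head tuples and a precomputed Bunrui-goi-hyo common depth for HDM, with the thesaurus lookup itself not shown):

```python
def gamma_wo(s_morphs, t_morphs):
    """WO: Jaccard distance between the morpheme sets of two bunsetsus."""
    s, t = set(s_morphs), set(t_morphs)
    return 1.0 - len(s & t) / len(s | t)

def gamma_hdm(s_head, t_head, common_depth):
    """HDM over (canonical form, POS) pairs for the head content words;
    common_depth is the common depth of the Bunrui-goi-hyo codes."""
    if s_head[0] == t_head[0]:       # same canonical form
        return 0.0
    if s_head[1] == t_head[1]:       # same part of speech
        return 0.7 - 0.05 * common_depth
    return 1.0
```

Either function can be passed as the substitution cost to an edit distance routine; deletion and insertion keep their fixed cost of one.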
Polarity matching with sentiment detection (Sentiment). We also introduced a feature to confirm the semantic orientations of the two sentences. Intuitively, if S entails T and S has a positive or negative polarity, T is likely to have the same polarity. We used an existing sentiment detection engine [3], which detects positive or negative clauses in a sentence with high precision. The engine handles not only subjective utterances but also facts that suggest positive or negative attributes. In most cases no polarity is detected, and thus this feature is expected to work mainly as a negative clue for the entailment when S and T have opposite polarities. This feature set consists of the following features:

  f_pol: the combination of the polarities of S and T
  f_same: whether S and T have the same polarity (e.g. positive vs. positive)
  f_diff: whether S and T have different polarities (e.g. positive vs. neutral)
  f_opp: whether S and T have opposite polarities (e.g. positive vs. negative)

Table 2 shows examples of the assignment of `Sentiment' features for some sentence pairs.

Table 2: Example of `Sentiment' feature set.
  S: PET(ポジトロン断層撮影)検査は、肺がん、大腸がん、食道がんなど、ほとんどのがんの診療に有効とされている。 `The PET (positron emission tomography) is believed to be effective for the care of most types of cancers such as ...'
  T: PETはがんの診断に役立っている。 `PET helps the care of cancers.'
  Features: f_pol = (+, +), f_same = 1

  S: 失われた10年は立ち遅れた反省や経験が生かされ、無駄でなかった。 `The Lost Decade was not wasted because ...'
  T: 失われた10年は無駄だった。 `We learned nothing from the Lost Decade.'
  Features: f_pol = (+, −), f_diff = 1, f_opp = 1

  S: 日本で臓器移植法が施行されて7年以上になる。 `The organ transplantation law has been effective for more than 7 years in Japan.'
  T: 日本で臓器移植法は施行された。 `The organ transplantation law became effective in Japan.'
  Features: f_pol = (+, 0), f_diff = 1

Overlap ratios of characters (Char). Let c_S and c_T be the sets of characters in S and T, respectively. The value of the character overlap ratio feature is defined as |c_S ∩ c_T| / |c_T|.

Head comparisons (Head). The head bunsetsu of a sentence, the rightmost bunsetsu in the Japanese language, usually conveys the main concept of the sentence, and therefore we use a feature to examine whether the head bunsetsus of the two sentences are the same. When two bunsetsus are compared, both words are converted to their canonical forms, and differences in some symbols such as commas are ignored.

Fulfillment tests with predicate-argument structures (PAS). The syntactic tree is converted to a set of predicate-argument structures to examine whether all the information in T is covered by S. A predicate-argument structure used here is of either of these two types:

  predicate type: a bunsetsu led by a verb or an adjective as a predicate, and zero or more postpositional phrases with a case marker as arguments
  modification type: a modifier bunsetsu and a modifiee bunsetsu, such as an adverbial modification

For example, the sentence (3) is converted to the following set of predicate-argument structures. (P1) is of the predicate type, and (P2) and (P3) are examples of the modification type.

    彼は 大きな 駅へ ゆっくり 行った。 (`He went to a big station slowly.')    (3)

    (P1) 行く (彼, 駅) (`go (he, station)')
    (P2) 大きな ⟨駅⟩ (`big ⟨station⟩')
    (P3) ゆっくり ⟨行く⟩ (`slowly ⟨go⟩')

We use a feature f_PAS for the fulfillment test. f_PAS = 1 only if all of the predicate-argument structures in T are subsumed by one of those in S, and otherwise f_PAS = 0. Here, "p_1 subsumes p_2" means that p_1 and p_2 have the same predicate and all of the arguments of p_2 also appear in p_1. For instance, the predicate-argument structure 行く (駅) is subsumed by (P1) in the above example. Table 3 exemplifies sentence pairs in which f_PAS = 1. This feature is expected to be a strong clue for the entailment.

To improve the coverage of this subsumption, the introduction of hyponym and hypernym relations using WordNet was attempted; however, few pairs of nouns in S and T matched the relations, so we gave up using WordNet for this feature.

Table 3: Example of sentence pairs that activate the feature f_PAS.
  S: スーザン・トレスさんは極めて悪性度の高いがんの一種メラノーマが脳に広がり、脳死になった。 `Ms. Susan Torres became brain dead due to melanoma ...'
  T: スーザン・トレスさんは脳死になった。 `Ms. Susan Torres became brain dead.'

  S: 山田洋次監督は男泣きの場面を作るのがうまい。 `Director Yoji Yamada is good at making scenes of men's weeping.'
  T: 山田洋次は映画監督です。 `Yoji Yamada is a film director.'
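The fulfillment test can be sketched as follows, representing each predicate-argument structure as a (predicate, arguments) pair (a hypothetical encoding; modification types would be encoded the same way, with the modifiee as the argument):

```python
def subsumes(p1, p2):
    """p1 subsumes p2: same predicate and every argument of p2 is in p1."""
    pred1, args1 = p1
    pred2, args2 = p2
    return pred1 == pred2 and set(args2) <= set(args1)

def f_pas(pas_s, pas_t):
    """1 iff every predicate-argument structure in T is subsumed by
    some structure in S (the fulfillment test), else 0."""
    return int(all(any(subsumes(p_s, p_t) for p_s in pas_s)
                   for p_t in pas_t))
```

For the example above, ('go', ('station',)) is subsumed by ('go', ('he', 'station')), so the reduced hypothesis passes the test while the reverse direction fails.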
Matching of temporal expressions (Temporal). As proven by the success of the GeoTime task in the previous NTCIR [2], temporal information has an important role in information retrieval. Many temporal expressions actually appear in the development set of the data, especially in the EXAM subtask, and thus we added a feature set to compare the temporal expressions in the sentence pairs, with two features:

  f_match: temporal expressions appear in both S and T, and at least one pair has an overlap
  f_unmatch: temporal expressions appear in both S and T, and none of the pairs have an overlap

where the overlap is determined based on the ranges of the years, using the following patterns:

  Year: N 年 (`the year of N') is converted to the year range [N, N]. The Japanese calendar scheme is also covered; for example, 昭和 50 年 is converted to [1975, 1975].
  Decade: N 年代 (`the decade from N') is converted to the year range [N, N + 9]. The suffixes 前半 (`the first half') and 後半 (`the latter half') are also considered; for example, 1920 年代前半 is converted to the year range [1920, 1924].
  Century: N 世紀 (`the Nth century') is converted to the year range [100(N − 1) + 1, 100N]. The suffixes 前半, 後半 and some other variations such as 初頭 (`beginning') reduce the width of the year range.
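A simplified sketch of this year-range normalization and overlap test (ASCII-digit patterns only; the Japanese-calendar conversion such as 昭和 50 年 and variations such as 初頭 are omitted):

```python
import re

def year_range(expr):
    """Convert a temporal expression to an inclusive year range (lo, hi)."""
    if m := re.fullmatch(r'(\d+)年', expr):            # N-nen (year)
        n = int(m.group(1))
        return (n, n)
    if m := re.fullmatch(r'(\d+)年代(前半|後半)?', expr):  # N-nendai (decade)
        n = int(m.group(1))
        if m.group(2) == '前半':
            return (n, n + 4)
        if m.group(2) == '後半':
            return (n + 5, n + 9)
        return (n, n + 9)
    if m := re.fullmatch(r'(\d+)世紀(前半|後半)?', expr):  # N-seiki (century)
        n = int(m.group(1))
        lo, hi = 100 * (n - 1) + 1, 100 * n
        if m.group(2) == '前半':
            return (lo, lo + 49)
        if m.group(2) == '後半':
            return (hi - 49, hi)
        return (lo, hi)
    return None

def ranges_overlap(r1, r2):
    """f_match fires when the two year ranges intersect."""
    return max(r1[0], r2[0]) <= min(r1[1], r2[1])
```

On the worked example below, 1620年代 maps to [1620, 1629] and 17世紀前半 to [1601, 1650], and the two ranges intersect.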
For example, when the two sentences S and T include the temporal expressions 1620 年代 (`1620s') and 17 世紀前半 (`the first half of the 17th century'), their ranges are [1620, 1629] and [1601, 1650], respectively, and thus the feature f_match is set to true since the year ranges overlap.

5. EXPERIMENTAL RESULTS

We employ several combinations of the cost functions for edit distance described in Section 3 and the pair feature sets described in Section 4. We basically selected the feature sets for the formal-run submissions by the average accuracies on 5-fold cross-validation (CV) using the training data.

Table 4 shows the average accuracies on CV and the formal-run results, and Tables 5 and 6 show the confusion matrices of the systems using the feature set with the best CV accuracy. Although we evaluated 280 feature sets for all the subtasks, because of the limitation of space, we only report the results of the submitted systems for the formal run, all of the feature sets using the HDM cost function, and the best feature sets at any performance measure in Table 4, and we report the confusion matrices of the feature set which marks the best CV accuracy in each subtask. In Table 4, the submitted system outputs are denoted by superscripts on the accuracy values. Since we found and fixed bugs after the formal run for the BC and MC subtasks, we report the revised scores obtained after the bug fix.

Table 5: The confusion matrices in the BC, EXAM, and RITE4QA subtasks.

(a) BC                  (b) EXAM                (c) RITE4QA
        Correct                 Correct                 Correct
System   Y    N         System   Y    N         System   Y    N
  Y    137  125           Y    103   45           Y     65  521
  N    113  125           N     78  216           N     41  337

Table 6: The confusion matrix in the MC task.

            Correct
System    F    R    B    C    I
   F     87    6   21   30   40
   R      7   89   20    9   16
   B      7   12   30   16    6
   C      4    0    1    5    4
   I      5    3    3    5   14
In Table 4, the top column names denote test data, the second column names denote training data for the LR models, and the third column names denote performance measures, where CV stands for the average accuracy (%) on 5-fold cross-validation and AC stands for the accuracy (%) on formal-run test data.

5.1 BC subtask

In terms of the formal-run accuracy, the LR models trained on the BC data using the edit distance and operations (EDO) based on the BGH cost function together with most of the other pair features achieve relatively high accuracy, and the best system achieves 56.0% using binary-valued edit operation features. The LR models trained on the BC data perform better than the LR models trained on the BC+MC' data. Compared with the best system on the formal-run data, the best feature sets on the training-data CV show slightly worse performance (52.4% for the BC data and 51.6% for the BC+MC' data).

5.2 MC subtask

The LR models using the EDO features show much better performance than the LR models using all the features except EDO. For the LR models trained on the MC data, the formal-run accuracy averaged over all of the feature sets including EDO is 47.1%, compared with 35.9% for the systems without EDO. The best LR model achieves 51.6% accuracy on the formal-run test data. Again, the LR models trained on the MC data perform better than the LR models trained on the MC+BC' data.

5.3 EXAM subtask

Most of the LR models using the matching of temporal expressions (Temporal) perform better than those not using Temporal. The best LR models achieve 72.6% accuracy on the formal-run test data.

5.4 RITE4QA subtask

The LR model trained on the BC+MC' data using all the pair features except EDO shows the best performance: Accuracy = 63.5%, TOP1 = 18.1%, and MMR5 = 29.0%. We observed an inverse or weak correlation between the CV accuracy on the BC or BC+MC' training data and the accuracy or QA performance measures on the RITE4QA formal-run data.

6. DISCUSSION

The edit distance and operations (EDO) are significantly effective features in the MC subtask. To put it more precisely, the classification of forward (F), reverse (R), and bidirectional (B) entailment using those features is more accurate than that without the edit distance and operations. Table 7 shows the confusion matrices in the MC subtask with and without the edit distance and operations. As an example, Table 7(a) shows the result with the edit distance and operations derived by the HDM function. Compared with Table 7(b), the numbers of correct predictions (the diagonal elements of the matrices) for F, R, and B are larger in Table 7(a).
Table 4: Accuracies and QA performance scores on subtasks. CV stands for the average accuracy on 5-fold cross-validation using training data, and AC stands for the accuracy on formal-run data. The superscripts on accuracy values indicate the results of the submitted system outputs: a) IBM-JA-BC-01 (51.6), b) IBM-JA-BC-02 (52.6), c) IBM-JA-BC-03 (50.0), d) IBM-JA-MC-01 (45.5), e) IBM-JA-MC-02 (51.1), f) IBM-JA-MC-03 (44.8), g) IBM-JA-EXAM-01, h) IBM-JA-EXAM-02, i) IBM-JA-EXAM-03, j) IBM-JA-RITE4QA-01, k) IBM-JA-RITE4QA-02, and l) IBM-JA-RITE4QA-03. Values in parentheses are the submitted formal-run results of the BC/MC subtasks before the bug fix. Note that the accuracy of IBM-JA-BC-02 with the bug is the same as that without the bug.

(Layout of Table 4: the columns are the test data BC, MC, EXAM, and RITE4QA, each with its training data BC, BC+MC', MC, MC+BC', or EXAM, and the performance measures CV, AC, TOP1, and MMR5. The rows are the cost functions None, HDM, WO, BGH, Ontology, HDM & WO, HDM & BGH, HDM & BGH & WO, HDM & Ontology & WO, and HDM & BGH & Ontology & WO, each combined with feature sets built from EDO {Count, Binary} with coarse or fine POS plus Word, Sentiment, Char, Head, PAS, and Temporal. The best formal-run scores include 56.0 AC on BC, 51.6 AC on MC, 72.6 AC on EXAM, and 63.5 AC, 18.1 TOP1, and 29.0 MMR5 on RITE4QA.)
Table 7: The confusion matrices in the MC task with and without the edit distance and operations.

(a) With edit distance and operations
            Correct
System    F    R    B    C    I
   F     81    5   26   29   40
   R      3   88   18   11   18
   B      9   13   28   14    5
   C     11    0    2    5    5
   I      6    4    1    6   12

(b) Without edit distance and operations
            Correct
System    F    R    B    C    I
   F     59   28   47   23   25
   R     17   60    6   12   26
   B     16    9   16   13    7
   C     12    3    5   10    9
   I      6   10    1    7   13

Table 8: The average difference in the number of bunsetsus and the average edit distance of each cost function on the test data of each subtask.

              BC     MC    EXAM  RITE4QA
  |s| − |t|   4.68   0.44   6.17   9.11
  HDM         7.54   6.43  11.64  12.47
  BGH         8.52   7.47  12.59  13.58
  WO          8.32   7.26  12.82  13.49
  Ontology    7.53   6.42  11.64  12.47

Table 9: Examples of sentence pairs containing phrases with similar meanings.
  S: バジル、セージ、タイムなどフレッシュハーブの消費量はウナギのぼりである。 `The consumption of fresh herbs such as basil, sage, and thyme is rapidly increasing.'
  T: ハーブの消費が増えている。 `The consumption of herbs is increasing.'

  S: 今回の事件では、男子学生が「ハロウィーンの夜にやってやる」と知人に漏らすなど、決行日を選んだ上で、事件を起こしていたことが明らかになっている。 `In this incident, it is clear that the boy student chose the day to act in advance, saying to his acquaintance, "I am going to act on the Halloween night".'
  T: 男子学生は決行日を10月31日に選んだ。 `The boy student decided to act on October 31.'

Table 10: The confusion matrices in the EXAM subtask when either Temporal feature value is true.

(a) f_match = 1         (b) f_unmatch = 1
        Correct                 Correct
System   Y    N         System   Y    N
  Y     21    3           Y      0    1
  N      4    4           N      0   29

However, as described in Section 5, the EDO features are not effective in the other subtasks. One reason for their effectiveness in the MC subtask could be the simplicity of the syntactic differences observed in the MC data. As an index of this simplicity, the difference in the number of bunsetsus between S and T on the test data, denoted as |s| − |t|, is shown in the second row of Table 8. Note that these numbers are similar to those on the training data. The value of |s| − |t| for the MC data is nearly zero (0.44) and much smaller than that of the other data. Since our cost functions for insertion and deletion are simply set to one and the cost functions for substitution are less than or equal to one, our design of the cost functions prefers substitutions and tends to overlook possibly important insertions and deletions. However, if the number of nodes and the tree structures of S and T are similar, the edit distance computation with few insertions and deletions can be a natural alignment for exploiting the syntactic relations of the two sentences. On the other hand, the syntactic differences in the other data may be much more complicated than those in the MC data. This complexity of syntactic differences is also a possible explanation of why the LR models using the converted data, BC' and MC', do not perform well on the different subtasks.

One remark is that, although we designed four different cost functions, WO, BGH, HDM, and Ontology, the differences among the resulting edit distances are small, as shown in Table 8, and we could not observe the superiority of any cost function in the final entailment classifications.

We compared the two resources used for the semantic-similarity-based cost functions: the ontology constructed from Wikipedia and Bunrui-goi-hyo. Although the ontology has a larger vocabulary than Bunrui-goi-hyo, there was little difference in their contributions to the final results. One of the reasons is mismatches between the words in the resources and the results of the morphological analysis. For example, the ontology contained グスタフ・シュトレーゼマン (`Gustav Stresemann'), but our system split it into three morphemes: グスタフ (`Gustav'), ・, and シュトレーゼマン (`Stresemann'). To increase the coverage of the detection of similarities, it is important to use the same unit for the resource construction and the morphological analysis, or to adjust the results of the morphological analysis to the resource unit.

Our system with the two resources did not recognize some phrase relations with similar meanings. Table 9 exemplifies two sentence pairs which have phrases with similar meanings. In the former pair, the key to analyzing their relation is to recognize that ウナギのぼりである (`be rapidly increasing') is similar in meaning to 増えている (`be increasing'). The other pair suggests the need for the knowledge that Halloween is October 31. It is important to arrange resources constructed from various perspectives for the recognition of phrase-level correspondences.
It is important to arrange some resources The feature set `Temporal' worked very well in the EXAM subtask, due to the frequent appearance of the expressions of One remark is that, although we design four dierent cost the year, decade or century in the EXAM data. Table 10(a) functions, WO, BGH, HDM, and Ontology, the dierence of and Table 10(b) show the relations of the system outputs the resulting edit distances are small as shown in Table 8. and the correct tags when the temporal expressions matched And we could not observe superiority of any cost function together (fmatch in the nal entailment classications. did not overlap (funmatch = 1) and when the temporal expressions = 1), respectively. These results We compared two resources used for measuring seman- show that the mismatch of the temporal expressions will be a tic similarity based cost function, the ontology constructed strong hint for the non-entailment relation. This is because T from Wikipedia and Bunrui-goi-hyo. Although the ontology often has a larger vocabulary than Bunrui-goi-hyo, there was lit- with a wrong year, for example, The Constitution of the in the dataset with the `N' label includes statements tle dierence in their contribution to the nal results. One U.S. was established in the early 19th century, and the of the reasons is mismatches between the words in the re- corresponding S includes the correct year (1788 in this Table 11: Pearson correlation coefficient between the average accuracy on 5-fold CV using the training data and the performance measures using formal-run data based on 280 different feature sets. Test Data BC MC EXAM RITE4QA Training Data BC BC+MC' MC MC+BC' EXAM BC Perf. Measure AC AC AC AC AC AC TOP1 MMR5 AC BC+MC' TOP1 MMR5 Correlation 0.43 0.30 0.57 0.49 0.74 -0.34 -0.42 -0.44 0.23 -0.20 -0.31 fmatch = 1, more than half of them are correctly predicted as `N'. This 8th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, 2010. 
suggests the advantage of the learning method that handles [3] H. Kanayama, T. Nasukawa, and H. Watanabe. case). Another remark is that even in the cases of this kind of strong tendencies as a preference, instead of a Deeper sentiment analysis using machine translation constraint by a hand-tailored rule. technology. In In our feature sets, there is weak relationship between the accuracies on training data and formal-run performance in BC & RITE4QA subtasks. Although the good feature set Proceedings of the 20th International Conference on Computational Linguistics (COLING), pages 494500, 2004. [4] H. Kanayama, K. Torisawa, Y. Mitsuishi, and on CV also show the good formal-run performance on MC & J. Tsujii. A hybrid Japanese parser with hand-crafted EXAM subtasks, the best feature set on CV and formal-run grammar and statistics. In a correlation between the average accuracy on 5-fold CV Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 411417, 2000. using the training data and the performance measures us- [5] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region are dierent on BC & RITE4QA subtasks. Table 11 shows ing formal-run data based on 280 dierent feature sets. For example, the last column shows that there is a negative correlation between the average accuracy using the BC+MC' training data and the MMR5 score on the RITE4QA test data. The correlation values suggest that our feature representation of training data in BC & RITE4QA subtasks is not good enough to induce general classication rules or , at least, the classication rules for the specic test data. Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627650, 2008. [6] Y. Mehdad and B. Magnini. Tree edit distance for recognizing textual entailment: Estimating the cost of insertion. In PASCAL RTE-2 Challenge, 2006. [7] Y. Mehdad and B. Magnini. 
Optimizing textual entailment recognition using particle swarm Proceedings of the 2009 Workshop on Applied Textual Inference, pages 3643, 2009. optimization. In 7. CONCLUSION AND FUTURE WORK In this paper, we proposed a classication approach for NTCIR-9 RITE. In the experiments, we presented that tree edit distance and operations were eective feature for the MC subtask and temporal expression matching was eective [8] National Language Research Institute. Bunrui-goi-hyo (revised and enlarged edition), 1996. (In Japanese). [9] Y. Shibaki. Constructing large-scale general ontology from wikipedia. Master's thesis, Nagaoka University of Technology, Japan, 2011. feature for the EXAM subtask. One drawback of the tree edit distance approach is the [10] H. Shima, H. Kanayama, C. Lee, C. Lin, diculty of the design of the cost functions. Although we T. Mitamura, Y. Miyao, S. Shi, and K. Takeda. implemented four dierent cost functions, the dierences of Overview of NTCIR-9 RITE: Recognizing Inference in these edit distances were small and the edit distance based features were eective only on the MC subtask. The super- TExt. In NTCIR-9 Proceedings, 2011. [11] F. M. Zanzott, M. Pennacchiotti, and A. Romoschitti. vised learning of the edit cost function is one of the inter- A machine learning approach to textual entailment esting research directions [7]. recognition. In the EXAM subtask, the Temporal feature set worked to increase the accuracy, and the combined use of tempo- Natural Language Engineering, 15(4):551582, 2009. [12] K. Zhang and D. Shasha. Simple fast algorithms for ral and geological expressions is a interesting research line. the editing distance between trees and related However, it is far from the complete understanding of the problems. question and the world knowledge. 18(6):12451262, 1989. To make the system more general and robust, we need to seek ways to handle the semantics of the whole sentences, and that of the pair of the whole sentences. 8. 
REFERENCES [1] T. Akutsu. Tree edit distance problems: Algorithms IEICE Transactions on Information and Systems, and applications to bioinformatics. E93-D(2):208218, 2010. [2] F. Gey, R. Larson, N. Kando, J. Machado, and T. Sakai. NTCIR-GeoTime overview: Evaluating geographic and temporal search. In Proceedings of the SIAM Journal on Computing,
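The resource/morphology mismatch discussed above (グスタフ・シュトレーゼマン being split into three morphemes by the analyzer) could be mitigated by re-joining adjacent morphemes whose concatenation appears in the resource. The following greedy longest-match sketch is a hypothetical illustration, not the paper's implementation; the function name, the `vocab` set interface, and the `max_len` bound are assumptions:

```python
def match_resource_spans(morphs, vocab, max_len=5):
    """Greedily re-join adjacent morphemes whose concatenation is a resource
    entry, so that e.g. a person name split by the analyzer is matched as one
    unit. Longest match wins; unmatched morphemes pass through unchanged."""
    out, i = [], 0
    while i < len(morphs):
        # try the longest span first, down to spans of two morphemes
        for j in range(min(len(morphs), i + max_len), i + 1, -1):
            candidate = "".join(morphs[i:j])
            if candidate in vocab:
                out.append(candidate)
                i = j
                break
        else:
            out.append(morphs[i])
            i += 1
    return out
```

This adjusts the analyzer output toward the resource units, which is one of the two remedies suggested in the discussion (the other being to build the resource with the analyzer's own segmentation).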
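The Temporal pair features (fmatch, funmatch) discussed for the EXAM subtask can be illustrated as follows. This sketch is an assumption-laden simplification: it only extracts explicit four-digit years (Japanese 「1788年」 or bare tokens), whereas the paper's features also cover decades and centuries, and the regex and function names are illustrative:

```python
import re

# Four-digit years, either suffixed with 年 or as a bare digit run.
YEAR = re.compile(r"(\d{4})年|(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")

def years(text):
    """Set of year strings mentioned in the text (sketch: 4-digit years only)."""
    return {m.group(1) or m.group(2) for m in YEAR.finditer(text)}

def temporal_features(s, t):
    """fmatch = 1 iff S and T share a year; funmatch = 1 iff both mention
    years but none overlap. Both are 0 when either side mentions no year."""
    ys, yt = years(s), years(t)
    if not ys or not yt:
        return {"fmatch": 0, "funmatch": 0}
    shared = bool(ys & yt)
    return {"fmatch": int(shared), "funmatch": int(not shared)}
```

As Table 10(b) suggests, funmatch = 1 is a strong signal for the `N' label, but the classifier treats it as a learned preference rather than a hard rule.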
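The correlation analysis of Table 11 uses the standard sample Pearson coefficient between per-feature-set CV accuracies and formal-run scores. A self-contained sketch (the input vectors below are illustrative, not the paper's 280 measurements):

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative usage: one point per feature set,
# x = average 5-fold CV accuracy, y = formal-run score.
cv_accuracy = [0.61, 0.63, 0.64, 0.66]
formal_run = [0.59, 0.62, 0.61, 0.65]
r = pearson(cv_accuracy, formal_run)
```

A value near +1 would mean CV accuracy is a reliable model-selection signal; the negative values in the RITE4QA rows of Table 11 indicate the opposite for that subtask.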