Syntactic Difference Based Approach for NTCIR-9 RITE Task
Team ID: IBM
Yuta Tsuboi, Hiroshi Kanayama, Masaki Ohno (IBM Research – Tokyo)
Yuya Unno (Preferred Infrastructure)

• Submitted results for four Japanese RITE subtasks (BC, MC, EXAM and RITE4QA)
• Achieved the best scores in the MC and EXAM subtasks

© 2011 IBM Corporation

Overview of the approach
• A very difficult task due to the complex, real-world dataset
  – Impossible to write down hand-crafted rules → machine learning approach with various features
• Similarities and differences between <S> and <T>* are the key features for entailment determination
  – Matching on the surface alone is not sufficient
  – Calculation of the tree edit distance between the parsed trees
    – Alignment of the two sentences considering their syntactic structures
    – Estimation of similarity (a small edit operation cost means a similar pair)
    – Edit operations (insertion, deletion and replacement) as features for machine learning
* <S> and <T> denote an input sentence pair (<t1> and <t2> in the original file).

Techniques and resources for our machine learning approach
[Pipeline diagram] Input <S>,<T> → syntactic parsing → tree edit distance computation → pair feature extraction → features → training (logistic regression with cross-validation) on the expanded training corpus → classifier model → prediction → output labels (Y/N for BC; F/R/B/C/I for MC). Resources: a Japanese thesaurus and an ontology.

Tree edit distance – Concept
• Hypotheses
  – Two sentences (<S> and <T>) have syntactic similarities and differences
  – A pair of similar sentences has a high possibility of entailment
  – The differing parts can be clues for the entailment decision
• Solution
  – Parse the two sentences
  – Align the parse trees by computing the tree edit distance between them
• Examples
  – <S>学校に毎日行く</S> ('go to school every day') vs. <T>学校に行く</T> ('go to school'): delete the adverb 毎日 'everyday'
  – <S>大統領が殺された</S> ('the president was killed') vs. <T>熊が殺された</T> ('a bear was killed'): replace 大統領 'president' with 熊 'bear'
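The alignment idea above can be illustrated with a toy sketch. The actual system aligns parsed dependency trees; to keep the example short, this sketch aligns flat bunsetsu (chunk) sequences with a standard edit-distance backtrace and recovers the delete/replace operations shown in the examples. All names here are illustrative, not the authors' released code.

```python
def edit_ops(s, t, replace_cost=lambda a, b: 0.0 if a == b else 1.0):
    """Return (distance, operations) between two node sequences.

    Simplified sequence version of the tree alignment: unit cost for
    insertion/deletion, pluggable cost for replacement.
    """
    n, m = len(s), len(t)
    # d[i][j] = cost of editing s[:i] into t[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # i deletions
    for j in range(1, m + 1):
        d[0][j] = float(j)          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                  # delete s[i-1]
                d[i][j - 1] + 1.0,                                  # insert t[j-1]
                d[i - 1][j - 1] + replace_cost(s[i - 1], t[j - 1]),  # replace
            )
    # Backtrace to recover the operations, later used as ML features.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + replace_cost(s[i - 1], t[j - 1])):
            if s[i - 1] != t[j - 1]:
                ops.append(("replace", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1.0:
            ops.append(("delete", s[i - 1]))
            i -= 1
        else:
            ops.append(("insert", t[j - 1]))
            j -= 1
    return d[n][m], list(reversed(ops))
```

For the two slide examples this yields `("delete", "everyday")` and `("replace", "president", "bear")` respectively, each at distance 1.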
Tree Edit Distance – General implementation
• The edit distance δ is defined over three edit operations:
  – Replacement of node s with node t: cost γ(s, t)
  – Deletion of node s: cost γ(s, ε)
  – Insertion of node t: cost γ(ε, t)
• Edit distance computation: O(|S|²|T|²) time and space
• Our code for tree edit distance is available at https://github.com/unnonouno/tree-edit-distance (New BSD license)

Cost functions for tree edit distance
• Insertion / deletion cost: constant, γ(s, ε) = γ(ε, t) = 1
• Replacement cost: a smaller value for a more similar bunsetsu pair, mixing several similarity metrics:
  – Word overlap (WO): Jaccard distance using word overlap – the overlap ratio between the two morpheme sets (handling both content and functional words)
  – Ontology: semantic distance using an ontology – the inverse of the shortest path length between the two head content words (see the next slide)
  – BGH: semantic distance using a thesaurus* – the common depth of the two head content words in the thesaurus tree
  – Heuristic distance metrics (HDM): a similarity value considering parts of speech and the thesaurus above
* National Language Research Institute. Bunrui-goi-hyo (revised and enlarged edition), 1996. (In Japanese.)

Semantic Similarity and Resources
We defined two measures for semantic similarity, based on two complementary resources.
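As a rough sketch, the two measures (common depth in the thesaurus tree, inverse shortest-path length in the ontology) might be computed as follows. The data structures here (a thesaurus word as a category path, the ontology as an adjacency map) are hypothetical simplifications, not the actual resource formats.

```python
from collections import deque

def thesaurus_similarity(path_a, path_b):
    """BGH-style measure: common depth of two words' category paths in the
    thesaurus tree, e.g. human>sense>visual vs. human>sense>hearing -> 2."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth

def ontology_similarity(edges, a, b):
    """Ontology-style measure: inverse of the shortest IS-A path length
    between two head content words via BFS.  Returns 0.0 if disconnected;
    treating identical nodes as similarity 1.0 is an assumption."""
    if a == b:
        return 1.0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in edges.get(node, []):
            if nxt == b:
                return 1.0 / (dist + 1)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return 0.0
```

For instance, with Google+ and Foursquare both linked to the "social network service" node, the ontology measure gives 1/2 for that pair.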
• Manually generated resource: a Japanese thesaurus (分類語彙表, Bunrui-goi-hyo)
  – High coverage of basic Japanese expressions
  – A tree of categories, e.g. 人間 (human) > 感覚 (sense) > 視覚 (visual sense) / 聴覚 (hearing sense); 感情 (emotion)
  – Similarity measure: the common depth of the two head content words in the tree
• Automatically generated resource: an IS-A ontology generated from Wikipedia*
  – Up-to-date knowledge from the latest edition
  – Categories with instances, e.g. ソーシャル・ネットワーク・サービス (social network service) → Google+, Foursquare; ソーシャル・ブックマーク (social bookmark) → Delicious, Tumblr
  – Similarity measure: the inverse of the shortest path length in the ontology tree
* Y. Shibaki. Constructing large-scale general ontology from Wikipedia. Master's thesis, Nagaoka University of Technology, Japan, 2011.

Pair features (1) – Similarity and difference between S and T
• Represent a sentence pair with several features; train a logistic regression model on the annotated data
• Edit distance and operations (EDO)
  – Normalized edit distance: δ(S, T) / max(|S|, |T|), where 0 means the same sentence and 1 means totally different
  – Edit operations as features:
    – Insertion / deletion, e.g. Deletion: "everyday" (word); Deletion: ADVERB (POS)
    – Replacement, e.g.
"president"–"bear" (word pair); NOUN–NOUN (POS pair)
• Word overlapping (Word)
  – Overlap ratio: |ms ∩ mt| / |mt| (ms, mt: the morpheme sets of S and T)
  – Word pairs and POS pairs: all combinations of the head content words, e.g. (school, school), (go, school), (everyday, school), (school, go), (go, go), (everyday, go)

Pair features (2) – Ad-hoc strong clues
• Sentiment polarity matching (Sentiment)
  – Applied an existing sentiment detector: "It is excellent" → positive; "I don't like this" → negative
  – Sentiment orientation of the sentence pair: same, different, or opposite polarity – opposite polarity is a strong clue for non-entailment
• PAS fulfillment test (PAS)
  – Convert S and T into sets of predicate–argument structures
  – Test whether all predicate–argument structures in T are covered by those in S – fulfillment is a strong clue for entailment (the slide shows examples of fulfillment)

Pair features (3) – Designed for the EXAM subtask
• Temporal matching (Temporal)
  – Many sentence pairs in the EXAM data include temporal expressions
  – Feature: whether the temporal expressions in S and T overlap
• Example of a temporal match (a strong clue for entailment):
  – <S>シュマルカルデン同盟とは、1531年に、…</S> 'The Schmalkaldic League is … in 1531.' – 1531 → [1531, 1531]
  – <T>16世紀に、ドイツでは、シュマルカルデン同盟…</T> 'In 16th-century Germany, the Schmalkaldic League …' – 16th century → [1501, 1600]
• Example of a temporal mismatch (a strong clue for non-entailment):
  – <S>1997年まで東南アジアバブルであったが、…</S> 'Until 1997 there was an economic bubble in South-East Asia, …' – 1997 → [1997, 1997]
  – <T>1990年代初めに起こった日本でのバブル経済の崩壊が、…</T> 'The collapse of Japan's bubble economy, which happened in the early 1990s, …' – early 1990s → [1990, 1994]

Expansion of the Training Data
• Convert the training data for the MC subtask into additional training data for the BC subtask, and vice versa
  – e.g. a Forward entailment label (F) between S and T corresponds to true entailment (Y) for S→T and false entailment (N) for T→S
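The MC→BC direction of this conversion can be sketched as below. Only the F rule is stated explicitly in the slides; the mappings for R, B, C and I are natural extensions and should be read as assumptions, and the ambiguous BC→MC' direction (a Y could be either F or B) is not covered here.

```python
# Hypothetical sketch of the MC -> BC training-data expansion.
# MC labels: F = forward, R = reverse, B = bidirectional entailment,
#            C = contradiction, I = independence.
MC_TO_BC = {
    "F": ("Y", "N"),  # S entails T, but not vice versa (stated rule)
    "R": ("N", "Y"),  # reverse entailment (assumed analogue)
    "B": ("Y", "Y"),  # bidirectional entailment (assumed analogue)
    "C": ("N", "N"),  # contradiction: no entailment either way (assumed)
    "I": ("N", "N"),  # independence: no entailment either way (assumed)
}

def expand_mc_pair(s, t, mc_label):
    """Turn one MC-labelled pair into two BC-labelled training pairs,
    one per direction."""
    st_label, ts_label = MC_TO_BC[mc_label]
    return [(s, t, st_label), (t, s, ts_label)]
```

Under this reading, each MC pair yields two BC pairs, consistent with the 880 BC pairs derived from the MC data; the reverse direction introduces label ambiguity, which the slides handle with marginal log-likelihood maximization.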
• Label conversion rule (figure)*
  * An F,B over an arrow denotes either Forward (F) or Bidirectional (B) entailment between S and T (label ambiguity).
• Enhanced data
  – BC+MC' data: 500 + 880 pairs
  – MC+BC' data: 500 + 440 pairs
  (To handle label ambiguities, we train the logistic regression by marginal log-likelihood maximization.)

Results – BC subtask (CV: cross-validation accuracy on the training data; AC: accuracy in the formal run)

| Run | Cost function | Features | Training | CV | AC |
| w/o edit distance | None | Word + Sentiment + PAS + Temporal | BC | 52.8 | 52.0 |
| IBM BC1 | HDM | EDO + Word + Sentiment + PAS | BC | 54.8 | 52.2 |
| IBM BC2 | HDM + BGH | EDO + Word + Sentiment + PAS | BC | 54.0 | 52.6 |
| IBM BC3 | HDM + BGH + WO | EDO (POS fine) + Word | BC + MC' | 64.1 | 47.0 |
| Oracle* | BGH | EDO + Word + Sentiment + PAS | BC | 51.8 | 56.0 |
* Oracle: the best feature set in the formal run.

• Positive performance gain with the edit distance method
• Very low correlation between CV and AC → the results are unpredictable
• BC+MC' increases CV by 8%–13%, but has no effect on AC

Results – MC subtask

| Run | Cost function | Features | Training | CV | AC |
| w/o edit distance | None | Word + Sentiment + PAS + Temporal | MC | 33.6 | 35.9 |
| IBM MC1 | HDM | EDO + Word + Sentiment + PAS | MC + BC' | 46.8 | 43.6 |
| IBM MC2 | HDM + BGH + WO | EDO + Word | MC | 50.2 | 50.2 |
| IBM MC3 | HDM + BGH + WO | EDO (POS fine) + Word | MC + BC' | 51.3 | 44.5 |
| Oracle | HDM + Ont + WO | EDO + Word + Sentiment | MC | 51.1 | 51.6 |

• Achieved high accuracy in the 5-way classification
• EDO features increased the accuracy by 10%
• The other pair features tend not to work well
• The extended development data (MC+BC') was not effective

Results – EXAM subtask
| Run | Cost function | Features | Training | CV | AC |
| w/o edit distance | None | Word + Sentiment + PAS + Temporal | EXAM | 67.9 | 69.5 |
| – | HDM | EDO + Word + Sentiment + PAS | EXAM | 63.7 | 67.6 |
| IBM EX1 | HDM | EDO + Word + Sentiment + PAS + Temporal | EXAM | 69.1 | 72.2 |
| IBM EX2 | Ontology | EDO + Word + Sentiment + PAS + Temporal | EXAM | 62.5 | 67.4 |
| IBM EX3 | HDM + BGH + WO + Ontology | EDO (POS fine) + Word + Sentiment + PAS + Temporal | EXAM | 61.5 | 58.4 |
| Oracle | HDM | EDO + Word + Sentiment + PAS + Temporal | EXAM | 68.3 | 72.6 |

• Relatively small contribution of the edit distance
• Temporal increased the accuracy by 5%
• High correlation between CV and AC

Summary
• Achieved good performance in the MC and EXAM subtasks
  – Machine learning approach with various features
  – Tree edit operations worked as key features (especially in the MC task)
  – Use of a thesaurus and an ontology as complementary resources
• Performance is unpredictable – the model is still immature
• No special treatment for the 5-way classification in the MC task – more observation is needed

Backup

Details of MC results – confusion matrix (rows: correct label; columns: system output)

|   | F  | R  | B  | C  | I  |
| F | 87 | 6  | 21 | 30 | 40 |
| R | 7  | 89 | 20 | 9  | 16 |
| B | 7  | 12 | 30 | 16 | 6  |
| C | 4  | 0  | 1  | 5  | 4  |
| I | 5  | 3  | 3  | 5  | 14 |

Results – RITE4QA subtask

| Run | Cost function | Features | Training | AC | Top1 | MRR5 |
| w/o edit distance | None | Word + Sentiment + PAS + Temporal | BC | 34.5 | 5.5 | 19.8 |
| IBM R4QA1 | HDM & Ont & WO | EDO (POS fine) + Word + Sentiment + PAS | BC | 33.3 | 11.3 | 23.3 |
| IBM R4QA2 | HDM & BGH | EDO | BC | 31.6 | 9.1 | 21.7 |
| IBM R4QA3 | HDM & Ont & WO | EDO (POS fine) + Word + Sentiment + PAS + Temporal | BC + MC' | 40.1 | 8.7 | 22.2 |
| Oracle | None | Word + Sentiment + PAS + Temporal | BC + MC' | 63.5 | 18.1 | 29.0 |