- Authors: 陳信希; 顏世傑
- Author affiliation: Department of Computer Science and Information Engineering, National Taiwan University
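- Chinese abstract: Research on parsing Chinese sentences is already widespread, but very long sentences have received little attention. This paper proposes a parsing strategy for very long Chinese sentences based on the characteristics of Chinese. The main reasons very long sentences are hard to parse are: (1) punctuation marks are used irregularly; (2) phrase types cannot be determined before parsing; (3) the ways these phrases combine are very complex; (4) empty categories are used heavily; (5) parse trees are not easy to construct. To handle a very long Chinese sentence, we first divide the whole sentence into smaller segments according to its punctuation marks and parse each segment independently. Within each segment, we try to find a maximal projection node. Next, using linguistic knowledge such as punctuation marks, segment types, linking elements, and topic chains, we combine the independently parsed segments into a complete parse tree. Ambiguity can give a sentence several parse trees; misused punctuation can also split a sentence into multiple independent parse trees, each forming a complete unit. We implemented the parsing system in Prolog and C on a Sun workstation. The core of the system is written in Prolog and built on the graph unification model; dictionary construction, window control, and parse tree display are handled by C programs.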
- English abstract: Parsers for Mandarin Chinese sentences have been proposed by many researchers. However, the parsing of very long sentences has seldom been addressed. The difficulties in analyzing long sentences are: (1) punctuation marks are used at random; (2) the phrasal types between punctuation marks are unclear before parsing; (3) the association of these phrases is very complex; (4) several constituents are often omitted in long sentences; (5) it is not easy to construct the parsing trees. This paper considers the specific features of Chinese sentences and proposes a new parsing system for Mandarin Chinese. First, a sentence is separated into several segments according to the punctuation marks in the sentence. The segments are analyzed independently. We try to find a maximal projection node to cover each segment. Next, linguistic knowledge such as punctuation marks, segment categories, linking elements, and topic chains is applied in a predetermined order to link these segments. Finally, the parsing trees of these segments are composed. Because of ambiguities, more than one parsing tree may be generated for a sentence. The random use of punctuation marks may also result in several independent parsing trees, each corresponding to a part of the sentence. The parsing system, which runs on a Sun Sparc Station, has been implemented in Prolog and C. The Prolog code, based on the graph unification model, forms the kernel of the system. The C programs support dictionary maintenance, window controls, and the display of the parsing trees.
- Chinese keywords: Chinese language processing; empty category; graph unification; parsing algorithm
- English keywords: --
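
The first step described in the abstract, separating a sentence into segments at its punctuation marks, might be sketched roughly as below. This is only an illustration in Python, not the authors' Prolog and C implementation: the function name `split_segments` and the punctuation set `PUNCTUATION` are assumptions made for the example, and the record does not state which marks the actual system treats as segment boundaries.

```python
import re

# Assumed set of segment-delimiting punctuation marks (not taken from the paper).
PUNCTUATION = ",。;:!?、,;:!?"


def split_segments(sentence):
    """Split a sentence into (segment, punctuation) pairs.

    Each segment keeps the mark that closed it, since the abstract lists
    punctuation marks among the knowledge later used to link segments.
    """
    pattern = "([" + re.escape(PUNCTUATION) + "])"
    # With a capturing group, re.split alternates text and delimiter.
    parts = re.split(pattern, sentence)
    segments = []
    for i in range(0, len(parts), 2):
        text = parts[i].strip()
        mark = parts[i + 1] if i + 1 < len(parts) else ""
        if text or mark:
            segments.append((text, mark))
    return segments
```

Each returned pair could then be handed to a separate parser call, with the stored mark available when the independently parsed segments are linked. How the real system searches for a maximal projection node or orders its linking knowledge is not specified in this record, so the sketch stops at segmentation.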