How to detect the difference between ' used in contractions and as quotation marks

Time: 2012-05-09 21:38:41

Tags: ruby regex parsing

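I'm trying to parse blocks of text and need a way to tell apart apostrophes used in different contexts: possessives and contractions in one group, quotation marks in the other.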

e.g.

"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]

"He said 'hello there'" -> ["He", "said", "'hello there'"]

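Checking for whitespace on either side won't help, because things like "'ello" and "cars'" would each parse as one end of a quotation, just like a matching pair of apostrophes. I get the feeling there is no way to do this short of an extremely complicated NLP solution, and that I'll just have to ignore any apostrophe that doesn't occur mid-word, which would be unfortunate.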
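The fallback mentioned above (keep only apostrophes that sit between two word characters) could look roughly like the following Ruby sketch. The method name `words_with_inner_apostrophes` is made up for illustration, and the approach deliberately loses both the possessive in "cars'" and the quotation marks.

```ruby
# Hypothetical helper: split text into words, keeping an apostrophe only
# when it has a word character on both sides (as in "I'm" or "don't").
def words_with_inner_apostrophes(text)
  text.scan(/\w+(?:'\w+)*/)
end

p words_with_inner_apostrophes("I'm the cars' owner")
p words_with_inner_apostrophes("He said 'hello there'")
```

The trailing apostrophe of "cars'" and the quotes around "hello there" are both dropped, which is exactly the trade-off described in the question.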

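Edit:

Since writing this I have realized it is impossible. Any regex-based parser would have to parse:

  'ello there my mates' dog

in two different ways, and could only choose between them by understanding the rest of the sentence. I guess I'll go with the inelegant solution of ignoring the least likely case and hoping it is rare enough to cause only occasional anomalies.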

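3 answers:

Answer 0: (score: 0)

Hmm, I'm afraid this won't be easy. Here is a regex that sort of works, though alas only for things like "I'm" and "I've", because the lookbehind only rules out an apostrophe directly preceded by "I":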

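>> s1 = "I'm the cars' owner"    # assumed: the first example from the question
>> s2 = "He said 'hello there'"  # assumed: the second example from the question
>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"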

If you play with it a bit more you could eliminate some other common contractions, which may still be better than nothing.
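One possible direction, sketched here as an assumption rather than as what the answer's author had in mind, is to widen the lookbehind from "I" to any word character and to allow mid-word apostrophes inside the quoted span. The names `QUOTED` and `quoted_spans` are invented for this example.

```ruby
# Opening quote: not preceded by a word character.
# Body: anything but an apostrophe, or an apostrophe flanked by word characters.
# Closing quote: not followed by a word character.
QUOTED = /(?<!\w)'((?:[^']|(?<=\w)'(?=\w))+)'(?!\w)/

def quoted_spans(text)
  text.scan(QUOTED).flatten
end

p quoted_spans("I'm the cars' owner")
p quoted_spans("He said 'I'm here'")
p quoted_spans("'ello there my mates' dog")
```

The last call still reports a quotation, so the ambiguity raised in the question's edit remains; a plural possessive inside a real quotation would also close the span too early.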

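Answer 1: (score: 0)

Some rules to consider:

  • A quotation starts with an apostrophe that is preceded by a whitespace character, or by nothing at all.
  • A quotation ends with an apostrophe that is followed by punctuation or a whitespace character.
  • Some words can look like the end of a quotation, e.g. peoples'
  • Quote-delimiting apostrophes never have a letter directly both before and after them.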
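As a rough illustration of these rules, the Ruby sketch below labels every apostrophe by looking only at its two neighbouring characters. The method name and the labels (`:inner`, `:maybe_open`, `:maybe_close`, `:unknown`) are invented here; the "maybe" prefix reflects that words such as peoples' satisfy the closing rule without ending a quotation.

```ruby
# Label each apostrophe in text as [index, kind] using the rules listed above.
def classify_apostrophes(text)
  labels = []
  text.each_char.with_index do |ch, i|
    next unless ch == "'"
    before = i.zero? ? "" : text[i - 1]
    after  = text[i + 1].to_s # "" when the apostrophe is the last character
    kind =
      if before.match?(/[[:alpha:]]/) && after.match?(/[[:alpha:]]/)
        :inner        # letters on both sides: never a quote delimiter
      elsif before.empty? || before.match?(/\s/)
        :maybe_open   # start of text or whitespace before it
      elsif after.empty? || after.match?(/[\s[:punct:]]/)
        :maybe_close  # end of text, whitespace or punctuation after it
      else
        :unknown
      end
    labels << [i, kind]
  end
  labels
end

p classify_apostrophes("He said 'hello there'")
p classify_apostrophes("I'm the cars' owner")
```

In the second call the apostrophe of "cars'" comes back as `:maybe_close`, which is the false positive the third rule warns about.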

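Answer 2: (score: 0)

Use a very simple two-pass process.

In pass 1 of 2, start with this regular expression to break the text into alternating segments of word and non-word characters.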

/(\w+)|(\W+)/gi

Store the matches in a list like this (I'm using AS3-style pseudocode, since I don't work with Ruby):

class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) ); //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
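Since the question is tagged ruby, pass 1 might be sketched in Ruby as follows. This is an approximate port, not the answer author's code: `MatchedWord` mirrors the AS3 class above, while `split_words` and the snake_case field names are choices made for this example.

```ruby
# Mirrors the AS3 MatchedWord class: text, start offset, and two flags.
MatchedWord = Struct.new(:text, :char_index, :is_word, :is_contraction)

# Pass 1: split text into alternating word / non-word segments.
def split_words(text)
  matches = []
  text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match # MatchData for the segment just found
    matches << MatchedWord.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  matches
end

split_words("I'd go").each { |w| p w.to_a }
```

`String#scan` already walks the string from left to right, so there is no counterpart to `lastIndex` to manage.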

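In pass 2 of 2, iterate over the list of matches to find contractions by checking whether each (trimmed, non-word) match ENDS with an apostrophe. If it does, check the next adjacent (word) match to see whether it is one of only 8 common contraction endings. Across all the two-part contractions I could think of, there are only 8 common endings: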

d
l
ll
m
re
s
t
ve

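Once you've identified such a pair of matches, (non-word)="'" and (word)="d", you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.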
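For the simple case just described (a lone apostrophe followed by one of the 8 endings), a minimal Ruby sketch could look like this. It works on plain strings rather than match objects, and `ENDINGS` and `merge_simple_contractions` are names made up for the example.

```ruby
ENDINGS = %w[d l ll m re s t ve].freeze

# Join word + "'" + ending into one token; leave every other token alone.
def merge_simple_contractions(text)
  tokens = text.scan(/\w+|\W+/)
  out = []
  i = 0
  while i < tokens.length
    if tokens[i] == "'" && ENDINGS.include?(tokens[i + 1]) && out.last.to_s.match?(/\w\z/)
      out[-1] = out.last + tokens[i] + tokens[i + 1]
      i += 2
    else
      out << tokens[i]
      i += 1
    end
  end
  out
end

p merge_simple_contractions("I'd say they're here")
p merge_simple_contractions("He said 'hello there'")
```

Requiring the non-word token to be exactly "'" keeps an opening quote such as " '" from being glued onto the previous word.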

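With that process understood, one modification you will have to make is to extend the list of contraction endings to cover contractions that start with an apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look more closely at the apostrophe match to see whether it contains other non-word characters before the apostrophe (this is why it matters that it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match; if it only ENDS with an apostrophe, strip the apostrophe off and merge that with the following match. Likewise, conditions that include the preceding match should first check that the (trimmed, non-word) match ending with an apostrophe EQUALS an apostrophe, so that no extra non-word characters are included by accident.

Another modification you may need is to extend the list of 8 endings to cover endings that are whole words, as in "g'day" and "g'night". Again, it is a simple change involving a conditional check on the preceding (word) match: if it is "g", you include it.

This process should catch the majority of contractions, and it is flexible enough to take in new ones as you think of them.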

The data structure would look like this.

Condition(Ending, PreCondition)

where the precondition is

"*", "!", or "<exact string>"

The final list of conditions would look like this:

new Condition("d","*") //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
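In Ruby, the same table and the precondition check might be written as below. This is a sketch under stated assumptions: `Condition`, `CONDITIONS`, `contraction_action` and the returned symbols are all names introduced for illustration.

```ruby
Condition = Struct.new(:ending, :precondition)

CONDITIONS = [
  ["d", "*"], ["l", "*"], ["ll", "*"], ["m", "*"],
  ["re", "*"], ["s", "*"], ["t", "*"], ["ve", "*"],
  ["twas", "!"], ["tis", "!"],   # contraction starts with the apostrophe
  ["day", "g"], ["night", "g"]   # preceding word must be exactly "g"
].map { |ending, pre| Condition.new(ending, pre) }.freeze

# Decide what to do with "<prev_word>'<ending>"; prev_word may be nil.
def contraction_action(prev_word, ending)
  cond = CONDITIONS.find { |c| c.ending == ending }
  return nil unless cond
  case cond.precondition
  when "*" then prev_word ? :include_previous : nil
  when "!" then :exclude_previous
  else prev_word == cond.precondition ? :include_previous : nil
  end
end

p contraction_action("I", "d")
p contraction_action(nil, "twas")
p contraction_action("to", "day")
```

Keeping the rules as data means a new contraction is one more row in the table, with no change to the matching logic.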

If you process those conditions as I've explained, that should cover all of these 86 contractions (and more):

(the full list is not legible in this copy; entries that can still be made out include 'twas, everybody's, g'day and g'night)

On a side note, don't forget slang contractions that don't use apostrophes, such as "gotta" > "got to" and "gonna" > "going to".
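Apostrophe-free slang like this can be handled with a plain lookup table. The sketch below assumes only the two words named above; `SLANG` and `expand_slang` are hypothetical names, and the replacement is always lowercase, so capitalization at the start of a sentence is lost.

```ruby
SLANG = { "gotta" => "got to", "gonna" => "going to" }.freeze
SLANG_PATTERN = /\b(?:#{SLANG.keys.map { |k| Regexp.escape(k) }.join('|')})\b/i

# Replace each known slang word with its expanded form.
def expand_slang(text)
  text.gsub(SLANG_PATTERN) { |word| SLANG[word.downcase] }
end

puts expand_slang("I gotta go, we're gonna win")
```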

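Here is the final AS3 code. Overall, you're looking at fewer than 50 lines of code to parse the text into alternating groups of word and non-word characters and to identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below whenever a contraction is identified.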

//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
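                            matched_words.splice( i, 2 );
                            i--; //two items were removed, so step back and let the loop re-examine the match that now sits at this index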
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && m_prev.text == pre_condition)
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
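                            matched_words.splice( i, 2 );
                            i--; //two items were removed, so step back and let the loop re-examine the match that now sits at this index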
                        }
                        break;
                }
            }
        }
    }
}
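
For readers working in Ruby, the whole two-pass process could be ported roughly as follows. This is a sketch, not a line-for-line translation: `Token`, `PRECONDITIONS`, `tokenize` and `merge_contractions` are names chosen here, and unlike the AS3 version it compares the untrimmed non-word text with "'", so an apostrophe separated from the word by whitespace is never merged into it.

```ruby
Token = Struct.new(:text, :char_index, :is_word, :is_contraction)

# ending => precondition ("*" any previous word, "!" no previous word, else exact word)
PRECONDITIONS = {
  "d" => "*", "l" => "*", "ll" => "*", "m" => "*", "re" => "*",
  "s" => "*", "t" => "*", "ve" => "*",
  "twas" => "!", "tis" => "!", "day" => "g", "night" => "g"
}.freeze

def tokenize(text)
  tokens = []
  text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match
    tokens << Token.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  tokens
end

def merge_contractions(text)
  tokens = tokenize(text)
  i = 0
  while i < tokens.length - 1
    tok = tokens[i]
    nxt = tokens[i + 1]
    pre = PRECONDITIONS[nxt.text]
    if tok.is_word || !tok.text.end_with?("'") || pre.nil?
      i += 1
      next
    end
    bare = tok.text == "'"
    prev = i > 0 ? tokens[i - 1] : nil
    if pre == "!"
      if bare # the apostrophe token itself becomes the contraction
        tok.text += nxt.text
        tok.is_word = true
        tok.is_contraction = true
        tokens.delete_at(i + 1)
      else # move the trailing apostrophe onto the next word
        cut = tok.text.rindex("'")
        nxt.text = tok.text[cut..-1] + nxt.text
        nxt.char_index = tok.char_index + cut
        nxt.is_contraction = true
        tok.text = tok.text[0...cut]
      end
      i += 1
    elsif bare && prev && prev.is_word && (pre == "*" || prev.text == pre)
      prev.text += tok.text + nxt.text
      prev.is_contraction = true
      tokens.slice!(i, 2) # index i now holds the token after the contraction
    else
      i += 1
    end
  end
  tokens
end

p merge_contractions("I'd say 'twas g'day").map(&:text)
```

Not advancing `i` after a merge lets a second ending attach to the same word, so "I'd've" ends up as a single token.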