Using a regex to find all noun phrases in a paragraph that follow a specific phrase

Asked: 2015-12-08 20:35:40

Tags: python regex nlp nltk findall

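I have a data frame of paragraphs, which I have (*can) split into word tokens and sentence tokens, and I want to find all the noun phrases that follow any occurrence of the phrase "contribution" or "donation".

Or really any form of those words, so: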

"Contributions are welcome to be made to the charity of your choice." 

---> would return: "the charity of your choice"

"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"

---> would return: "ABC Foundation"

I have created a regex that captures the right phrase about 90% of the time... see below:

text = nltk.Text(nltk.word_tokenize(x))
donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
donation = [' '.join(tokens) for tokens in donation]
return donation

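I would like to clean up that regex and get rid of the "{,15}" requirement, because it causes me to miss some of the values I need. However, I am not very comfortable with "greedy" expressions and cannot get it to work properly.

So this passage: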

While she lived a full life , had many achievements and made many 
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName

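is returning "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" appears well over 15 words later.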

2 Answers:

Answer 0: (score: 1)


Example: if it works and does what you need, let me know and I will explain my regex.

Answer 1: (score: 1)

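You seem to be struggling with how to restrict your search criteria to a single sentence. So just use NLTK to break your text into sentences (which it can do better than simply looking for periods), and your problem disappears.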

import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    # A "contrib..."/"donat..." word, then the nearest "to", then everything up to the next '.', ',' or ';'
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())

For further work, I would also recommend a better tool than nltk's `Text`, which is intended for simple interactive exploration, if you want to do more with your text.
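To see the sentence-by-sentence idea without installing NLTK data, here is a minimal, standard-library-only sketch. It is an illustration under stated assumptions, not the answerer's exact code: `find_recipients` and `SENT_SPLIT` are hypothetical names, the regex-based splitter is only a crude stand-in for `nltk.sent_tokenize`, and `re.IGNORECASE` is an addition so that a capitalized "Contributions" or "Donations" is also found.

```python
import re

# Crude stand-in for nltk.sent_tokenize (an assumption for this sketch):
# split wherever '.', '!' or '?' is followed by whitespace.
# NLTK's tokenizer copes with abbreviations and similar cases far better.
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Same pattern as in the answer above; IGNORECASE is added here (assumption).
# It matches a "contrib..."/"donat..." word, skips lazily to the nearest
# standalone "to", then captures everything up to the next '.', ',' or ';'.
PATTERN = re.compile(r"\b(contrib|donat).*?\bto\b([^.,;]*)", re.IGNORECASE)


def find_recipients(text):
    """Return the text captured after 'to', searching one sentence at a time."""
    recipients = []
    for sent in SENT_SPLIT.split(text):
        m = PATTERN.search(sent)
        if m:
            recipients.append(m.group(2).strip())
    return recipients


if __name__ == "__main__":
    sample = (
        "Please note that donations, in honor of Firstname Lastname, "
        "can be made to ABC Foundation. She loved to travel."
    )
    print(find_recipients(sample))  # ['ABC Foundation']
```

Because each sentence is searched on its own, a "to" in a later sentence can no longer be paired with "contributions" from an earlier one, so no word-count limit such as `{,15}` is needed. One caveat: the lazy `.*?` stops at the first standalone "to", so for "Contributions are welcome to be made to the charity of your choice." the captured text is "be made to the charity of your choice" rather than only the noun phrase.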