Question

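I have a data frame with paragraphs of text, which I have (*can) split into word tokens and sentence tokens, and I want to find all noun phrases that follow any occurrence of the phrases "contributions" or "donations".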

Or really any form of those words, so:

"Contributions are welcome to be made to the charity of your choice." 

---> would return: "the charity of your choice"

and

"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"

---> would return: "ABC Foundation"

I have created a regex that captures the correct phrase about 90% of the time... see below:

text = nltk.Text(nltk.word_tokenize(x))
donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
donation = [' '.join(tokens) for tokens in donation]
return donation

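I would like to clean up that regex to get rid of the "{,15}" requirement, since it is missing some of the values I need. However, I'm not comfortable enough with "greedy" expressions to get it working properly.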

So this passage:

While she lived a full life , had many achievements and made many 
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName

is returning "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" comes well after 15 words.

Answer 1


If the example works and does what you need, let me know and I'll explain my regex.

Answer 2

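You seem to be struggling with how to restrict your search criteria to a single sentence. So just use NLTK to break your text into sentences (which it can do far better than just looking at periods), and your problem disappears.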

import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    # skip lazily from the trigger word to the word "to", then capture up to . , or ;
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())

For further work, I'd also recommend using a better tool than `Text`, which is intended for simple interactive exploration. If you want to do more with your text, NLTK has tools better suited to that than `Text`.
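To make the per-sentence idea easy to try without installing anything, here is a minimal sketch using only the standard library. The names `naive_sent_split`, `PATTERN` and `find_recipients` are made up for this illustration, and `naive_sent_split` is only a crude stand-in for `nltk.sent_tokenize` (it splits after `.`, `!` or `?` and will trip over abbreviations). I have also assumed case-insensitive matching so that a sentence-initial "Donations" is picked up, which the regex above does not do.

```python
import re

def naive_sent_split(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Trigger word (contrib.../donat...), then lazily up to the first word "to",
# then capture everything up to the next period, comma or semicolon.
PATTERN = re.compile(r"\b(?:contrib|donat)\w*.*?\bto\b([^.,;]*)", re.IGNORECASE)

def find_recipients(text, split=naive_sent_split):
    """Return the captured phrase from each sentence that has a match."""
    recipients = []
    for sent in split(text):
        m = PATTERN.search(sent)
        if m and m.group(1).strip():
            recipients.append(m.group(1).strip())
    return recipients

sample = (
    "While she lived a full life , had many achievements and made many "
    "contributions , FirstName is remembered by most for her cheerful smile , "
    "colorful track suits , and beautiful necklaces hand made by daughter FirstName . "
    "FirstName always cherished her annual visit home for Thanksgiving to visit "
    "brother FirstName LastName . "
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation."
)

# Searching sentence by sentence: the first sentence has no "to", so it is skipped.
print(find_recipients(sample))  # ['ABC Foundation']

# Treating the whole text as one "sentence" reproduces the false positive.
print(find_recipients(sample, split=lambda t: [t]))  # ['visit brother FirstName LastName']
```

If NLTK is available, passing `split=nltk.sent_tokenize` should slot in directly. One limitation to be aware of: because `.*?` is lazy, it stops at the first "to", so "Contributions are welcome to be made to the charity of your choice." yields "be made to the charity of your choice" rather than just the recipient, and would need further refinement.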

Using regex to find all noun phrases in a paragraph following the occurrence of a specific phrase
