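I have a dataframe of paragraphs, which I have split (or can split) into word tokens and sentence tokens, and I want to find every noun phrase that follows any occurrence of the words "contribution" or "donation".
Or really any form of those words, so: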
"Contributions are welcome to be made to the charity of your choice."
---> would return: "the charity of your choice"
and
"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
---> would return: "ABC Foundation"
I have created a regex that captures the correct phrase about 90% of the time... see below:
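import nltk
from nltk.text import TokenSearcher
# The function name is a placeholder; the original snippet showed only the body.
def find_donations(x):
    text = nltk.Text(nltk.word_tokenize(x))
    donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
    donation = [' '.join(tokens) for tokens in donation]
    return donation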
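I would like to clean up that regex to get rid of the "{,15}" requirement, because it is missing some of the values I need. However, I am not comfortable enough with "greedy" expressions to get it working properly.
So this passage: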
While she lived a full life , had many achievements and made many
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName
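is returning "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" appears well over 15 words later.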
Answer 0 (score: 1)
Example. If it works and does what you need, let me know and I will explain my regex.
Answer 1 (score: 1)
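You seem to be struggling with how to restrict your search to a single sentence. So just use NLTK to break the text into sentences (which it can do far better than just looking at periods), and your problem disappears.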
import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    # Lazy `.*?` stops at the first "to" after the keyword; matching is case-sensitive.
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
For further work, I would also recommend a better tool than nltk's Text class, which is intended for simple interactive exploration. If you want to do more with your texts, nltk's corpus readers are your friend.
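As a rough illustration of the per-sentence approach on the sentences from the question, here is a minimal sketch. The pattern and the name `find_recipients` are my own assumptions, not part of the answer above: the pattern is case-insensitive so that "Contributions" matches, and it uses a greedy `.*` so the capture starts after the last "to" in the sentence rather than the first. The sentences are assumed to be split already, for example by `nltk.sent_tokenize`.

```python
import re

# Hypothetical pattern for illustration: case-insensitive, and the greedy `.*`
# makes the match backtrack to the LAST standalone "to" in the sentence.
PATTERN = re.compile(r"\b(?:contrib|donat)\w*.*\bto\b\s*([^.,;]+)", re.IGNORECASE)

def find_recipients(sentences):
    """Collect the text after the last "to" in each sentence mentioning a donation."""
    recipients = []
    for sent in sentences:
        m = PATTERN.search(sent)
        if m:
            recipients.append(m.group(1).strip())
    return recipients

# Sentences are assumed to be pre-split, e.g. by nltk.sent_tokenize(x).
sentences = [
    "Contributions are welcome to be made to the charity of your choice.",
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
    "While she lived a full life , had many achievements and made many contributions , "
    "FirstName is remembered by most for her cheerful smile .",
    "FirstName always cherished her annual visit home for Thanksgiving to visit brother FirstName LastName",
]
print(find_recipients(sentences))
# ['the charity of your choice', 'ABC Foundation']
```

Because each sentence is searched on its own, the "to visit brother" sentence is skipped: it never mentions a contribution or donation. Taking the last "to" is a guess that suits these examples; a sentence with a later, unrelated "to" would still need a real noun-phrase chunker.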