I need to analyze some text for machine learning purposes. A data scientist I know suggested I use pattern.en for my project.
I will give my program a keyword (example: pizza), and it has to extract some "trends" from several texts I give it. (Example: I feed it some articles about peanut butter on pizza, so the program would discover that peanut butter is a growing trend.)
So first, I have to "clean" the text. I know pattern.en can identify words as nouns, verbs, adverbs, etc. I want to remove all the determiners, articles, and other "meaningless" words for my analysis, but I don't know how to do it. I tried parse()
so I can get:
s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"
parsedS = parse(s)
print(parsedS)
Output:
Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?
I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually ,/,/, did/VBD/do not/RB/not sleep/VB/sleep enough/RB/enough .../:/...
That/DT/that is/VBZ/be bad/JJ/bad for/IN/for work/NN/work ,/,/, definitely/RB/definitely
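Each token in this output is a slash-delimited triple: word/TAG/lemma (here parse() appears to have been called with lemmata enabled). A tiny helper to pull those fields apart — a sketch assuming exactly this three-field format, with a helper name of my own:

```python
def split_token(token):
    """Split a tagged token like 'going/VBG/go' into (word, tag, lemma).

    maxsplit=2 keeps any extra slashes inside the lemma intact."""
    word, tag, lemma = token.split("/", 2)
    return word, tag, lemma

print(split_token("going/VBG/go"))  # ('going', 'VBG', 'go')
print(split_token(",/,/,"))         # (',', ',', ',')
```

Note that punctuation tokens like ",/,/," still split cleanly, because the word, tag, and lemma are each a single character between the slashes.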
So I want to remove the words tagged "UH", ",", "PRP", etc., but I don't know how to do that without scrambling the sentence (for analysis purposes, in my example I will ignore sentences that don't contain the word "pizza").
I don't know if my explanation is very clear; if you don't understand something, feel free to ask.
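Since parse() returns a plain tagged string, one way to do this filtering (a sketch of my own, not the canonical pattern.en approach — the function name and tag list are assumptions) is to split each token on "/" and compare the tag field exactly, which avoids false matches when a tag string happens to appear inside a word or lemma:

```python
def strip_tags_from_parsed(parsed, unwanted={"UH", "PRP", "DT", ","}):
    """Keep only the words whose POS tag (the second slash-delimited
    field) is not in `unwanted`. `parsed` is a pattern.en parse()
    string of the form 'word/TAG/... word/TAG/...'."""
    kept = []
    for token in parsed.split(" "):
        fields = token.split("/")
        if len(fields) >= 2 and fields[1] not in unwanted:
            kept.append(fields[0])  # keep just the surface word
    return " ".join(kept)

parsed = "Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?"
print(strip_tags_from_parsed(parsed))  # "how is going ?"
```

Comparing fields[1] exactly (rather than testing `tag in token`) matters: "PRP" as a substring would also match "PRP$" tokens, which may or may not be what you want.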
EDIT - UPDATE: Following canyon289's answer, I want to process the text sentence by sentence instead of the whole text at once. I tried:
for sentence in Text(s):
    sentence = sentence.split(" ")
    print("SENTENCE :")
    for word in sentence:
        if not any(tag in word for tag in dont_want):
            print(word)
But I get the following error:
AttributeError: 'Sentence' object has no attribute 'split'
How can I fix this?
Answer 0 (score: 1)
This should work for you:
from pattern.en import parse

s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"

#Create a list of all the tags you don't want
dont_want = ["UH", "PRP"]

#Parse once, then split the tagged string into tokens
sentence = parse(s).split(" ")

#Go through all the words and drop any occurrence of the tags you don't want
#This is done through a nested list comprehension
[word for word in sentence if not any(tag in word for tag in dont_want)]
[u',/,/O/O', u'how/WRB/O/O', u'is/VBZ/B-VP/O', u'going/VBG/B-VP/O',
 u'am/VBP/B-VP/O', u'tired/VBN/I-VP/O', u'actually/RB/B-ADVP/O', u',/,/O/O',
 u'did/VBD/B-VP/O', u'not/RB/I-VP/O', u'sleep/VB/I-VP/O', u'enough/RB/B-ADVP/O',
 u'.../:/O/O\nThat/DT/O/O', u'is/VBZ/B-VP/O', u'bad/JJ/B-ADJP/O',
 u'for/IN/B-PP/B-PNP', u'work/NN/B-NP/I-PNP', u',/,/O/O', u'definitely/RB/B-ADVP/O']
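As for the sentence-by-sentence requirement from the edit: as far as I know, pattern's parsed Sentence objects expose a `.words` list rather than a string `.split()` method, which is what causes the AttributeError. A sketch that sidesteps the object API entirely — parse() separates sentences with newlines, so you can split on "\n" first and then filter each sentence's tokens (the function name is my own; the exact-field tag check is an assumption to avoid substring false positives):

```python
dont_want = ["UH", "PRP"]

def filter_sentences(parsed):
    """Yield each sentence of a parse() string with unwanted tokens removed."""
    for line in parsed.split("\n"):  # parse() puts one sentence per line
        tokens = line.split(" ")
        # Keep a token only if none of its slash-delimited fields
        # is an unwanted tag.
        yield [t for t in tokens
               if not any(tag in t.split("/") for tag in dont_want)]

parsed = ("Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?\n"
          "I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually")
for sentence in filter_sentences(parsed):
    print("SENTENCE :")
    for token in sentence:
        print(token)
```

If you prefer to stay with pattern's objects, iterating `sentence.words` and checking each word's POS tag should achieve the same thing without ever calling split() on a Sentence.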