Writing filtered ngrams to an outfile - list of lists

Date: 2017-02-07 06:31:37

Tags: python file nlp nltk n-gram

I have extracted trigrams matching a certain pattern from a bunch of HTML files. When I print them, I get a list of lists (where each line is a trigram). I want to write them to an outfile for further text analysis, but when I try, only the first trigram gets written. How do I write all the trigrams to the outfile (as a list of trigrams)? Ideally, I'd like all the trigrams merged into one list rather than many lists holding one trigram each. Any help is much appreciated.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation) 

# Open and read file
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')   
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()  # note: only the last file's text survives the loop

# Extract text from HTML using BeautifulSoup
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
extracted_text = extracted_text.replace('\n', ' ')  # join lines with a space so words don't run together

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text)

# Create list of tokens (after pre-processing: punctuation removal, tokenization)
all_tokens = []

for sent in sentences:
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary)
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization)
    all_tokens.extend(tokenized_sent) # add the sentence's tokens to the list

n=3
threegrams = ngrams(all_tokens, n)


# Find ngrams with specific pattern
for (first, second, third) in threegrams: 
    if first == "a":
        if second.endswith("bb") and second.startswith("leg"):
            print(first, second, third)
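To show the idea of collecting every match into a single list before writing it out, here is a minimal sketch; it uses a toy token list and plain zip-based trigrams (same shape as nltk.ngrams(tokens, 3)), so NLTK is not needed to run it, and the filenames are hypothetical:

```python
# Sketch: collect every matching trigram into one list,
# then write them all to an outfile, one trigram per line.
tokens = "a legobb cave and a legobb house".split()  # toy stand-in for all_tokens
trigrams = zip(tokens, tokens[1:], tokens[2:])       # zip-based trigrams

matches = [(a, b, c) for (a, b, c) in trigrams
           if a == "a" and b.startswith("leg") and b.endswith("bb")]

with open('trigrams_out.txt', 'w') as outfile:
    for first, second, third in matches:
        print(first, second, third, file=outfile)
```

One caveat worth knowing: nltk.ngrams returns a generator, so iterating over it once (e.g. for printing) exhausts it; materialize it with list(...) first if you need to loop over it more than once.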

1 Answer:

Answer 0 (score: 0)

First, removing the punctuation might be simpler, see Removing a list of characters in string

>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))  # Python 3; in Python 2 this was text.translate(None, punctuation)
'The lazy birds flew over the rainbow Well not have known'

But removing punctuation before tokenizing isn't quite right: you'll see We'll -> Well, which I think is undesirable.

This might be a better approach:

>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]

But note that the idiom above doesn't handle multi-character punctuation tokens.

E.g., word_tokenize() changes the opening " into `` (and the closing " into ''), and the idiom above doesn't remove them:

>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]

To handle this, build the punctuation list explicitly and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']

As for getting the document stream (what you called all_tokens), here's a neat way to get it:

>>> from collections import Counter
>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
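The Counter import above hints at counting; as a hypothetical follow-up, the per-sentence token lists can be flattened into one document stream and the trigrams counted (note this simple flattening lets trigrams cross sentence boundaries):

```python
from collections import Counter
from itertools import chain

# Per-sentence token lists, as produced by the list comprehension above
sent_tokens = [['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'],
               ['We', "'ll", 'not', 'have', 'known']]

# Flatten into one document stream (the all_tokens from the question)
all_tokens = list(chain.from_iterable(sent_tokens))

# Count trigrams via plain zip (same shape as nltk.ngrams(all_tokens, 3))
trigram_counts = Counter(zip(all_tokens, all_tokens[1:], all_tokens[2:]))
```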

Now, on to the actual part of your question.

What you really need isn't to check strings inside ngrams; you should consider regex pattern matching instead.

You want to find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4

>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
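If you want the three tokens separately rather than one matched string, a small variant of the same pattern with capturing groups (a hypothetical tweak) makes findall return tuples:

```python
import re

# Same idea, with each of the three tokens in its own capturing group
pattern = r"\b(a)\s(leg\w+bb)\s(\w+)\b"
matches = re.findall(pattern, "This is a legobatmanbb cave hahaha")
```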

Now, to write strings to a file, you can use this idiom, see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function

with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)

In fact, if you're only interested in the ngrams and not the tokens, there's no need to filter or tokenize the text ;P

You can simply extend your code like this:

import re

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)