I extracted trigrams matching a certain pattern from a bunch of HTML files. When I print them, I get a list of lists (where each line is a trigram). I'd like to print them to an outfile for further text analysis, but when I try, it only prints the first trigram. How can I print all of the trigrams to the outfile (as one list of trigrams)? Ideally, I'd like to merge all the trigrams into a single list, instead of having multiple lists with one trigram each. Thanks a lot for your help.
My code so far looks like this:
from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys

punctuation_set = set(punctuation)

# Open and read files
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')

for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()

    # Extract text from HTML using BeautifulSoup
    soup = BeautifulSoup(mytext, "lxml")
    extracted_text = soup.getText()
    extracted_text = extracted_text.replace('\n', '')

    # Split the text into sentences (using the NLTK sentence splitter)
    sentences = sent_tokenize(extracted_text)

    # Create list of tokens (after pre-processing: punctuation removal, tokenization)
    all_tokens = []
    for sent in sentences:
        sent = "".join([char for char in sent if char not in punctuation_set])  # remove punctuation from sentence (optional; comment out if necessary)
        tokenized_sent = word_tokenize(sent)  # split sentence into tokens (using NLTK word tokenization)
        all_tokens.extend(tokenized_sent)  # add tokens to list

    n = 3
    threegrams = ngrams(all_tokens, n)

    # Find ngrams with specific pattern
    for (first, second, third) in threegrams:
        if first == "a":
            if second.endswith("bb") and second.startswith("leg"):
                print(first, second, third)
Answer 0 (score: 0)
First, removing punctuation can be simpler; see Removing a list of characters in string (note that in Python 3, str.translate takes a mapping built with str.maketrans):
>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'
But removing punctuation before tokenization isn't quite right: you'll see We'll -> Well, which I think is undesirable.
This is probably a better approach:
>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
But note that the idiom above doesn't handle multi-character punctuation. E.g., we find that word_tokenize() changes " -> ``, and using the idiom above doesn't remove it:
>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
To handle this, convert punctuation into a list explicitly and append the multi-character punctuation tokens to it:
>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
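A side note: membership tests against a list scan the whole list for every token, so on longer texts a set is the usual choice. A minimal sketch (the token list here is a made-up example of word_tokenize() output):

```python
from string import punctuation

# Set membership is O(1) per lookup, vs O(n) for a list.
p = set(punctuation) | {'...', '``', "''"}

tokens = ['He', 'said', ',', '``', 'There', 'is', 'no', 'room', "''"]
filtered = [word for word in tokens if word not in p]
print(filtered)  # → ['He', 'said', 'There', 'is', 'no', 'room']
```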
As for getting the document stream (what you called all_tokens), here's a neat way to get it:
>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
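And since you said you'd rather have one flat list than one list per sentence, the nested lists can be merged with itertools.chain.from_iterable; a minimal sketch using the per-sentence output shown above:

```python
from itertools import chain

# Per-sentence token lists, as produced by the idiom above.
tokenized_sents = [['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'],
                   ['We', "'ll", 'not', 'have', 'known']]

# Flatten into a single document-level token stream.
all_tokens = list(chain.from_iterable(tokenized_sents))
print(all_tokens)
```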
Now on to the actual part of your question. What you really need isn't checking strings inside the ngrams; you should consider regex pattern matching instead. You want to find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b; see https://regex101.com/r/zBVgp4/4:
>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
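If you'd rather get the three words back as separate items (a trigram tuple) instead of one matched string, capture groups in a slightly adapted pattern would do it; this variant pattern is my own sketch, not part of the original code:

```python
import re

# Same idea as above, but with three capture groups,
# so findall returns (first, second, third) tuples.
pattern = r"\b(a)\s(leg\w+bb)\s(\w+)\b"
matches = re.findall(pattern, "This is a legobatmanbb cave hahaha")
print(matches)  # → [('a', 'legobatmanbb', 'cave')]
```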
Now, to write strings to a file, you can use this idiom; see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function:
with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)
In fact, if you're only interested in the ngrams without the tokens, there's no need to filter or tokenize the text ;P
You can simply shorten your code to this:
import re

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
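Finally, to get everything into the single combined list you asked for (rather than one result set per file), accumulate the matches with list.extend across the loop. A minimal sketch on in-memory strings, where the documents list and the second sample sentence are made up for illustration; in your code, each string would be the BeautifulSoup-extracted text of one file:

```python
import re

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

# Stand-ins for extracted_text from each HTML file.
documents = [
    "This is a legobatmanbb cave hahaha",
    "nothing interesting here",
    "we built a legorobinbb tower today",
]

all_matches = []  # one combined list across all documents
for doc in documents:
    all_matches.extend(re.findall(pattern, doc))

with open('filename.txt', 'w') as fout:
    for match in all_matches:
        print(match, file=fout)
```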