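I extracted trigrams from a bunch of HTML files according to a certain pattern. When I print them, I get a list of lists (each line is one trigram). I would like to print them to an outfile for further text analysis, but when I try, only the first trigram is written. How can I print all the trigrams to the outfile (the list of trigram lists)? Ideally, I would like to merge all the trigrams into a single list rather than have multiple lists with one trigram each. Thanks a lot for your help.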
from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys

punctuation_set = set(punctuation)

# Open and read file
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()

    # Extract text from HTML using BeautifulSoup
    soup = BeautifulSoup(mytext, "lxml")
    extracted_text = soup.getText()
    extracted_text = extracted_text.replace('\n', '')

    # Split the text in sentences (using the NLTK sentence splitter)
    sentences = sent_tokenize(extracted_text)

    # Create list of tokens (after pre-processing: punctuation removal, tokenization)
    all_tokens = []
    for sent in sentences:
        sent = "".join([char for char in sent if char not in punctuation_set])  # remove punctuation from sentence (optional; comment out if necessary)
        tokenized_sent = word_tokenize(sent)  # split sentence into tokens (using NLTK word tokenization)
        all_tokens.extend(tokenized_sent)  # add tokens to list

    threegrams = ngrams(all_tokens, 3)

    # Find ngrams with specific pattern
    for (first, second, third) in threegrams:
        if first == "a":
            if second.endswith("bb") and second.startswith("leg"):
                print(first, second, third)
Answer 0 (score: 0)
First, removing punctuation can be simpler; see Removing a list of characters in string:
>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'
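But removing punctuation before tokenizing is not quite right: note that We'll -> Well. Tokenizing first and filtering afterwards is probably closer to what you need: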
>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
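Still, the idiom above does not handle multi-character punctuation tokens. E.g., word_tokenize() changes " -> ``, and the idiom above does not remove it: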
>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
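To handle that, turn punctuation into a list explicitly and append the multi-character tokens to it:
>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'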
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
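The filtering step above could be wrapped in a small helper. This is only a sketch: `PUNCT_TOKENS` and `drop_punctuation` are hypothetical names, and it assumes the tokens have already been produced by a tokenizer such as word_tokenize(). A set is used here so that membership is an exact match, whereas `word not in punctuation` is a substring test against the punctuation string.

```python
from string import punctuation

# Single-character punctuation plus the multi-character tokens appended
# to the list above (hypothetical constant name).
PUNCT_TOKENS = set(punctuation) | {"...", "``", "''"}


def drop_punctuation(tokens):
    """Return only the tokens that are not punctuation tokens."""
    return [tok for tok in tokens if tok not in PUNCT_TOKENS]


# Token list copied from the word_tokenize() output shown earlier.
tokens = ['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
cleaned = drop_punctuation(tokens)
print(cleaned)
```

Tokens such as 's or 'll are kept, because only whole-token matches are dropped.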
Putting it together on the original example:
>>> from collections import Counter
>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
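Since the question asks for a single list rather than a list of lists, the per-sentence token lists can be flattened before building trigrams. A minimal sketch using only the standard library; `flatten` and `trigrams` are hypothetical helper names, and the zip-based window is a stand-in for nltk.ngrams(tokens, 3), not its implementation.

```python
from itertools import chain


def flatten(list_of_lists):
    """Merge a list of token lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


# Per-sentence token lists copied from the output above.
sentences = [['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'],
             ['We', "'ll", 'not', 'have', 'known']]
all_tokens = flatten(sentences)
all_trigrams = trigrams(all_tokens)
print(all_trigrams[:2])
```

As in the question's code, flattening first means some trigrams span a sentence boundary; build the trigrams per sentence instead if that is not wanted.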
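As for finding the specific ngrams: you are looking for a, followed by a word that starts with leg and ends with bb, followed by any word. A regex can do that directly on the text:
>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]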
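If the three words are needed separately, as in the question's `(first, second, third)` loop, the same idea can be written with capturing groups. This is a sketch rather than the answer's own pattern: `TRIGRAM_RE` and `find_trigrams` are hypothetical names, and the sample sentence is made up for illustration.

```python
import re

# One capturing group per word. Like the pattern above, \w+ requires at
# least one character between "leg" and "bb", so a bare "legbb" is skipped.
TRIGRAM_RE = re.compile(r"\b(a)\s(leg\w+bb)\s(\w+)\b")


def find_trigrams(text):
    """Return a (first, second, third) tuple for every match in text."""
    return TRIGRAM_RE.findall(text)


matches = find_trigrams("This is a legobatmanbb cave and a legwebb bridge")
print(matches)
```

With more than one group, re.findall() returns a list of tuples, so the result is already a single flat list of trigrams.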
To write to a file, pass file= to print():
with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)
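In fact, if you are only interested in the ngrams and not the tokens, there is no need to filter or tokenize the text at all ;P The code can be reduced to: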
import re
from bs4 import BeautifulSoup

# mytext is the HTML read from a file, as in the question
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
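When there are several input files, one way to end up with a single list and a complete outfile is to collect the matches from every text first and open the output file only once. The question does not show its writing code, so it is only a guess that reopening the file in 'w' mode inside the loop (which truncates it each time) caused the missing trigrams. In the sketch below, `collect_trigrams`, `write_lines`, the sample texts and the temporary output path are all hypothetical; the strings stand in for the text extracted from each HTML file.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b")


def collect_trigrams(texts):
    """Return one flat list with the matches from every text."""
    found = []
    for text in texts:
        found.extend(PATTERN.findall(text))
    return found


def write_lines(lines, out_path):
    """Open the output file once and write one item per line."""
    with open(out_path, "w", encoding="utf-8") as fout:
        for line in lines:
            print(line, file=fout)


# Stand-ins for the text extracted from each HTML file.
texts = ["This is a legobatmanbb cave", "nothing here", "see a legwebb bridge"]
all_matches = collect_trigrams(texts)

out_path = Path(tempfile.mkdtemp()) / "trigrams.txt"
write_lines(all_matches, out_path)
print(out_path.read_text(encoding="utf-8"))
```

Opening the file in append mode ('a') inside the loop would also keep earlier results, at the cost of accumulating lines across runs.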