Question

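I am completely new to nltk and Python. I have been given the task of extracting all the text from a URL. After reading the nltk documentation, I tried this and was able to extract the text from a specified URL. My main concern is how to remove special characters (such as . , - " ' !) from the extracted list. The code below does not work for text inside the <li> </li> tags of the HTML page, so a dot . always stays attached to the last word of the text inside an <li> tag. Any help is much appreciated. The source code is below.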

from bs4 import BeautifulSoup 
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Electronics') 
f=open('corpus.txt','w+')
html = response.read() 
soup = BeautifulSoup(html,"html.parser") 
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
clean_tokens = tokens[:] 
sr = stopwords.words('english') 
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
freq = nltk.FreqDist(clean_tokens)
for normalize,val in freq.items():
    lemmatizer=WordNetLemmatizer()
    corpus_refi=lemmatizer.lemmatize(str(normalize) + ':' + str(val), pos="a")
    corpus_refi=corpus_refi.lower()
    print(corpus_refi)

Answer 1

I'm not sure I have understood your question correctly, but if you want to detect punctuation, you can do something like this.

from string import punctuation
punc = set(punctuation)
# then, inside your for loop, handle a token only if it is not a punctuation mark
if token not in punc:
    ...  # process the token here
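
As a rough, self-contained sketch of that membership check (the helper name drop_punctuation_tokens and the sample list are made up for illustration, not taken from the question), the filter can be written as a list comprehension:

```python
from string import punctuation

# Set of ASCII punctuation characters for fast membership tests.
PUNC = set(punctuation)


def drop_punctuation_tokens(tokens):
    """Return only the tokens that are not a single punctuation character.

    Hypothetical helper for illustration; not part of nltk.
    """
    return [t for t in tokens if t not in PUNC]


sample = ["Electronics", "-", "is", "!", "fun."]
print(drop_punctuation_tokens(sample))  # ['Electronics', 'is', 'fun.']
```

Note that "fun." survives: the set only matches tokens that consist of exactly one punctuation character, not punctuation attached to a word.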

If a token has several characters and one of them is punctuation, you can remove it by doing something like this:

token = token.translate(str.maketrans('', '', punctuation))
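
Applied to a whole token list, the idea might look like the sketch below. The names STRIP_PUNC and strip_punctuation and the sample tokens are assumptions for illustration; only str.maketrans, str.translate and string.punctuation come from the standard library.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNC = str.maketrans('', '', punctuation)


def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens left empty.

    Hypothetical helper for illustration; not part of nltk.
    """
    stripped = (t.translate(STRIP_PUNC) for t in tokens)
    return [t for t in stripped if t]


sample = ["circuits.", "(diodes),", "--", "well-known"]
print(strip_punctuation(sample))  # ['circuits', 'diodes', 'wellknown']
```

string.punctuation covers ASCII characters only, so typographic quotes such as “ and ” would pass through unchanged; if the page contains them, they can be appended to the third argument of str.maketrans. Hyphenated words are also joined ("wellknown"), which may or may not be what you want.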
