I am trying to build a word list from HTML documents stored on a CD. When I try to split the words apart and add them to my word vector, I end up with a mess.
def get_word_vector(self):
    line = self.soup.get_text()
    re.sub("s/(\\u[a-e0-9][a-e0-9][a-e0-9]//|\\n)","",line)
    for word in line.split("\s+"):
        for the_word in word.split("[,.\"\\/?!@#$%^&*\{\}\[\]]+"):
            if the_word not in self.word_vector:
                self.word_vector[the_word]=0
            self.word_vector[the_word]+=1
            self.doc_length=self.doc_length+1
    for keys in self.word_vector:
        print "%r: %r" % (keys, self.word_vector[keys]) #So I can see whats happening
When I test it on a wiki page, I get (small sample):
"Horse Markings"\n"Horse and Pony Head Markings"\n"Horse and Pony Leg Markings"\n"Identifying Horse parts and markings," Adapted From: Horses For Dummies, 2nd Edition.\n\n\n\n\n\n\n[hide]
as a single word. The document is being read into BS4 with:
self.soup = BeautifulSoup(open(fullpath,"r"))
I don't understand why this is happening. I'm guessing the regex is failing because it's wrong???
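From poking at it, I suspect two things: re.sub returns a new string (my call just throws the result away), and str.split treats its argument as a literal separator rather than a regex, so splitting on "\s+" never does what I expect. Here is a rough sketch of what I think I meant to write, using re.split instead (the cleanup pattern is only my guess at what the original substitution was for):

import re

def get_word_vector(self):
    line = self.soup.get_text()
    # keep re.sub's return value; it does not modify line in place
    # assumption: the original sub was meant to strip literal \uXXXX escapes and newlines
    line = re.sub(r"\\u[0-9a-e]{4}|\n", " ", line)
    # re.split splits on a pattern; str.split only splits on a literal string
    for word in re.split(r"\s+", line):
        for the_word in re.split(r'[,."\\/?!@#$%^&*{}\[\]]+', word):
            if not the_word:
                continue
            if the_word not in self.word_vector:
                self.word_vector[the_word] = 0
            self.word_vector[the_word] += 1
            self.doc_length += 1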
Answer 0 (score: 2)
Just an alternative option: get the text via get_text(), then use nltk.tokenize to get the list of words from that text. The point here is not to reinvent the wheel, and to use specialized tools for specific jobs: BeautifulSoup for HTML parsing, nltk for text processing:
from urllib2 import urlopen
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
print tokenizer.tokenize(soup.get_text())
This prints:
[u'Stack', u'Overflow', u'Wikipedia', u'the', u'free', u'encyclopedia', ... ]
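If you then need the same per-word counts you were building in word_vector, a minimal sketch on top of the tokenizer output above could look like this (word_vector and doc_length are just illustrative names mirroring your attributes):

from collections import Counter

tokens = tokenizer.tokenize(soup.get_text())
word_vector = Counter(tokens)   # maps each word to its number of occurrences
doc_length = len(tokens)        # total number of tokens

for word, count in word_vector.items():
    print "%r: %r" % (word, count)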