Python re.sub, re.split fail to split words in a long article

Time: 2014-08-09 02:04:45

Tags: python regex python-2.7 beautifulsoup

I am trying to build a word list from an HTML document stored on disk. When I try to split out the words and add them to my word vector, I end up with a mess.

def get_word_vector(self):
    line = self.soup.get_text()
    re.sub("s/(\\u[a-e0-9][a-e0-9][a-e0-9]//|\\n)","",line)
    for word in line.split("\s+"):
        for the_word in word.split("[,.\"\\/?!@#$%^&*\{\}\[\]]+"):
            if the_word not in self.word_vector:
                self.word_vector[the_word]=0
            self.word_vector[the_word]+=1
            self.doc_length=self.doc_length+1
    for keys in self.word_vector:
        print "%r: %r" % (keys, self.word_vector[keys]) #So I can see whats happening

When testing it on a wiki page, I get (a small sample):

"Horse Markings"\n"Horse and Pony Head Markings"\n"Horse and Pony Leg Markings"\n"Identifying Horse parts and markings," Adapted From: Horses For Dummies, 2nd Edition.\n\n\n\n\n\n\n[hide]

all stored as a single "word". The document is being read into BS4 with:

  self.soup = BeautifulSoup(open(fullpath,"r"))

I don't understand why this is happening. I'm guessing the regex fails because it's wrong???
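A side note on the code above: str.split treats its argument as a literal separator, not a regular expression, and re.sub returns the cleaned string rather than changing line in place, so its result has to be assigned back. Below is a minimal standalone sketch of what the splitting step appears to intend; the sample string and the pattern are only approximations of the ones in the question:

import re

# Sample text standing in for soup.get_text() (hypothetical, for illustration only)
line = 'Horse Markings\n"Horse and Pony Head Markings"\nIdentifying Horse parts and markings, Adapted From: Horses For Dummies, 2nd Edition.'

word_vector = {}
doc_length = 0

# re.sub returns a new string; the result must be assigned back to keep it
line = re.sub(r"\n+", " ", line)

# re.split (unlike str.split) interprets its pattern as a regular expression
for the_word in re.split(r'[\s,."\\/?!@#$%^&*{}\[\]]+', line):
    if the_word:  # drop the empty strings produced around punctuation runs
        word_vector[the_word] = word_vector.get(the_word, 0) + 1
        doc_length += 1

for key in word_vector:
    print "%r: %r" % (key, word_vector[key])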

1 Answer:

Answer 0 (score: 2):

Just an alternative option: get the text via get_text(), then use nltk.tokenize to get the list of words from that text. The point here is not to reinvent the wheel and to use specialized tools for specific jobs: BeautifulSoup for HTML parsing, nltk for text processing:

from urllib2 import urlopen
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
print tokenizer.tokenize(soup.get_text())

This prints:

[u'Stack', u'Overflow', u'Wikipedia', u'the', u'free', u'encyclopedia', ... ]
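
If the end goal is the same word-count dictionary as in the question, the token list can be fed into collections.Counter. A small sketch building on the answer's code (Counter is a stand-in for the hand-rolled dict, not something the answer itself uses):

from collections import Counter
from urllib2 import urlopen

from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(soup.get_text())

word_vector = Counter(tokens)  # maps each word to its count
doc_length = len(tokens)       # total number of tokens seen

print word_vector.most_common(10)  # the ten most frequent words
print doc_length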