Python re.sub, re.split fail to split words in a long article

Time: 2014-08-09 02:04:45

Tags: python regex python-2.7 beautifulsoup

I am trying to build a word list from an HTML document stored on disk. When I try to split out the words and add them to my word vector, I end up with a mess.

def get_word_vector(self):
    line = self.soup.get_text()
    re.sub("s/(\\u[a-e0-9][a-e0-9][a-e0-9]//|\\n)","",line)
    for word in line.split("\s+"):
        for the_word in word.split("[,.\"\\/?!@#$%^&*\{\}\[\]]+"):
            if the_word not in self.word_vector:
                self.word_vector[the_word]=0
            self.word_vector[the_word]+=1
            self.doc_length=self.doc_length+1
    for keys in self.word_vector:
        print "%r: %r" % (keys, self.word_vector[keys]) #So I can see whats happening

When testing it on a wiki page, I get (a small sample):

"Horse Markings"\n"Horse and Pony Head Markings"\n"Horse and Pony Leg Markings"\n"Identifying Horse parts and markings," Adapted From: Horses For Dummies, 2nd Edition.\n\n\n\n\n\n\n[hide]

all stored as a single "word". The document is being read into BS4 with:

  self.soup = BeautifulSoup(open(fullpath,"r"))

I don't understand why this is happening. I'm guessing the regex fails because it's wrong???
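A side note on the code above: str.split treats its argument as a literal separator, not a regular expression, and re.sub returns the cleaned string rather than changing line in place, so its result has to be assigned back. Below is a minimal standalone sketch of what the splitting step appears to intend; the sample string and the pattern are only approximations of the ones in the question:

import re

# Sample text standing in for soup.get_text() (hypothetical, for illustration only)
line = 'Horse Markings\n"Horse and Pony Head Markings"\nIdentifying Horse parts and markings, Adapted From: Horses For Dummies, 2nd Edition.'

word_vector = {}
doc_length = 0

# re.sub returns a new string; the result must be assigned back to keep it
line = re.sub(r"\n+", " ", line)

# re.split (unlike str.split) interprets its pattern as a regular expression
for the_word in re.split(r'[\s,."\\/?!@#$%^&*{}\[\]]+', line):
    if the_word:  # drop the empty strings produced around punctuation runs
        word_vector[the_word] = word_vector.get(the_word, 0) + 1
        doc_length += 1

for key in word_vector:
    print "%r: %r" % (key, word_vector[key])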

1 Answer:

Answer 0 (score: 2):

Just an alternative option: get the text via get_text(), then use nltk.tokenize to get the list of words from that text. The point here is not to reinvent the wheel and to use specialized tools for specific jobs: BeautifulSoup for HTML parsing, nltk for text processing:

from urllib2 import urlopen
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
print tokenizer.tokenize(soup.get_text())

This prints:

[u'Stack', u'Overflow', u'Wikipedia', u'the', u'free', u'encyclopedia', ... ]
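
If the end goal is the same word-count dictionary as in the question, the token list can be fed into collections.Counter. A small sketch building on the answer's code (Counter is a stand-in for the hand-rolled dict, not something the answer itself uses):

from collections import Counter
from urllib2 import urlopen

from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/Stack_Overflow'))
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(soup.get_text())

word_vector = Counter(tokens)  # maps each word to its count
doc_length = len(tokens)       # total number of tokens seen

print word_vector.most_common(10)  # the ten most frequent words
print doc_length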