Question

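I can't figure out why I'm getting this error. I'm following this tutorial to extract the actual text, but I don't understand the error.

Can someone take a look at my code?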

import urllib
from bs4 import BeautifulSoup
import re


url = "https://en.wikipedia.org/wiki/Python_(programming_language)" # link of website
html = urllib.urlopen(url).read() # reading and opening link
soup = BeautifulSoup(html) #parsing


for script in soup(["script", "style","a","<div id=\"bottom\" >"]): # all tags
    script.extract()    # clear out


for p in soup.find_all('p'): # loop for printing text
    r = re.sub("<.*?>", "", p) # expression to get rid from <p> <b> etc
    print r

Error:

Traceback (most recent call last):
  File "C:/Users/DELL/Desktop/python/s/fyp/textextractioon.py", line 16, in <module>
    r = re.sub("<.*?>", "", p)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

Answer 1

Change your final loop to:

for p in soup.find_all('p'): # loop for printing text
    r = re.sub("<.*?>", "", p.text) # expression to get rid from <p> <b> etc
    print r

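Each p is a 'bs4.element.Tag' object, not a string, which is why re.sub raises the TypeError. A Tag has built-in attributes and methods, such as .text for its string content, so they are worth a look.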
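
To make the point above concrete, here is a minimal sketch. It assumes a small inline HTML string (made up for illustration) instead of fetching the Wikipedia page, and it passes "html.parser" explicitly. It is written for Python 3, where the TypeError message is worded slightly differently than in the Python 2.7 traceback. It shows that each p is a Tag, that re.sub rejects it, and two ways to get a string from it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of downloading a real page.
html = (
    "<html><body>"
    "<p>Python is a <b>programming</b> language.</p>"
    "<p>It was created by <a href='#'>Guido</a>.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")
first = paragraphs[0]

# Each result of find_all is a bs4.element.Tag, not a string.
type_name = type(first).__name__

# re.sub only accepts strings, so passing the Tag itself raises TypeError.
try:
    re.sub("<.*?>", "", first)
    raised = False
except TypeError:
    raised = True

# get_text() (also reachable as .text) returns the text with markup removed,
# so no regular expression is needed afterwards.
texts = [p.get_text() for p in paragraphs]

# Alternatively, str(tag) gives the markup as a string that re.sub accepts.
stripped = re.sub("<.*?>", "", str(first))

print(type_name)
print(raised)
for t in texts:
    print(t)
```

Since .text already drops the nested tags, the re.sub call in the loop above becomes a no-op; it only matters if you convert the Tag with str() first.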
