I am scraping web pages with XPath and BeautifulSoup. XPath needs a tree as input, and BeautifulSoup needs a soup. Here is the code that builds the tree and the soup:
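import requests
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4 (the bs4 package)
from lxml import html

# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
    return soup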
Both functions call requests.get(url), and that response is what I want to store on disk. Here is my Python code:
import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)
Then I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response
Here is the code that stores the content instead; it also raises an error:
import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'
How can I solve this problem?
Answer 0 (score: 3)
infile = open("html", "rb")  # this is a file object, not a string
You need to read its contents with read() first, rather than just opening it:
with open("html", "rb") as infile:
    content = infile.read()  # bytes, which fromstring() accepts
tree = html.fromstring(content)
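To make the save-then-read round trip concrete, here is a minimal offline sketch. It assumes lxml is installed; the `sample` bytes and the temporary directory are hypothetical stand-ins for the page that the question saved with outfile.write(c).

```python
import os
import tempfile

from lxml import html

# Hypothetical stand-in for the bytes saved earlier from r.content.
sample = b"<html><head><title>Saved page</title></head><body><p>Hi</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "html")

    # Save the raw bytes, as the question does with outfile.write(c).
    with open(path, "wb") as outfile:
        outfile.write(sample)

    # read() returns the file's contents as bytes; the file object
    # itself cannot be sliced, which is what fromstring() tripped on.
    with open(path, "rb") as infile:
        content = infile.read()

# fromstring() accepts the bytes and returns the root element.
tree = html.fromstring(content)
title = tree.findtext(".//title")
```

The with blocks close both files even if parsing fails, which the bare open() calls in the question do not guarantee.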
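Answer 1 (score: 0)
requests.get returns a Response object.
write() expects a string or bytes, not a Response. What you want is the body of the response: r.content gives it as bytes, and r.text gives it as decoded text.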
r = requests.get(url).content
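The same idea as a small offline sketch: write the body bytes, not the Response object, to a file opened in binary mode. The helper name `save_body` and the `body` value are hypothetical; in the real script `body` would be requests.get(url).content.

```python
import os
import tempfile


def save_body(path, body):
    """Write a response body (bytes) to path and return the byte count."""
    if not isinstance(body, (bytes, bytearray)):
        # Passing the Response object itself ends up here.
        raise TypeError(
            "expected bytes such as r.content, got %s" % type(body).__name__
        )
    with open(path, "wb") as f:
        return f.write(body)


# Hypothetical stand-in for requests.get(url).content, so this runs offline.
body = b"<html><head><title>Example</title></head></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "html")
    written = save_body(path, body)

    # Read the file back to confirm the bytes survived unchanged.
    with open(path, "rb") as f:
        round_trip = f.read()
```

Keeping the bytes from r.content, rather than r.text, avoids an encode/decode step and lets lxml detect the page encoding itself.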
Answer 2 (score: 0)
fromstring() expects a string as input. Since you have a file, you need to use parse() instead:
>>> tree = html.parse(infile)
>>> tree.findtext('//title')
When Divorce Is a Family Affair - Room for Debate - NYTimes.com
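A self-contained sketch of the parse() route, assuming lxml is installed. The `sample` bytes are a hypothetical stand-in for the saved page, and the sketch uses the relative path ".//title" for the lookup.

```python
import os
import tempfile

from lxml import html

# Hypothetical stand-in for the saved page.
sample = b"<html><head><title>Saved page</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "html")
    with open(path, "wb") as outfile:
        outfile.write(sample)

    # parse() reads from an open file object (or a filename) and returns
    # an ElementTree, unlike fromstring(), which returns the root element.
    with open(path, "rb") as infile:
        doc = html.parse(infile)

parsed_title = doc.findtext(".//title")
root_tag = doc.getroot().tag
```

Because parse() returns a whole document tree, call getroot() when an element is needed, for example before using element-only methods.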