Question

I am scraping web pages with XPath and BeautifulSoup. XPath needs a tree as input, and BeautifulSoup needs a soup. Here is the code that builds each one:

def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup

Both functions call requests.get(url), and the result of that call is what I want to store. Here is my Python code:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

Then I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

Here is the code that stores the response content instead, but I get an error when I read it back:

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How can I solve this?

Answer 1

infile = open("html", "rb")  # this is a file object, not a string

You need to read it with read() first, not just open it:

with open("html", "rb") as infile:
    data = infile.read()  # the file's contents as a byte string
tree = html.fromstring(data)
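
The traceback in the question hints at why the file object fails: fromstring() slices its argument (html[:10]), and a file object cannot be sliced, while the bytes returned by read() can. Here is a minimal standard-library sketch of that difference. The helper leading_bytes is hypothetical and only imitates the slicing line from the traceback; the sample markup and temporary path are made up for illustration.

```python
import os
import tempfile


def leading_bytes(source, n=10):
    """Imitate the line from the traceback: html[:10].lstrip().lower()."""
    return source[:n].lstrip().lower()


# Write a small made-up document to a temporary file named "html".
path = os.path.join(tempfile.mkdtemp(), "html")
with open(path, "wb") as outfile:
    outfile.write(b"  <HTML><head><title>Demo</title></head></HTML>")

with open(path, "rb") as infile:
    try:
        leading_bytes(infile)      # a file object does not support slicing
        file_object_sliced = True
    except TypeError:
        file_object_sliced = False
    infile.seek(0)
    data = infile.read()           # bytes, which do support slicing

start = leading_bytes(data)
```

Passing data rather than infile to html.fromstring() avoids the TypeError for the same reason.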

Answer 2

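requests.get returns a Response object.

write() needs a string of bytes, not a Response object. What you want is the content of the response, r.content, which holds the raw body as bytes (r.text is the decoded text).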

r = requests.get(url).content
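
As a rough sketch of this answer, the helpers below write the body of a response to disk as bytes and read it back. The names save_response and load_bytes are hypothetical, and the SimpleNamespace object merely stands in for the result of requests.get(url) so the example runs without network access.

```python
import os
import tempfile
from types import SimpleNamespace


def save_response(response, path):
    """Write the raw body (response.content, bytes) to path."""
    with open(path, "wb") as outfile:
        outfile.write(response.content)
    return path


def load_bytes(path):
    """Read the stored body back as bytes."""
    with open(path, "rb") as infile:
        return infile.read()


# Stand-in for r = requests.get(url); only the .content attribute is used.
fake_response = SimpleNamespace(content=b"<html><body>stored page</body></html>")

path = os.path.join(tempfile.mkdtemp(), "html")
save_response(fake_response, path)
restored = load_bytes(path)
```

With the real library, the call would presumably look like save_response(requests.get(url), "html"), and the restored bytes can then be handed to html.fromstring() or BeautifulSoup().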

Answer 3

fromstring() expects a string as input. Since you have a file, you need to use parse() instead:

>>> tree = html.parse(infile)
>>> tree.findtext('//title')
When Divorce Is a Family Affair - Room for Debate - NYTimes.com
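
To show the two entry points side by side, here is a small sketch that assumes lxml is installed. The markup is made up, and io.BytesIO stands in for the file opened with open("html", "rb"). fromstring() takes the document itself and returns the root element, while parse() takes a file or file-like object and returns an ElementTree.

```python
import io

from lxml import html

# Made-up document; in the question this would be the saved page.
raw = b"<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"

# fromstring() takes the document contents and returns the root element.
root = html.fromstring(raw)
title_from_string = root.findtext(".//title")

# parse() takes a file (or file-like) object and returns an ElementTree.
infile = io.BytesIO(raw)  # stands in for open("html", "rb")
doc = html.parse(infile)
title_from_file = doc.findtext(".//title")
```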

Storing HTML in Python
