Storing HTML in Python

Date: 2014-11-07 21:50:37

Tags: python html beautifulsoup lxml lxml.html

I am using XPath and BeautifulSoup to scrape web pages. XPath needs a tree as input, and BeautifulSoup needs a soup. Here is the code that builds the tree and the soup:

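import requests
from bs4 import BeautifulSoup
from lxml import html

# get tree (lxml element, used for XPath queries)
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup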

Both functions call requests.get(url), and its result is what I want to store on disk. This is the code I tried:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

It fails with this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

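Here is the code that stores the response content instead. Writing the file works, but reading it back into lxml raises an error: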

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How can I fix this?

3 answers:

Answer 0: (score: 3)

infile = open("html", "rb")  # this is a file object, not a string

You need to read it with read() first, not just open it:

infile = open("html", "rb")
infile=infile.read()
tree = html.fromstring(infile)
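
As a minimal sketch of the fix above, the snippet below writes some bytes to a file, reads them back with read(), and only then passes them to html.fromstring(). The sample markup and the temporary file path are made up for illustration and stand in for the saved r.content.

```python
import os
import tempfile

from lxml import html

# Hypothetical stand-in for r.content from requests.get(url).
sample = b"<html><head><title>Saved page</title></head><body><p>Hi</p></body></html>"

# Store the bytes on disk; "wb" because the data is bytes, not text.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as outfile:
    outfile.write(sample)

# Read the file contents back; read() returns bytes, not a file object.
with open(path, "rb") as infile:
    data = infile.read()

# fromstring() accepts the bytes; passing the file object itself fails.
tree = html.fromstring(data)
title = tree.findtext(".//title")
print(title)
```

Using with blocks also closes both files automatically, which the original snippet leaves to the caller.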

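Answer 1: (score: 0)

requests.get returns a Response object.

I'd guess write() needs the raw data rather than the Response object itself. What you want is the response's content, which is the body as bytes and can be written to a file opened in binary mode.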

r = requests.get(url).content
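
To illustrate the difference, here is a small sketch that avoids the network: FakeResponse is a hypothetical stand-in for the object returned by requests.get, modeling only a .content attribute, and save_html is a helper name invented for this example. Writing the object itself fails with a TypeError, while writing its content succeeds.

```python
import os
import tempfile


class FakeResponse:
    """Hypothetical stand-in for requests.Response; only .content is modeled."""
    content = b"<html><head><title>Demo</title></head><body></body></html>"


def save_html(content, path):
    """Write the raw body bytes to path and return the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(content)


r = FakeResponse()
path = os.path.join(tempfile.mkdtemp(), "page.html")

# Passing the response object itself reproduces the kind of error in the question.
try:
    save_html(r, path)
    write_failed = False
except TypeError:
    write_failed = True

# Passing the body bytes works.
written = save_html(r.content, path)
print(write_failed, written)
```

With the real library, the equivalent call would be save_html(requests.get(url).content, path).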

Answer 2: (score: 0)

fromstring() expects a string as input. Since you have a file, you need to use parse() instead:

>>> tree = html.parse(infile)
>>> tree.findtext('//title')
When Divorce Is a Family Affair - Room for Debate - NYTimes.com
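
A self-contained sketch of the parse() route, assuming a small made-up HTML file in place of the saved page: html.parse() takes the open file object directly and returns an element tree, on which XPath queries can be run.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample file standing in for the page saved to disk.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as outfile:
    outfile.write(b"<html><head><title>From parse()</title></head><body></body></html>")

# parse() accepts a file object (or a filename), so no read() call is needed.
with open(path, "rb") as infile:
    tree = html.parse(infile)

# xpath() on the tree returns a list of matching text nodes.
titles = tree.xpath("//title/text()")
print(titles)
```

Either approach should give the same tree; parse() simply saves the explicit read() step.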