Scraping information from a site with lxml

Time: 2011-01-16 00:36:45

Tags: python html parsing screen-scraping lxml

I am trying to use lxml to get a list of all the titles from Reddit.com. I used this query:

  reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
  reddit.xpath("//div[contains(@class,'title')]//b/text()")

However, when I run the expression, nothing shows up in the Python shell. Is the XPath incorrect?

I am running Python 2.7.

Here is the full code:

import urllib
import os, random, sys, math
from lxml import etree

def main():

    reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
    reddit.xpath("//div[contains(@class,'title')]//b/text()")



if __name__ == "__main__":
    main()

2 Answers:

Answer 0 (score: 6)

Reddit has an API, so you don't need to scrape it. Just append '.json' to the end of the URL:

#!/usr/bin/env python
import json
import urllib2

url = "http://www.reddit.com/r/all/top/.json"
data = json.load(urllib2.urlopen(url))
for child in data['data']['children']:
    print child['data']['title']

Sample output:

Dear America, I Saw You Naked: And yes, we were laughing. Confessions of an ex-TSA agent
My wife and I are expecting our son in June, so I installed a fiber-optic star ceiling :)
You wouldn't download a car: Honda releases concept car 3D printing files
So my liquor store I managed closed today, the VP came in to collect the liquor but told me "we're not going to resell the beer, we'll be here about an hour fill up your car."
Baby Olinguito (Recently Discovered Species!)
Bower Bird- in a desperate bid for attention from the opposite sex, Bower males build nests, then decorate with objects of a single color. (xpost- /r/everythingscience)
My friend works as a English teacher in Sweden.
My kid's homework, I think the page designer has had enough.
Man Washes up in Marshall Islands 'After 16 Months Adrift' at sea
Kitten plays the air harp
New roommate already started off on a bad note with us.
MRW a program crashes and asks to contact tech support... and I am tech support.
Jack Black just posted this to facebook. "This is fan art. But it's exactly how I remember it."
Looks like Colorado's legalization has caused problems after all. [4]
My new kitten likes to "hold hands." She does this for as long as you offer your finger.
Ahahaha he got you go-wahhhh
Shipwrecked man makes land 'after 16 months adrift'
As someone who's taken math at university
Footage released of Guardian editors destroying Snowden hard drives: GCHQ technicians watched as journalists took angle grinders and drills to computers after weeks of tense negotiations
TIL Mike Tyson offered a zoo attendant $10,000 to open the cage of a bullying gorilla so he could "smash that silverback's snotbox." His offer was declined.
Microsoft being helpful as always
President Barack Obama says in a new interview that he would support efforts to remove marijuana from the federal government’s list of the most serious narcotics, but that Congress must act to make the change.
advisory
Vila Franca's Islet, Azores Archipelago, Portugal [1440x900] - How can that be so spherical?
The dad on my Child Development book is putting the kids helmet on backwards.
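
For experimenting without a network connection, the traversal used above can be sketched against an inline payload. This is only an illustration: SAMPLE_LISTING is an invented, heavily trimmed stand-in that keeps just the keys the answer's loop reads (data, children, title), and extract_titles is a hypothetical helper name. The print() call form is used so the snippet should run on both Python 2.7 and Python 3.

```python
import json

# Invented, trimmed stand-in for the '.json' response; only the keys that
# the answer's loop reads (data -> children -> data -> title) are present.
SAMPLE_LISTING = json.loads("""
{
  "data": {
    "children": [
      {"data": {"title": "First example title"}},
      {"data": {"title": "Second example title"}}
    ]
  }
}
""")


def extract_titles(listing):
    """Return the title of every child entry in a parsed listing dict."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


titles = extract_titles(SAMPLE_LISTING)
for title in titles:
    print(title)
```

Keeping the traversal in a small function makes it easy to swap the inline sample for the dict returned by json.load on the live URL.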

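Answer 1 (score: 2)

You are not connected to the internet. Try again.

and/or

Either your Python installation is broken, or you have mixed two stack traces together... note that the path suddenly changes from 3.1 to 2.7!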
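
One way to check which interpreter is actually running a script is to print its version and executable path from inside the script. The following is a minimal sketch using only the standard sys module; interpreter_info is a hypothetical helper name, not something from the question.

```python
import sys


def interpreter_info():
    """Return the running interpreter's (major, minor, micro) version and its path."""
    return tuple(sys.version_info[:3]), sys.executable


version, executable = interpreter_info()
print("Python %d.%d.%d" % version)
print(executable)
```

If the version or path printed here differs from the one you expected, the script is being launched by a different installation than the one you think.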

Update

Nothing shows up in the shell because you are not printing anything.

At the very least, instead of reddit.xpath("blahblah"), do this:

result = reddit.xpath("blahblah")
print result

You will see that your current version of "blahblah" produces [], and you can then watch whether fiddling with "blahblah" improves the situation.
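
As a rough sketch of that workflow, the snippet below runs the question's XPath and one adjusted expression against a small inline HTML fragment and prints both results. The markup in SAMPLE_HTML is invented for illustration and is not claimed to match Reddit's real page structure; run_xpath and the adjusted expression are likewise assumptions. It requires lxml, and the print() call form should work on both Python 2.7 and Python 3.

```python
from lxml import etree

# Invented markup for illustration only; the real page may look different.
SAMPLE_HTML = """
<html><body>
  <div class="entry">
    <p class="title"><a class="title" href="/a">First post</a></p>
  </div>
  <div class="entry">
    <p class="title"><a class="title" href="/b">Second post</a></p>
  </div>
</body></html>
"""


def run_xpath(html, expression):
    """Parse the HTML string and return whatever the XPath expression matches."""
    tree = etree.HTML(html)
    return tree.xpath(expression)


# The expression from the question: no <div> here has a 'title' class and
# there is no <b> element, so nothing matches.
original = run_xpath(SAMPLE_HTML, "//div[contains(@class,'title')]//b/text()")
print(original)

# An adjusted expression that fits this sample markup.
adjusted = run_xpath(SAMPLE_HTML, "//a[contains(@class,'title')]/text()")
print(adjusted)
```

Printing the result after each change to the expression shows at once whether the change matched anything.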
