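I don't know much about lxml or XPath, and I want to learn how to scrape data from a website. When I run this code I get no results, and I don't know why. Please help me fix it.
Code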
from lxml import html
import requests
pageLen=str(100)
page = requests.get('http://www.yellowpages.com/search?search_terms=lawyer&geo_location_terms=usa&page=2')
print(page)
tree = html.fromstring(page.content)
#phoneNumber = tree.xpath('//span[@class="c411Phone"]/text()')
Link=tree.xpath('//div[@class="info"]/a/@href')
Bname=tree.xpath('//a[@class="business-name"]/text()')
print(Link)
print(Bname)
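When a scrape prints nothing, one way to narrow the problem down is to count how many nodes each XPath expression selects. The following is only a sketch: the SAMPLE markup and the count_matches helper are made up for illustration and stand in for page.content, so the live page may be structured differently.

```python
from lxml import html

# Hypothetical markup standing in for page.content; the live page may differ.
SAMPLE = """<html><body>
<div class="info"><a class="business-name" href="/alpha">Alpha Law</a></div>
<div class="info"><a class="business-name" href="/beta">Beta Legal</a></div>
</body></html>"""


def count_matches(content, expressions):
    """Map each XPath expression to the number of nodes it selects."""
    tree = html.fromstring(content)
    return {expr: len(tree.xpath(expr)) for expr in expressions}


counts = count_matches(SAMPLE, [
    '//div[@class="info"]/a/@href',
    '//a[@class="business-name"]/text()',
    '//span[@class="c411Phone"]/text()',
])
for expr, n in counts.items():
    print(n, expr)
```

An expression that reports zero matches does not fit the markup that was actually downloaded. It is also worth printing page.status_code to confirm that the request itself succeeded.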
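Answer 0 (score: 0)
Thank you, @Abd Azrad. Your solution helped me a lot.
Could you guide me further? I am confused about how to handle inconsistent data.
Sometimes the mailing address is missing, and sometimes the location is missing. I just want to ignore the records that don't meet my requirements.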
from lxml import html
import requests

# Build the search URL from the search term, the location and the page number
page = requests.get('http://www.yellowpages.com/search?search_terms=%s&geo_location_terms=%s&page=%s'%("lawyer","toronto","2"))
tree = html.fromstring(page.text)
bus_names=tree.xpath('//a[@class="business-name"]/text()')
print(bus_names)
##bus_url=tree.xpath('//a[@class="business-name"]/@href')
##print(bus_url)
street_ad=tree.xpath('//span[@class="street-address"]/text()')
print(street_ad)
loc=tree.xpath('//span[@class="locality"]/text()')
print(loc)
postal=tree.xpath('//span[@itemprop="postalCode"]/text()')
print(postal)
contact=tree.xpath('//div[@class="phones phone primary"]/text()')
print(contact)
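This way I get lists, but because the lists have different lengths I cannot keep track of which values belong together. Is there a way to get the data for each person in the list, with all of it in the form of a 2D list such as [[person_one_name, person_one_address], [person_two_name, person_two_contact]]?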
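One possible way to keep each listing's fields together is to select the container element of every listing first, then run relative XPath expressions (starting with `.//`) inside it, storing None for anything that is missing. This is a minimal sketch, not a tested solution for the live site: the SAMPLE markup, the container class "result", and the helper names first_text and extract_rows are assumptions made for illustration.

```python
from lxml import html

# Hypothetical sample markup; the container class "result" is an assumption.
SAMPLE = """<html><body>
<div class="result">
  <a class="business-name" href="/a">Alpha Law</a>
  <span class="street-address">1 Main St</span>
  <span class="locality">Toronto</span>
  <div class="phones phone primary">555-0100</div>
</div>
<div class="result">
  <a class="business-name" href="/b">Beta Legal</a>
  <div class="phones phone primary">555-0200</div>
</div>
</body></html>"""


def first_text(node, path):
    """Return the first stripped text match of a relative XPath, or None."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else None


def extract_rows(tree):
    """Build one [name, street, locality, phone] row per listing container."""
    rows = []
    for listing in tree.xpath('//div[@class="result"]'):
        rows.append([
            first_text(listing, './/a[@class="business-name"]/text()'),
            first_text(listing, './/span[@class="street-address"]/text()'),
            first_text(listing, './/span[@class="locality"]/text()'),
            first_text(listing, './/div[@class="phones phone primary"]/text()'),
        ])
    return rows


tree = html.fromstring(SAMPLE)
rows = extract_rows(tree)
# Keep only the rows in which every field is present
complete = [row for row in rows if all(row)]
print(rows)
print(complete)
```

Because every row comes from a single container, a missing address leaves a None in that row instead of shifting the values of later listings, and the final filter drops the incomplete rows.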
Answer 1 (score: -1)
from lxml import html
import requests
url = 'http://www.yellowpages.com/search?search_terms=lawyer&geo_location_terms=usa&page=2'
page = requests.get(url)
tree = html.fromstring(page.text)
tree.make_links_absolute(url)
# Print the absolute link and the name of every listing
for business in tree.xpath('//a[@class="business-name"]'):
    print(business.attrib['href'], business.text)
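The same loop can collect (link, name) pairs into a list instead of printing them. The sketch below runs on a made-up HTML string rather than the live page; SAMPLE and BASE_URL are assumptions for illustration only.

```python
from lxml import html

# Hypothetical base URL and markup, used so the example runs offline.
BASE_URL = 'http://www.yellowpages.com/search'
SAMPLE = """<html><body>
<a class="business-name" href="/alpha">Alpha Law</a>
<a class="business-name" href="/beta">Beta Legal</a>
</body></html>"""

tree = html.fromstring(SAMPLE)
# Resolve relative hrefs such as "/alpha" against the base URL
tree.make_links_absolute(BASE_URL)

pairs = [(a.get('href'), a.text)
         for a in tree.xpath('//a[@class="business-name"]')]
print(pairs)
```

Note that a.text only holds the text directly inside the tag, so it can be None when the name is wrapped in a nested element.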