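I'm having some trouble scraping with lxml. I just wrote code that works, but I have two problems.
I want each name and its address on the same line, with every entry on its own line, for example:
name1,address1
name2,address2
I also don't want any square brackets in the output.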
    import lxml.html as lh
    from selenium import webdriver

    browser = webdriver.Firefox()

    for cod in ("35211", "36116", "36542"):
        browser.get('http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip=' + cod)
        content = browser.page_source
        tree = lh.fromstring(content)
        name = tree.xpath('//tr/td/span[@class="largecol"]/text()')
        address = tree.xpath('//tr/td/span[@class="smallcol"]/text()')
        # Prints two whole lists per ZIP code, which is where the square brackets come from
        print(name, address)
Answer 0 (score: 1)
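You don't need lxml; selenium itself provides find_elements_by_xpath.
Use zip to pair each name with its address.
Open the text file and iterate over it to get the lines; use str.strip on each line to get the code.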
    from selenium import webdriver

    browser = webdriver.Firefox()
    url = 'http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip='

    # 1.txt holds one ZIP code per line
    with open('1.txt') as f:
        for line in f:
            cod = line.strip()
            browser.get(url + cod)
            # Selenium 4.3+ removed find_elements_by_xpath; there, use browser.find_elements(By.XPATH, ...)
            name = browser.find_elements_by_xpath('//tr/td/span[@class="largecol"]')
            address = browser.find_elements_by_xpath('//tr/td/span[@class="smallcol"]')
            name = [n for n in name if n.text.strip()]  # Remove empty names
            for n, a in zip(name, address):
                print(n.text, a.text)
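The loop above prints each pair separated by a space, while the question asks for `name,address`. A minimal sketch of one way to get that format, using the standard csv module so that addresses containing commas are quoted. The helper name `format_rows` and the sample data are hypothetical; in the real script the strings would come from `n.text` and `a.text`.

```python
import csv
import io

def format_rows(names, addresses):
    """Pair each name with its address and return CSV text, one entry per line."""
    # Drop blank names first, mirroring the filter in the answer's code.
    names = [n.strip() for n in names if n.strip()]
    addresses = [a.strip() for a in addresses]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # zip stops at the shorter list, so unmatched trailing items are dropped.
    for name, address in zip(names, addresses):
        writer.writerow([name, address])
    return buffer.getvalue()

# Hypothetical sample data standing in for the scraped text.
sample_names = ["Dealer One ", "  ", "Dealer Two"]
sample_addresses = ["1 Main St", "2 Oak Ave"]
print(format_rows(sample_names, sample_addresses), end="")
```

To write to a file instead, the same `csv.writer` can wrap a handle opened with `open('out.csv', 'w', newline='')`, where `out.csv` is again only an example name.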
If the data you want isn't generated by JavaScript, you can use lxml on its own.
    import lxml.html

    url = 'http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip='

    with open('1.txt') as f:
        for line in f:
            cod = line.strip()
            tree = lxml.html.parse(url + cod)
            name = tree.xpath('//tr/td/span[@class="largecol"]/text()')
            address = tree.xpath('//tr/td/span[@class="smallcol"]/text()')
            name = [n for n in name if n.strip()]
            for n, a in zip(name, address):
                print(n, a)
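Both versions read ZIP codes from a text file and append each one to the search URL. A minimal sketch of that step in isolation, with `io.StringIO` standing in for `open('1.txt')`. The helper name `build_urls` is hypothetical, and skipping blank lines is an addition that the answer's code does not make.

```python
import io

BASE_URL = ('http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx'
            '?productCategory=Office+Systems&sl_zip=')

def build_urls(lines, base_url=BASE_URL):
    """Return one search URL per non-blank line of ZIP codes."""
    urls = []
    for line in lines:
        cod = line.strip()  # remove the trailing newline and stray spaces
        if cod:             # skip blank lines (an addition, not in the answer)
            urls.append(base_url + cod)
    return urls

# io.StringIO stands in for the file; the ZIP codes are the ones from the question.
fake_file = io.StringIO("35211\n36116\n\n36542\n")
for url in build_urls(fake_file):
    print(url)
```

Without the `strip` call, every code would carry its trailing newline into the URL, which is why the answer strips each line before concatenating.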