Please help me with lxml

Time: 2014-09-14 07:54:57

Tags: python selenium web-scraping lxml

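I'm having some trouble scraping with lxml. I've just written code that works, but I still have a few issues:

  1. I want the name and address on the same line, with each entry on its own line, like this:

    name1,address1
    name2,address2

  2. I don't want any square brackets in the output.

  3. I have to enter 500 codes, so I'd like to import them from an external text/CSV file. How can I do that?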

    import lxml.html as lh
    from selenium import webdriver

    browser = webdriver.Firefox()

    for cod in ("35211", "36116", "36542"):
        browser.get('http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip=' + cod)
        content = browser.page_source
        tree = lh.fromstring(content)
        # Each xpath() call returns a list, which is why print() shows square brackets
        name = tree.xpath('//tr/td/span[@class="largecol"]/text()')
        address = tree.xpath('//tr/td/span[@class="smallcol"]/text()')
        print(name, address)

1 answer:

Answer 0 (score: 1)

You don't need lxml; selenium itself provides find_elements_by_xpath.

Use zip to pair each name with its address.
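As a minimal sketch of the pairing step on its own: the sample lists and the helper name `format_rows` below are hypothetical stand-ins for the scraped values, not part of the original code. Printing a whole list is what produces the square brackets; formatting each pair as a string avoids them.

```python
# Hypothetical sample data standing in for the scraped results.
names = ["name1", "  ", "name2"]
addresses = ["address1", "address2"]


def format_rows(names, addresses):
    """Pair each non-empty name with an address and return 'name,address' lines."""
    cleaned = [n.strip() for n in names if n.strip()]  # drop blank names
    # zip stops at the shorter list, so unmatched items are silently skipped
    return ["{},{}".format(n, a.strip()) for n, a in zip(cleaned, addresses)]


for row in format_rows(names, addresses):
    print(row)  # one entry per line, no brackets
```

Note that this assumes the names and addresses appear in the same order on the page, which is the same assumption zip makes in the answer's code.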

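Open the text file and iterate over it to get the lines; use str.strip to remove the surrounding whitespace and newline from each code.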
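The file-reading step can be sketched in isolation as follows. The helper name `read_codes` and the temporary demo file are assumptions made for illustration; in practice the path would point at the real file with one code per line.

```python
import os
import tempfile


def read_codes(path):
    """Return the non-blank lines of a text file with surrounding whitespace removed."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a temporary file that contains a blank line between codes.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("35211\n36116\n\n36542\n")

codes = read_codes(tmp.name)
os.remove(tmp.name)  # clean up the demo file
print(codes)
```

Skipping blank lines keeps a stray empty line in the file from triggering a request with an empty code.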


from selenium import webdriver

browser = webdriver.Firefox()
url = 'http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip='

with open('1.txt') as f:
    for line in f:
        cod = line.strip()
        browser.get(url+cod)
        name = browser.find_elements_by_xpath('//tr/td/span[@class="largecol"]')
        address = browser.find_elements_by_xpath('//tr/td/span[@class="smallcol"]')
        name = [n for n in name if n.text.strip()]  # Remove empty names
        for n, a in zip(name, address):
            print(n.text, a.text)

If the data you want doesn't depend on JavaScript, you can use lxml alone:

import lxml.html

url = 'http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx?productCategory=Office+Systems&sl_zip='

with open('1.txt') as f:
    for line in f:
        cod = line.strip()
        tree = lxml.html.parse(url+cod)
        name = tree.xpath('//tr/td/span[@class="largecol"]/text()')
        address = tree.xpath('//tr/td/span[@class="smallcol"]/text()')
        name = [n for n in name if n.strip()]
        for n, a in zip(name, address):
            print(n, a)
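Both snippets above build the request URL by concatenating the code onto a fixed string. A possible alternative, sketched here with the standard library's `urlencode`, builds the query string from its parts so that unexpected characters in a code are escaped. The names `BASE_URL` and `build_search_url` are hypothetical; the parameter names come from the URL in the question.

```python
from urllib.parse import urlencode

BASE_URL = 'http://kmbsapps.konicaminolta.us/wheretobuy/main_search.jspx'


def build_search_url(zip_code):
    """Build the search URL for one code, escaping the query parameters."""
    params = {'productCategory': 'Office Systems', 'sl_zip': zip_code}
    # urlencode encodes the space in 'Office Systems' as '+'
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('35211'))
```

For plain numeric codes this yields the same URL as `url+cod`, so it can be swapped into either loop without changing the requests.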