How do I extract table values using Python and lxml?

Asked: 2017-03-14 01:28:22

Tags: python-2.7 web-scraping lxml

I need to extract a list of IP addresses, port numbers, and other information from the HTML table below. I am currently using Python 2.7 with lxml, but I don't know how to work out the right paths to these elements.

Here is the address of the table: link to table

3 Answers:

Answer 0 (score: 0)

If the proxy.IP, proxy.PORT and proxy.country values always sit in the same td[n] cell positions, you can target them by specifying the position of td[n] within each tr row:

from lxml import html

webpage = html.parse('lxml_test.html')

ip = webpage.xpath('//tr[@class="ng-scope"]/td[2]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[3]/text()')
country = webpage.xpath('//tr[@class="ng-scope"]/td[4]/text()')

Alternatively, if you specifically want to match on the cell names:

ip = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.IP"]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.PORT"]/text()')
country = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.country"]/text()')
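
Either way, each xpath() call returns one column as a flat list. If you want one record per table row instead, you can zip the parallel lists together (a minimal sketch, assuming the three columns stay aligned):

# ip, port and country are parallel lists, one entry per row
for record in zip(ip, port, country):
    print('%s:%s (%s)' % record)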

Edit: to fetch the HTML from the live web page, use the requests module:

import requests
page = requests.get('https://hidester.com/proxylist/')
webpage = html.fromstring(page.content)
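
One caveat: the ng-scope and ng-bind attributes suggest the table is rendered client-side by AngularJS, so the static HTML returned by requests may not contain the rows at all. A quick sanity check (a sketch; Selenium is only a suggested fallback):

rows = webpage.xpath('//tr[@class="ng-scope"]')
if not rows:
    # nothing matched: the table is probably filled in by JavaScript,
    # so a browser-driven tool such as Selenium may be needed instead
    print('No rows found in the static HTML')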

Answer 1 (score: 0)

Assuming the table has multiple rows, you can locate each row and then extract the data from it.

import lxml.etree

doc = lxml.etree.parse('test.xml')

# We need to locate the <tr> objects somehow... I'm assuming
# there is a single <table><tbody>.. container and no other
# span/div tags in the way.

for tr in doc.xpath('//table/tbody[1]/tr'):
    proxy_ip = tr.xpath('td[@ng-bind="proxy.IP"]/text()')[0].strip()
    proxy_port = tr.xpath('td[@ng-bind="proxy.PORT"]/text()')[0].strip()
    proxy_country = tr.xpath('td[@ng-bind="proxy.country"]/text()')[0].strip()
    print(proxy_ip, proxy_port, proxy_country)
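
Note that indexing with [0] will raise an IndexError on any row that is missing one of those cells (a header row, for instance). A slightly more defensive variant of the loop, as a sketch:

for tr in doc.xpath('//table/tbody[1]/tr'):
    cells = [tr.xpath('td[@ng-bind="proxy.%s"]/text()' % name)
             for name in ('IP', 'PORT', 'country')]
    if all(cells):  # skip rows where any field is missing
        ip, port, country = (c[0].strip() for c in cells)
        print(ip, port, country)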

Answer 2 (score: 0)

BeautifulSoup will help here.

from bs4 import BeautifulSoup
import requests
import re
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/51.0.2704.103 Safari/537.36'}

url = str(raw_input("Enter URL: "))

req = requests.get(url, headers=header)  # the custom headers are optional if
                                         # the site doesn't require them
html = req.text                          # if you don't want to use requests,
                                         # the mechanize module is an alternative

soup = BeautifulSoup(html,'html.parser')

for ip in soup.findAll("td", {"ng-bind":"proxy.IP"}):
    print "IP:      ", ip.get_text()

for ip_p in soup.findAll("td", {"ng-bind":"proxy.PORT"}):
    print "PORT:    ", ip_p.get_text()

for ip_c in soup.findAll("td", {"ng-bind":"proxy.country"}):
    print "COUNTRY: ", ip_c.get_text()