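I need to extract the list of IP addresses and port numbers, along with the other information, from the HTML table below. I am currently using Python 2.7 with lxml, but I don't know how to find the correct path to these elements.
Here is the address of the table: link to table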
Answer 0 (score: 0)
If the proxy.IP, proxy.PORT and proxy.country values always sit at the same cell position [n], you can select them by giving the td[n] position within each tr row:
from lxml import html
webpage = html.parse('lxml_test.html')
# XPath positions are 1-based: td[2] is the second cell of each row.
ip = webpage.xpath('//tr[@class="ng-scope"]/td[2]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[3]/text()')
country = webpage.xpath('//tr[@class="ng-scope"]/td[4]/text()')
Or, if you would rather select the cells by their ng-bind name:
ip = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.IP"]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.PORT"]/text()')
country = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.country"]/text()')
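To check that both XPath styles select the same cells, here is a minimal sketch run against an inline sample. The markup, addresses and country names are made up for illustration; the real page's layout may differ, and the sketch assumes the IP cell is the second td in each row.

```python
from lxml import html

# Hypothetical sample markup; the real table may be laid out differently.
SAMPLE = """
<table>
  <tbody>
    <tr class="ng-scope">
      <td>1</td>
      <td ng-bind="proxy.IP">10.0.0.1</td>
      <td ng-bind="proxy.PORT">8080</td>
      <td ng-bind="proxy.country">Exampleland</td>
    </tr>
    <tr class="ng-scope">
      <td>2</td>
      <td ng-bind="proxy.IP">10.0.0.2</td>
      <td ng-bind="proxy.PORT">3128</td>
      <td ng-bind="proxy.country">Testonia</td>
    </tr>
  </tbody>
</table>
"""

doc = html.fromstring(SAMPLE)

# Positional selection: second cell of every matching row.
by_position = doc.xpath('//tr[@class="ng-scope"]/td[2]/text()')
# Attribute selection: the cell whose ng-bind names the field.
by_attribute = doc.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.IP"]/text()')

ports = doc.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.PORT"]/text()')
countries = doc.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.country"]/text()')

# Pair the three parallel lists into one (ip, port, country) tuple per row.
rows = list(zip(by_attribute, ports, countries))
for row in rows:
    print(row)
```

Zipping three separate lists only lines up correctly when every row has all three cells; if some rows may lack a cell, extracting row by row is safer.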
Edit: to fetch the HTML from the web page, use the requests module:
import requests
from lxml import html
page = requests.get('https://hidester.com/proxylist/')
webpage = html.fromstring(page.content)
Answer 1 (score: 0)
Assuming the table has multiple rows, you can locate each row and then extract the data from it.
import lxml.etree
doc = lxml.etree.parse('test.xml')
# We need to locate the <tr> objects somehow... I'm assuming
# there is a single <table><tbody>.. container and no other
# span/div tags in the way.
for tr in doc.xpath('//table/tbody[1]/tr'):
    proxy_ip = tr.xpath('td[@ng-bind="proxy.IP"]/text()')[0].strip()
    proxy_port = tr.xpath('td[@ng-bind="proxy.PORT"]/text()')[0].strip()
    proxy_country = tr.xpath('td[@ng-bind="proxy.country"]/text()')[0].strip()
    print(proxy_ip, proxy_port, proxy_country)
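Indexing with [0] raises IndexError as soon as a row lacks one of the cells. A possible variation on the row-by-row approach, sketched below, returns None for a missing cell instead. The helper name first_text and the sample markup are hypothetical, and the sketch parses with lxml.html rather than lxml.etree so that it tolerates HTML that is not well-formed XML.

```python
from lxml import html

# Hypothetical sample; the second row deliberately lacks a country cell.
SAMPLE = """
<table>
  <tbody>
    <tr>
      <td ng-bind="proxy.IP"> 10.0.0.1 </td>
      <td ng-bind="proxy.PORT"> 8080 </td>
      <td ng-bind="proxy.country"> Exampleland </td>
    </tr>
    <tr>
      <td ng-bind="proxy.IP">10.0.0.2</td>
      <td ng-bind="proxy.PORT">3128</td>
    </tr>
  </tbody>
</table>
"""


def first_text(row, binding):
    """Return the stripped text of the first td bound to `binding`, or None."""
    # $name is an XPath variable that lxml fills from the keyword argument.
    values = row.xpath('td[@ng-bind=$name]/text()', name=binding)
    return values[0].strip() if values else None


doc = html.fromstring(SAMPLE)
proxies = []
for tr in doc.xpath('//table/tbody[1]/tr'):
    proxies.append({
        'ip': first_text(tr, 'proxy.IP'),
        'port': first_text(tr, 'proxy.PORT'),
        'country': first_text(tr, 'proxy.country'),
    })

for proxy in proxies:
    print(proxy)
```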
Answer 2 (score: 0)
BeautifulSoup can help here.
from bs4 import BeautifulSoup
import requests

# Browser-like User-Agent header sent with the request.
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}
url = raw_input("Enter URL: ")
# If the site does not require a custom header, you can leave it out.
req = requests.get(url, headers=header)
# If you do not want to use requests, the mechanize module is an alternative.
html = req.text
soup = BeautifulSoup(html, 'html.parser')
# Match the cells by their ng-bind attribute.
for ip in soup.findAll("td", {"ng-bind": "proxy.IP"}):
    print "IP: ", ip.get_text()
for ip_p in soup.findAll("td", {"ng-bind": "proxy.PORT"}):
    print "PORT: ", ip_p.get_text()
for ip_c in soup.findAll("td", {"ng-bind": "proxy.country"}):
    print "COUNTRY: ", ip_c.get_text()
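The three loops above print all IPs, then all ports, then all countries, so the values belonging to one proxy are not kept together. A minimal sketch of a per-row variant follows; the inline markup is invented for illustration, and it assumes each proxy is one tr with class ng-scope.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may differ.
SAMPLE = """
<table>
  <tr class="ng-scope">
    <td ng-bind="proxy.IP">10.0.0.1</td>
    <td ng-bind="proxy.PORT">8080</td>
    <td ng-bind="proxy.country">Exampleland</td>
  </tr>
  <tr class="ng-scope">
    <td ng-bind="proxy.IP">10.0.0.2</td>
    <td ng-bind="proxy.PORT">3128</td>
    <td ng-bind="proxy.country">Testonia</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
proxies = []
for tr in soup.find_all('tr', class_='ng-scope'):
    record = {}
    for field in ('IP', 'PORT', 'country'):
        # Look the cell up inside this row only, so fields stay paired.
        td = tr.find('td', attrs={'ng-bind': 'proxy.' + field})
        record[field] = td.get_text(strip=True) if td is not None else None
    proxies.append(record)

for record in proxies:
    print(record)
```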