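I'm using lxml to parse information for a top-100 list from a site that tracks cryptocurrency prices (or similar figures) for the top 1000 coins. If one of the coins I'm tracking drops out of the top 100 and ends up on the second page, how do I add the second page to the tree? Link to my code: https://github.com/cbat971/CoinScraping/blob/master/WebCrawl.py
I tried making a "page2" variable, adding a "," inside the page variable, and adding a "+" inside the page variable.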
from lxml import html
import requests
import datetime
import time
page = requests.get('https://coinmarketcap.com/', 'https://coinmarketcap.com/2')
tree = html.fromstring(page.content)
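If all 100 coins on my list are on the first page, there is no problem. But as soon as one of them gets pushed to the second page, an error occurs and none of the coins end up being processed by the for statement.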
Answer 0 (score: 1)
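You could try concatenating the two HTML documents with page1.content + page2.content, but that doesn't work: lxml expects only one <html> and one <body>, so it parses only the first page and skips the rest.
Run the code below and you get only one <body>: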
from lxml import html
import requests
page1 = requests.get('https://coinmarketcap.com/')
page2 = requests.get('https://coinmarketcap.com/2')
tree = html.fromstring(page1.content + page2.content)
print(tree.cssselect('body'))
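You have to process each page separately (request it, parse it, and pull the values out of the HTML) and add the results to one list or dictionary.
This code gives two <body> elements, one per page: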
from lxml import html
import requests
for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))
EDIT:
from lxml import html
import requests
data = {
    'BTC': 'id-bitcoin',
    'TRX': 'id-tron',
    # ...
    'HC': 'id-hypercash',
    'XZC': 'id-zcoin',
}
all_results = {}
for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))
    for key, val in data.items():
        result = tree.xpath('//*[@id="' + val + '"]/td[4]/a/text()')
        print(key, result)
        if result:
            all_results[key] = result[0]
print('---')
print(all_results)
Result:
[<Element body at 0x7f6ba576cd68>]
BTC ['$6144.33']
TRX ['$0.023593']
HC []
XZC []
[<Element body at 0x7f6ba57fb4f8>]
BTC []
TRX []
HC ['$1.05']
XZC ['$6.25']
---
{'BTC': '$6144.33', 'TRX': '$0.023593', 'HC': '$1.05', 'XZC': '$6.25'}
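If you want to try the same per-page approach without hitting the site, here is a minimal sketch that runs on inline HTML strings. The sample markup, the helper names extract_prices and collect_prices, and the 'id-not-listed' entry are all made up for illustration; only the XPath pattern comes from the code above. It also reports which coins were not found on any page, which may help when a coin drops past the pages you fetch.

```python
from lxml import html

# Hypothetical stand-ins for the two downloaded pages. The row ids and the
# "4th cell contains a link with the price" layout mirror the XPath used
# above, but this markup is invented and far simpler than the real site.
PAGE_1 = """
<html><body><table>
<tr id="id-bitcoin"><td>1</td><td>Bitcoin</td><td>cap</td><td><a>$6144.33</a></td></tr>
</table></body></html>
"""
PAGE_2 = """
<html><body><table>
<tr id="id-hypercash"><td>101</td><td>HyperCash</td><td>cap</td><td><a>$1.05</a></td></tr>
</table></body></html>
"""


def extract_prices(page_html, coin_ids):
    """Parse ONE page and return {symbol: price_text} for coins found on it."""
    tree = html.fromstring(page_html)
    found = {}
    for symbol, row_id in coin_ids.items():
        result = tree.xpath('//*[@id="' + row_id + '"]/td[4]/a/text()')
        if result:  # an empty list means the coin is not on this page
            found[symbol] = result[0]
    return found


def collect_prices(pages, coin_ids):
    """Parse each page separately, merge the results, and list missing coins."""
    all_results = {}
    for page_html in pages:
        all_results.update(extract_prices(page_html, coin_ids))
    missing = sorted(set(coin_ids) - set(all_results))
    return all_results, missing


coins = {'BTC': 'id-bitcoin', 'HC': 'id-hypercash', 'XYZ': 'id-not-listed'}
results, missing = collect_prices([PAGE_1, PAGE_2], coins)
print(results)
print('not found on any page:', missing)
```

With real pages you would pass requests.get(url).content for each URL instead of the sample strings.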