Question

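I'm working on a Bitcoin price-checking exercise, and I'm having trouble scraping the data: the value I want sits inside a span with a class, and I don't know how to retrieve it.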

Here is the line I got from the browser's inspector:

 <span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>

I want to scrape the number "11,511.31". How do I do that? I've tried many different things and honestly don't know what else to try.

Here is the URL: link

I'm scraping the current USD price (right next to "BTC/USD").

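Edit: Many of the examples you gave have me typing in the data myself. That isn't very useful, because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data, and print it.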

Edit: Current code. The program needs to obtain the "html" by itself.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

url = 'https://www.gdax.com/trade/BTC-USD'
# The program needs to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""
soup = BeautifulSoup(html, "html.parser") 
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())

Answer 1

You just need to search for the right tag and class:

from bs4 import BeautifulSoup

html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""

html = BeautifulSoup(html_text, "lxml")

spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())

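This searches for all <span> tags and filters them by their class attribute, whose value in this case is MarketInfo_market-num_1lAXs. After filtering, loop over the spans, read each one's text with the .text attribute, and strip out 'USD'.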
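As a minimal sketch of the steps just described, the filtering and cleanup can be wrapped in one function that also converts the text to a number. The function name extract_usd_prices and the conversion to Decimal are my own additions, not part of the answer above, and the built-in "html.parser" is used so that lxml does not have to be installed.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Class name taken from the question; it may change whenever the site is rebuilt.
PRICE_CLASS = "MarketInfo_market-num_1lAXs"


def extract_usd_prices(html_text):
    """Return the numeric value of every matching <span> as a Decimal."""
    soup = BeautifulSoup(html_text, "html.parser")
    prices = []
    for span in soup.find_all("span", class_=PRICE_CLASS):
        # " 11,511.31 USD " -> "11,511.31" -> "11511.31"
        text = span.get_text().replace("USD", "").strip().replace(",", "")
        prices.append(Decimal(text))
    return prices


sample = '<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>'
print(extract_usd_prices(sample))  # [Decimal('11511.31')]
```

Removing the thousands separator before converting matters here, because Decimal("11,511.31") would raise an error.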

Update

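import requests

url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url, timeout=10)
res.raise_for_status()  # stop here if the server returned an HTTP error
json_res = res.json()   # the response body is a JSON list of trades
print(json_res[0]['price'])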

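There's no need to parse the HTML at all. The data in that HTML tag is filled in from an API call that returns a JSON response. You can call that API directly, which keeps your data up to date.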
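Since the question asks for a refresh every 30 seconds, here is one possible sketch of a polling loop around that API call. It uses only the standard library; the names fetch_trades and poll_prices, the rounds limit, and the injectable fetch and sleep parameters are my own assumptions for illustration. It has not been checked against the live endpoint, and it assumes the response is a JSON list whose first element has a 'price' key, as in the code above.

```python
import json
import time
from urllib.request import urlopen

# Endpoint copied from the answer above; assumed to return a JSON list of trades.
TRADES_URL = "https://api.gdax.com/products/BTC-USD/trades"


def fetch_trades(url=TRADES_URL):
    """Download and decode the JSON list of trades."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_prices(fetch=fetch_trades, interval=30, rounds=3, sleep=time.sleep):
    """Yield the price of the first listed trade, once per round.

    Waits `interval` seconds between rounds (not before the first one).
    """
    for i in range(rounds):
        if i:
            sleep(interval)
        yield fetch()[0]["price"]
```

With the defaults, `for price in poll_prices(): print(price)` would print three prices roughly 30 seconds apart. Passing in fetch and sleep makes it possible to try the loop without touching the network.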

Answer 2

You can use BeautifulSoup or lxml.

With BeautifulSoup, the code looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")

print(soup.string)

lxml is faster:

from lxml import etree

span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")

for i in span.xpath("//span/text()"):
    print(i)

Answer 3

Try a real browser, such as Firefox driven by Selenium. I tried Selenium with PhantomJS, but it failed...

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

url = 'https://www.gdax.com/trade/BTC-USD'

driver = webdriver.Firefox(executable_path='./geckodriver')

driver.get(url)
sleep(10) # Sleep 10 seconds while waiting for the page to load...

html = driver.page_source
soup = BeautifulSoup(html, "lxml") 
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())

driver.close()

Output:

11,493.00
+
3.06 %
13,432 BTC
[Finished in 15.0s]
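The output above shows that several elements on the page share the same class, so the loop prints more than just the price. One way to keep only the USD figure, sketched here with a hypothetical helper find_usd_price and sample markup modelled on that output (not copied from the live page), is to return the first matching span whose text ends in "USD":

```python
from bs4 import BeautifulSoup

# Class name taken from the question; it may change whenever the site is rebuilt.
PRICE_CLASS = "MarketInfo_market-num_1lAXs"


def find_usd_price(html_text):
    """Return the first matching span's text that ends in 'USD', minus the unit."""
    soup = BeautifulSoup(html_text, "html.parser")
    for span in soup.find_all("span", class_=PRICE_CLASS):
        text = span.get_text(strip=True)
        if text.endswith("USD"):
            return text[: -len("USD")].strip()
    return None  # no span with a USD value was found


# Hypothetical markup for illustration only.
page = """
<span class="MarketInfo_market-num_1lAXs">11,493.00 USD</span>
<span class="MarketInfo_market-num_1lAXs">+ 3.06 %</span>
<span class="MarketInfo_market-num_1lAXs">13,432 BTC</span>
"""
print(find_usd_price(page))  # 11,493.00
```

In the Selenium script, the same function could be called on driver.page_source in place of the find_all loop.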

How to scrape text inside a span class using Python
