Python 2.7 and Beautiful Soup 4 (bs4)

Date: 2018-01-27 00:46:50

Tags: python parsing beautifulsoup

I've run into another problem. I'm using Python 2.7 and can't switch to any newer version, and I'm using Beautiful Soup 4 with it.

Anyway, my question is: how can I extract the following data from every <tr>..</tr> section of the site's HTML?

<tr id="id-gainers-adtoken-1h">
<td class="text-right">
1
</td>
<td class="no-wrap currency-name">
<img src="https://files.coinmarketcap.com/static/img/coins/16x16/adtoken.png" class="currency-logo" alt="adToken">
<a href="/currencies/adtoken/">adToken</a>
</td>

<td class="text-left">ADT</td>

<td class="no-wrap text-right">
<a href="/currencies/adtoken/#markets" class="volume" data-usd="45657000.0" data-btc="4103.75">$45,657,000</a>
</td>

<td class="no-wrap text-right">
<a href="/currencies/adtoken/#markets" class="price" data-usd="0.198131" data-btc="1.78084e-05">$0.198131</a>
</td>

<td class="no-wrap percent-1h  positive_change  text-right" data-usd="36.36" data-btc="33.02">36.36%</td>
</tr>

I need the following data:

"adtoken" and "1h" from the first line
<tr id="id-gainers-adtoken-1h">

36.36% from
<td class="no-wrap percent-1h  positive_change  text-right" data-usd="36.36" data-btc="33.02">36.36%</td>

I want to collect all of these values in a list of dictionaries, like this:

biggest_gainers = [
{ "name": "adtoken", "timeframe": "1h", "gain": "36.36%" },
{ "name": "spectre-dividend", "timeframe": "1h", "gain": "34.34%" } ]

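So far, my code writes everything in the HTML that is matched by soup('tr') to a file. From there I can't figure out how to continue. I have tried several split approaches, removing indexes, and cutting the last n characters off each string with [:-3] after reading the file line by line.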

from bs4 import BeautifulSoup as bs
import urllib2
from time import sleep

url = urllib2.urlopen('https://coinmarketcap.com/gainers-losers/')
soup = bs(url)


print(soup)
with open('somefile.txt', 'a') as f:

    for item in soup('tr'):
        f.write(str(item))
    f.close()

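I'm fairly sure my general approach is completely wrong, since I shouldn't need to write the HTML to a file first and then parse that file.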

Any ideas are much appreciated.

2 answers:

Answer 0 (score: 0)

Don't use BeautifulSoup to parse the HTML from CoinMarketCap. They have an API!

https://api.coinmarketcap.com/v1/ticker/

The simplest way to load it is:

import requests

requests.get('https://api.coinmarketcap.com/v1/ticker/').json()
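To get from that JSON to the list the question asks for, something like the sketch below might work. It assumes each entry in the response is a dict with an "id" key and string-valued "percent_change_1h", "percent_change_24h" and "percent_change_7d" keys; that shape is an assumption about the v1 ticker format, so check it against a real response. The helper name top_gainers and the sample values are made up for illustration, and the ranking may not match the site's gainers page if that page applies extra filters.

```python
def top_gainers(tickers, field="percent_change_1h", limit=10):
    """Return the `limit` entries with the largest value in `field`.

    `tickers` is assumed to be the list of dicts returned by the ticker
    endpoint, with percentage changes stored as strings (or None).
    """
    timeframe = field.rsplit("_", 1)[-1]  # "percent_change_1h" -> "1h"
    rows = []
    for entry in tickers:
        raw = entry.get(field)
        if raw is None:  # skip coins that have no value for this timeframe
            continue
        rows.append({"name": entry["id"],
                     "timeframe": timeframe,
                     "gain": float(raw)})
    rows.sort(key=lambda row: row["gain"], reverse=True)
    return rows[:limit]


# Hand-written stand-in for the API response; the numbers are illustrative only.
sample_tickers = [
    {"id": "bitcoin", "percent_change_1h": "0.5"},
    {"id": "adtoken", "percent_change_1h": "36.36"},
    {"id": "newcoin", "percent_change_1h": None},
]

for row in top_gainers(sample_tickers, limit=2):
    print(row)
```

With live data, sample_tickers would be replaced by the result of the requests.get(...).json() call above.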

Answer 1 (score: 0)

Figured it out. For anyone trying to achieve the same thing, the code is below:

from bs4 import BeautifulSoup as bs
import urllib2

url = urllib2.urlopen('https://coinmarketcap.com/gainers-losers/')
soup = bs(url, "html.parser")


def to_float(text, symbol):
    # Remove the "$" or "%" sign and any thousands separators before converting.
    return float(text.replace(symbol, "").replace(",", ""))


def biggest_gainers(timeframe):
    # Each timeframe has its own tab pane, e.g. <div id="gainers-1h" class="tab-pane">.
    table = soup.find("div", attrs={"id": "gainers-" + timeframe, "class": "tab-pane"})
    gainers = []
    # Skip the header row. The cells are: rank, name, symbol, volume, price, percent.
    for row in table.find_all("tr")[1:]:
        cells = [td.get_text().strip() for td in row.find_all("td")]
        gainers.append({
            "Name": cells[1],
            "Symbol": cells[2],
            "Volume": cells[3].replace("$", ""),
            "Price": to_float(cells[4], "$"),
            "Percentage": to_float(cells[5], "%"),
        })
    return gainers


biggest_gainers_1h = biggest_gainers("1h")
biggest_gainers_24h = biggest_gainers("24h")
biggest_gainers_7d = biggest_gainers("7d")

for gainers in (biggest_gainers_1h, biggest_gainers_24h, biggest_gainers_7d):
    print("================================================================")
    for item in gainers:
        print(item)
print("================================================================")
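For the exact output format in the question (name and timeframe taken from the row id, plus the gain as text), a minimal sketch along these lines might be enough. It assumes every relevant row has an id of the form id-gainers-<name>-<timeframe> and a td whose class list contains a percent-... class, as in the sample row from the question. The helper names parse_row_id and extract_gainers are hypothetical, and SAMPLE_HTML is a trimmed stand-in for the real page so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page: the first row is cut down from the question,
# the second is reduced to the cells this sketch actually reads.
SAMPLE_HTML = """
<table>
<tr id="id-gainers-adtoken-1h">
<td class="text-right">1</td>
<td class="no-wrap currency-name"><a href="/currencies/adtoken/">adToken</a></td>
<td class="text-left">ADT</td>
<td class="no-wrap percent-1h  positive_change  text-right" data-usd="36.36" data-btc="33.02">36.36%</td>
</tr>
<tr id="id-gainers-spectre-dividend-1h">
<td class="text-right">2</td>
<td class="no-wrap percent-1h  positive_change  text-right">34.34%</td>
</tr>
</table>
"""

ID_PREFIX = "id-gainers-"  # assumed prefix, taken from the sample row id


def parse_row_id(row_id):
    """Split 'id-gainers-<name>-<timeframe>' into (name, timeframe), or None."""
    if not row_id.startswith(ID_PREFIX):
        return None
    # Coin names can contain hyphens themselves, so split from the right.
    name, _, timeframe = row_id[len(ID_PREFIX):].rpartition("-")
    if not name:
        return None
    return name, timeframe


def is_percent_class(css_class):
    # Beautiful Soup calls this once per CSS class on each candidate tag.
    return css_class is not None and css_class.startswith("percent-")


def extract_gainers(html):
    """Build the list of {"name", "timeframe", "gain"} dicts from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    gainers = []
    for row in soup.find_all("tr", id=True):  # only rows that carry an id
        parsed = parse_row_id(row["id"])
        cell = row.find("td", class_=is_percent_class)
        if parsed is None or cell is None:
            continue
        name, timeframe = parsed
        gainers.append({"name": name,
                        "timeframe": timeframe,
                        "gain": cell.get_text(strip=True)})
    return gainers


biggest_gainers = extract_gainers(SAMPLE_HTML)
for item in biggest_gainers:
    print(item)
```

Against the live page, the same function could be given the response from urllib2.urlopen(...) instead of SAMPLE_HTML, so nothing has to be written to a file first.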