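I'm reaching out with a very specific problem that I can't find a solution for. I call it specific because my approach works on many other websites, but not on the one I'm interested in.
I basically want to extract the values from an HTML table. If I follow the examples I've read over and over, I end up with this code, which works: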
import requests
from bs4 import BeautifulSoup


def fDf_htmlGetArray():
    # Example URL found on the internet
    str_url = "http://www.data.jma.go.jp/obd/stats/etrn/view/monthly_s3_en.php?block_no=47401&view=1"
    d_headers = {'User-Agent': 'Mozilla/5.0'}
    o_page = requests.get(str_url, headers=d_headers)
    bs_soup = BeautifulSoup(o_page.content, "html.parser")
    # I also tried the lxml and html5lib parsers, without success
    for o_table in bs_soup.find_all('table'):
        print('-----table------')
        for o_row in o_table.find_all('tr'):
            print('-----row------')
            print(o_row)
            # Header cells of the current row
            o_ths = [o_th.text.strip() for o_th in o_row.find_all('th')]
            if o_ths:
                print('-----th------')
                print(o_ths)
            # Data cells of the current row
            o_cells = [o_cell.text.strip() for o_cell in o_row.find_all('td')]
            if o_cells:
                print('-----cell------')
                print(o_cells)


fDf_htmlGetArray()
But the URL I actually want is:
str_url = "https://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_quotes_settlements_futures.html"
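When I open the page and inspect it, I see:
<tr class>
    <th scope="row">MAR19</th>
    <td>53.35</td>
    <td>...</td>
</tr>
Obviously I want the td data. But when I run the request, I only get the column names and nothing from the td cells: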
-----table------
-----row------
<tr>
<th class="cmeSettlementsFuturesMonth" scope="col">Month</th>
<th class="cmeSettlementsFuturesOpen" scope="col">Open</th>
<th class="cmeSettlementsFuturesHigh" scope="col">High</th>
<th class="cmeSettlementsFuturesLow" scope="col">Low</th>
<th class="cmeSettlementsFuturesLast" scope="col">Last</th>
<th class="cmeSettlementsFuturesChange" scope="col">Change</th>
<th class="cmeSettlementsFuturesSettle" scope="col">Settle</th>
<th class="cmeSettlementsFuturesEstimatedVolume" scope="col">Estimated Volume</th>
<th class="cmeSettlementsFuturesPriorDayOpenInterest" scope="col">Prior Day Open Interest</th>
</tr>
-----th------
['Month', 'Open', 'High', 'Low', 'Last', 'Change', 'Settle', 'Estimated Volume', 'Prior Day Open Interest']
-----row------
<tr>
<td class="cmeTableFoot" colspan="9">
<ul class="cmeLegend">
<li class="cmeSupportingLinks cmeSupportingLinkIcon cmeAboutListIcon"><a href="../../about-settlements.html" rel="popup"><span>About This Report</span></a></li>
</ul>
</td>
</tr>
-----cell------
['About This Report']
I've tried a lot of other code and I don't know how to solve this.
You are my last hope.
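Answer 0 (score: 1)
You don't need Selenium. The reason you don't see the td tags is that they are populated through an AJAX call. All you have to do is make that call yourself with requests.
When the page loads, we can find this AJAX call in the Network tab of the browser's inspection tools. We can see that the data is returned as JSON.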
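import requests
import pandas as pd

# Trade date in MM/DD/YYYY format
date = '02/14/2019'
# AJAX endpoint found in the browser's Network tab; it returns JSON
url = f'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate={date}'
r = requests.get(url)
# The table rows are stored under the 'settlements' key
df = pd.DataFrame(r.json()['settlements'])
print(df)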
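To request a different trade date, the URL can be built from a date object instead of a hand-typed string. This is only a minimal sketch: the endpoint and the product id 425 come from the snippet above, while the helper name build_settlements_url and its product_id parameter are made up for illustration, and the endpoint may not accept every date or may change over time.

```python
from datetime import date

# Endpoint copied from the answer above; 425 is the product id it uses.
BASE_URL = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/{product_id}/FUT"


def build_settlements_url(trade_date, product_id=425):
    """Build the request URL (hypothetical helper); the date goes in as MM/DD/YYYY."""
    base = BASE_URL.format(product_id=product_id)
    return f"{base}?tradeDate={trade_date:%m/%d/%Y}"


print(build_settlements_url(date(2019, 2, 14)))
```

The resulting string can be passed to requests.get exactly like url in the snippet above.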
Output:
change high last low month open openInterest settle volume
0 UNCH 54.68 53.58 53.57 MAR 19 53.96 163,621 - 247,818
1 UNCH 55.08 54.02 54.01 APR 19 54.31 346,296 - 95,896
2 UNCH 55.67 54.66 54.64A MAY 19 54.95 230,913 - 39,197
3 UNCH 56.21B 55.26 55.24 JUN 19 55.43 252,325 - 34,879
4 UNCH 56.64B 55.75 55.75 JLY 19 55.93 136,107 - 14,379
5 UNCH 56.97B 56.17 56.17 AUG 19 56.33 80,681 - 5,624
6 UNCH 57.24 56.46B 56.43 SEP 19 56.58 96,819 - 6,291
7 UNCH 57.35 56.62 56.60 OCT 19 56.78 65,096 - 2,180
8 UNCH 57.45B 56.69 56.65 NOV 19 56.90 51,416 - 2,466
9 UNCH 57.46 56.64 56.62 DEC 19 56.75 189,518 - 15,807
10 UNCH 57.31B 56.84A 56.75 JAN 20 56.75 45,573 - 671
11 UNCH 57.33B 56.78 56.78 FEB 20 57.20 23,674 - 403
12 UNCH 56.69B 56.69B - MAR 20 - 50,453 - 744
13 UNCH 56.58B 56.58B - APR 20 - 11,320 - 46
14 UNCH - - - MAY 20 - 10,906 - 29
15 UNCH 56.85B 56.10 56.09A JUN 20 56.45 60,037 - 2,562
16 UNCH 56.19B 55.98 55.98 JLY 20 55.98 9,515 - 26
17 UNCH - - - AUG 20 - 6,674 - 22
18 UNCH 55.99B 55.99B - SEP 20 - 19,131 - 44
19 UNCH 55.86B 55.86B - OCT 20 - 8,727 - 0
20 UNCH 55.77B 55.77B - NOV 20 - 7,518 - 0
21 UNCH 56.19B 55.39 55.39 DEC 20 55.75 103,630 - 4,972
22 UNCH - - - JAN 21 - 8,169 - 0
23 UNCH - - - FEB 21 - 3,172 - 0
24 UNCH - - - MAR 21 - 3,816 - 0
25 UNCH - - - APR 21 - 4,837 - 0
26 UNCH - - - MAY 21 - 2,702 - 0
27 UNCH 55.39 54.76 54.76 JUN 21 55.22 16,767 - 75
28 UNCH - - - JLY 21 - 4,621 - 0
29 UNCH - - - AUG 21 - 2,641 - 0
.. ... ... ... ... ... ... ... ... ...
79 UNCH - - - OCT 25 - 0 - 0
80 UNCH - - - NOV 25 - 0 - 0
81 UNCH - - - DEC 25 - 201 - 0
82 UNCH - - - JAN 26 - 0 - 0
83 UNCH - - - FEB 26 - 0 - 0
84 UNCH - - - MAR 26 - 0 - 0
85 UNCH - - - APR 26 - 0 - 0
86 UNCH - - - MAY 26 - 0 - 0
87 UNCH - - - JUN 26 - 0 - 0
88 UNCH - - - JLY 26 - 0 - 0
89 UNCH - - - AUG 26 - 0 - 0
90 UNCH - - - SEP 26 - 0 - 0
91 UNCH - - - OCT 26 - 0 - 0
92 UNCH - - - NOV 26 - 0 - 0
93 UNCH - - - DEC 26 - 7 - 0
94 UNCH - - - JAN 27 - 0 - 0
95 UNCH - - - FEB 27 - 0 - 0
96 UNCH - - - MAR 27 - 0 - 0
97 UNCH - - - APR 27 - 0 - 0
98 UNCH - - - MAY 27 - 0 - 0
99 UNCH - - - JUN 27 - 0 - 0
100 UNCH - - - JLY 27 - 0 - 0
101 UNCH - - - AUG 27 - 0 - 0
102 UNCH - - - SEP 27 - 0 - 0
103 UNCH - - - OCT 27 - 0 - 0
104 UNCH - - - NOV 27 - 0 - 0
105 UNCH - - - DEC 27 - 0 - 0
106 UNCH - - - JAN 28 - 0 - 0
107 UNCH - - - FEB 28 - 0 - 0
108 Total 2,074,304 475,282
[109 rows x 9 columns]
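As the output shows, every value comes back as a string: numbers use thousands separators ("163,621"), missing values appear as "-", and some prices carry a trailing letter ("54.64A", "56.21B"). If you need numbers, something like the sketch below could be a starting point. The helper name parse_value is hypothetical, and since the meaning of the trailing letters is not stated here, the sketch simply separates them from the number.

```python
import re

# Digits with optional thousands separators and decimals, then an optional flag letter.
_VALUE = re.compile(r"^(-?[\d,]*\.?\d+)([A-Z]?)$")


def parse_value(text):
    """Split a cell such as '54.64A' or '163,621' into (number, flag).

    Returns (None, '') for placeholders such as '-' or 'UNCH'.
    """
    match = _VALUE.match(text.strip())
    if match is None:
        return None, ""
    number, flag = match.groups()
    return float(number.replace(",", "")), flag


print(parse_value("54.64A"))
print(parse_value("163,621"))
print(parse_value("-"))
```

With the DataFrame from above, a column could then be converted with, for example, df["volume"].map(lambda s: parse_value(s)[0]).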
Answer 1 (score: 0)
Try this:
from selenium import webdriver
import pandas as pd

# Let a real browser run the page's JavaScript
browser = webdriver.Firefox()
browser.get('https://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_quotes_settlements_futures.html')
# HTML of the page as rendered by the browser
page = browser.page_source
# Parse every table in the rendered HTML into a list of DataFrames
cme_tables = pd.read_html(page)
You need to use Selenium because the prices are generated dynamically with JavaScript. Dynamically generated content cannot be scraped with tools that only download the page's HTML, or with pandas' read_html method on its own.
Here is a link to the Selenium Python API.
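pd.read_html returns a list with one DataFrame per table found on the page, so you still have to pick out the settlements table. A minimal sketch of one way to do that, assuming the rendered table keeps the column headers shown in the question (Month, Settle); the helper name pick_table is made up for illustration:

```python
def pick_table(tables, required_columns):
    """Return the first table whose columns include every required name, else None."""
    wanted = set(required_columns)
    for table in tables:
        # Works for any object with a .columns attribute, e.g. a pandas DataFrame
        if wanted <= {str(column) for column in table.columns}:
            return table
    return None
```

With the list from above this would be called as pick_table(cme_tables, ["Month", "Settle"]). A None result means no matching table was in the page source, which can happen if the rows had not been loaded yet when page_source was read.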