Python requests HTML, but the full table is missing from the page - url = cmegroup

Time: 2019-02-14 02:07:58

Tags: python html-table beautifulsoup request

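I am coming to you with a very specific problem that I am struggling to solve. I think it is very specific because my approach works on many other websites, but not on the one I am interested in.

I basically want to get the values out of a table. Following the examples I have read over and over, I end up with this kind of code, which works: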

import requests
from bs4 import BeautifulSoup

def fDf_htmlGetArray():
    # Example of URL found on internet
    str_url = "http://www.data.jma.go.jp/obd/stats/etrn/view/monthly_s3_en.php?block_no=47401&view=1"

    d_headers = {'User-Agent': 'Mozilla/5.0'}
    o_page = requests.get(str_url, headers=d_headers)
    bs_soup = BeautifulSoup(o_page.content, "html.parser")
    # I tried as well without success: # lxml   # html5lib

    for o_table in bs_soup.find_all('table'):
        print('-----table------')
        for o_row in o_table.find_all('tr'):
            print('-----row------')
            print(o_row)
            # Header cells of the row, if any
            o_ths = [o_th.text.strip() for o_th in o_row.find_all('th')]
            if o_ths:
                print('-----th------')
                print(o_ths)
            # Data cells of the row, if any
            o_cells = [o_cell.text.strip() for o_cell in o_row.find_all('td')]
            if o_cells:
                print('-----cell------')
                print(o_cells)

fDf_htmlGetArray()

But the URL I want is:

str_url = "https://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_quotes_settlements_futures.html"

When I open the web page and inspect it, I get:

<tr class>
 <th scope="row">MAR19</th>
 <td>53.35</td>
 <td>...</td>
</tr>

I obviously want to get the td data. But when I make the request, I only get the column names and nothing from the td cells:

-----table------
-----row------
<tr>
<th class="cmeSettlementsFuturesMonth" scope="col">Month</th>
<th class="cmeSettlementsFuturesOpen" scope="col">Open</th>
<th class="cmeSettlementsFuturesHigh" scope="col">High</th>
<th class="cmeSettlementsFuturesLow" scope="col">Low</th>
<th class="cmeSettlementsFuturesLast" scope="col">Last</th>
<th class="cmeSettlementsFuturesChange" scope="col">Change</th>
<th class="cmeSettlementsFuturesSettle" scope="col">Settle</th>
<th class="cmeSettlementsFuturesEstimatedVolume" scope="col">Estimated Volume</th>
<th class="cmeSettlementsFuturesPriorDayOpenInterest" scope="col">Prior Day Open Interest</th>
</tr>
-----th------
['Month', 'Open', 'High', 'Low', 'Last', 'Change', 'Settle', 'Estimated Volume', 'Prior Day Open Interest']
-----row------
<tr>
<td class="cmeTableFoot" colspan="9">
<ul class="cmeLegend">
<li class="cmeSupportingLinks cmeSupportingLinkIcon cmeAboutListIcon"><a href="../../about-settlements.html" rel="popup"><span>About This Report</span></a></li>
</ul>
</td>
</tr>
-----cell------
['About This Report']

I have tried many other pieces of code. I don't know how to solve this.

You are my last hope.

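2 answers:

Answer 0: (score: 1)

You don't need Selenium. The reason you don't see the td tags is that they are populated through an Ajax call. All you have to do is make that call yourself with requests.

When the page loads, we can find this Ajax call in the Network tab of the browser's inspection tools. We can see that the data is returned as JSON.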

import requests
import pandas as pd

# Trade date in MM/DD/YYYY format, as used by the Ajax endpoint
date = '02/14/2019'
url = f'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate={date}'
r = requests.get(url)
# The rows of the table are under the 'settlements' key of the JSON payload
df = pd.DataFrame(r.json()['settlements'])
print(df)

Output

    change    high    last     low   month   open openInterest settle   volume
0     UNCH   54.68   53.58   53.57  MAR 19  53.96      163,621      -  247,818
1     UNCH   55.08   54.02   54.01  APR 19  54.31      346,296      -   95,896
2     UNCH   55.67   54.66  54.64A  MAY 19  54.95      230,913      -   39,197
3     UNCH  56.21B   55.26   55.24  JUN 19  55.43      252,325      -   34,879
4     UNCH  56.64B   55.75   55.75  JLY 19  55.93      136,107      -   14,379
5     UNCH  56.97B   56.17   56.17  AUG 19  56.33       80,681      -    5,624
6     UNCH   57.24  56.46B   56.43  SEP 19  56.58       96,819      -    6,291
7     UNCH   57.35   56.62   56.60  OCT 19  56.78       65,096      -    2,180
8     UNCH  57.45B   56.69   56.65  NOV 19  56.90       51,416      -    2,466
9     UNCH   57.46   56.64   56.62  DEC 19  56.75      189,518      -   15,807
10    UNCH  57.31B  56.84A   56.75  JAN 20  56.75       45,573      -      671
11    UNCH  57.33B   56.78   56.78  FEB 20  57.20       23,674      -      403
12    UNCH  56.69B  56.69B       -  MAR 20      -       50,453      -      744
13    UNCH  56.58B  56.58B       -  APR 20      -       11,320      -       46
14    UNCH       -       -       -  MAY 20      -       10,906      -       29
15    UNCH  56.85B   56.10  56.09A  JUN 20  56.45       60,037      -    2,562
16    UNCH  56.19B   55.98   55.98  JLY 20  55.98        9,515      -       26
17    UNCH       -       -       -  AUG 20      -        6,674      -       22
18    UNCH  55.99B  55.99B       -  SEP 20      -       19,131      -       44
19    UNCH  55.86B  55.86B       -  OCT 20      -        8,727      -        0
20    UNCH  55.77B  55.77B       -  NOV 20      -        7,518      -        0
21    UNCH  56.19B   55.39   55.39  DEC 20  55.75      103,630      -    4,972
22    UNCH       -       -       -  JAN 21      -        8,169      -        0
23    UNCH       -       -       -  FEB 21      -        3,172      -        0
24    UNCH       -       -       -  MAR 21      -        3,816      -        0
25    UNCH       -       -       -  APR 21      -        4,837      -        0
26    UNCH       -       -       -  MAY 21      -        2,702      -        0
27    UNCH   55.39   54.76   54.76  JUN 21  55.22       16,767      -       75
28    UNCH       -       -       -  JLY 21      -        4,621      -        0
29    UNCH       -       -       -  AUG 21      -        2,641      -        0
..     ...     ...     ...     ...     ...    ...          ...    ...      ...
79    UNCH       -       -       -  OCT 25      -            0      -        0
80    UNCH       -       -       -  NOV 25      -            0      -        0
81    UNCH       -       -       -  DEC 25      -          201      -        0
82    UNCH       -       -       -  JAN 26      -            0      -        0
83    UNCH       -       -       -  FEB 26      -            0      -        0
84    UNCH       -       -       -  MAR 26      -            0      -        0
85    UNCH       -       -       -  APR 26      -            0      -        0
86    UNCH       -       -       -  MAY 26      -            0      -        0
87    UNCH       -       -       -  JUN 26      -            0      -        0
88    UNCH       -       -       -  JLY 26      -            0      -        0
89    UNCH       -       -       -  AUG 26      -            0      -        0
90    UNCH       -       -       -  SEP 26      -            0      -        0
91    UNCH       -       -       -  OCT 26      -            0      -        0
92    UNCH       -       -       -  NOV 26      -            0      -        0
93    UNCH       -       -       -  DEC 26      -            7      -        0
94    UNCH       -       -       -  JAN 27      -            0      -        0
95    UNCH       -       -       -  FEB 27      -            0      -        0
96    UNCH       -       -       -  MAR 27      -            0      -        0
97    UNCH       -       -       -  APR 27      -            0      -        0
98    UNCH       -       -       -  MAY 27      -            0      -        0
99    UNCH       -       -       -  JUN 27      -            0      -        0
100   UNCH       -       -       -  JLY 27      -            0      -        0
101   UNCH       -       -       -  AUG 27      -            0      -        0
102   UNCH       -       -       -  SEP 27      -            0      -        0
103   UNCH       -       -       -  OCT 27      -            0      -        0
104   UNCH       -       -       -  NOV 27      -            0      -        0
105   UNCH       -       -       -  DEC 27      -            0      -        0
106   UNCH       -       -       -  JAN 28      -            0      -        0
107   UNCH       -       -       -  FEB 28      -            0      -        0
108                                  Total           2,074,304         475,282

[109 rows x 9 columns]
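
In the output above every column holds strings: volumes contain thousands separators ("163,621"), some prices carry a trailing A or B flag ("54.64A", "56.21B"), and missing values show up as "-". If numbers are needed, a small helper along the following lines could convert them. This is only a sketch: the name parse_settlement_value is made up for illustration, and since the meaning of the A/B flags is not stated in this thread, the helper simply drops them.

```python
from typing import Optional


def parse_settlement_value(raw: str) -> Optional[float]:
    """Convert a string such as '54.64A', '163,621' or '-' to a float.

    Returns None for anything that is not numeric ('-', 'UNCH', '').
    Hypothetical helper, not part of requests, pandas or the CME endpoint.
    """
    # Remove surrounding whitespace and thousands separators
    text = raw.strip().replace(",", "")
    # Drop a trailing A/B flag (assumption: the flag is not needed downstream)
    if text and text[-1] in ("A", "B"):
        text = text[:-1]
    try:
        return float(text)
    except ValueError:
        return None


# Example row using values copied from the output above (MAY 19 contract)
row = {"month": "MAY 19", "low": "54.64A", "openInterest": "230,913", "settle": "-"}
cleaned = {
    key: (value if key == "month" else parse_settlement_value(value))
    for key, value in row.items()
}
print(cleaned)
```

With the DataFrame from the answer, the same function could be applied column by column, for example df['low'].map(parse_settlement_value).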

Answer 1: (score: 0)

Try this:

from selenium import webdriver
import pandas as pd

browser = webdriver.Firefox()
browser.get('https://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_quotes_settlements_futures.html')
page = browser.page_source
cme_tables = pd.read_html(page)

You need to use Selenium because the prices are generated dynamically with JavaScript. Dynamically generated content cannot be scraped with tools such as requests or pandas' read_html method.

Here is a link to the Selenium Python API.