Question

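I am trying to scrape the table at http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts and return the Company Name, Code and industry into a list, for all 15 of its pages.

I have been trying to work with lxml.html, xpath and BeautifulSoup to get this information, but I am stuck.

I realise that the information seems to be embedded in the website via the #html, but I am not sure how to build my module to retrieve it.

Any ideas? Or should I be using a different module/technique?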

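Edit

I found that this link is embedded in the website, and that it contains the #html I mentioned earlier: http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts

When I try to extract the data with BeautifulSoup: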

import requests
from bs4 import BeautifulSoup

r = requests.get('http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts')
wb = BeautifulSoup(r.text, "html.parser")
print(wb.find_all('div', attrs={'class': 'table-wrapper results-display'}))

It returns the following:

[<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>]

But this differs from what the website shows. Any ideas?

Answer 1

You may want to tackle this a different way.
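As background on why the BeautifulSoup attempt comes back with an empty table: everything after the `#` in a URL is a fragment, which the client keeps to itself and never sends to the server. The request therefore fetches only the bare index.html, and the rows are presumably filled in later by JavaScript in the browser (an assumption, but consistent with the empty `<tbody>`). A small sketch using only the standard library to show what is actually requested:

```python
from urllib.parse import urldefrag, unquote

page = ("http://sgx.wealthmsi.com/index.html"
        "#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome"
        "%2Fcompany_disclosure%2Fstockfacts")

# Split the URL into the part sent to the server and the client-side fragment.
requested, fragment = urldefrag(page)

print(requested)          # the only part the server ever sees
print(unquote(fragment))  # the percent-decoded fragment, used only in the browser
```

Since the server never sees the fragment, no HTML parser can recover the table from that response; the data has to come from whatever request the page's script makes.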

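By looking at the server calls (Chrome -> F12 -> Network tab), you can work out which URL should actually be called to get a JSON response. Apparently you can use a URL that starts like this: http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search?callback=json&json= ???? (you will need to do some reverse engineering to figure out the actual JSON query, but it does not look too hard). Sorry, I did not get any further into the JSON query, but I hope this helps you move forward :)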

Note: my answer is based on the URL

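#!/usr/bin/env python
import json

import requests

url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"

# Key/value pairs defining your actual query to the server.
# You need to figure these out yourself, depending on the data you want
# to retrieve: look at Chrome's network tab (F12), find the URL that
# queries for the data, and reverse engineer its key/value pairs.
query = {}

params = {
    'callback': 'json',
    # requests does not serialize a nested dict, so send it as a JSON string
    'json': json.dumps(query),
}

response = requests.get(url, params=params)
# If the server wraps the payload in the callback (JSONP), response.json()
# will raise; strip the wrapper from response.text first in that case.
print(response.json())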
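Because the URL carries `callback=json`, the body may well come back as JSONP, i.e. `json({...});` rather than bare JSON. That is an assumption about this endpoint, not something confirmed here. A minimal sketch of unwrapping such a body and picking out the three requested columns follows; the key names `companies`, `companyName`, `tickerCode` and `industry` are hypothetical placeholders, to be replaced with whatever the Network tab actually shows.

```python
import json
import re

# Matches an optional JSONP wrapper such as  json({...});
_JSONP = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Decode a JSONP body; fall back to treating the text as plain JSON."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def extract_rows(data, list_key="companies",
                 fields=("companyName", "tickerCode", "industry")):
    """Collect the wanted fields from each record (key names are hypothetical)."""
    return [[record.get(field) for field in fields]
            for record in data.get(list_key, [])]


# Made-up sample body, only to exercise the two helpers offline.
sample = ('json({"companies": [{"companyName": "Example Ltd", '
          '"tickerCode": "E01", "industry": "Banks"}]});')
rows = extract_rows(unwrap_jsonp(sample))
print(rows)
```

With the real service you would pass `response.text` to `unwrap_jsonp` and repeat the request once per page, extending one list with the rows from each call.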
