Question

I'm new to web scraping.

I need to get the Daily Observations table (the long table at the end of the page) from this page:

https://www.wunderground.com/history/daily/us/tx/greenville/KGVT/date/2015-01-05?cm_ven=localwx_history

The table's HTML starts with:

<table _ngcontent-c16="" class="tablesaw-sortable" id="history-observation-table">

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.wunderground.com/history/daily/us/tx/greenville/KGVT/date/2015-01-05?cm_ven=localwx_history"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.findAll(class_="region-content-observation")

The output is:

[<div class="region-content-observation">
 <city-history-observation _nghost-c34=""><div _ngcontent-c34="">
 <div _ngcontent-c34="" class="observation-title">Daily Observations</div>
 <!-- -->
     No Data Recorded

   <!-- -->
 </div></city-history-observation>
 </div>]

So it does not get the table and returns "No Data Recorded", although it does get the title.

When I try other searches, I just get an empty list.

Does anyone know what is going wrong?

Answer 1

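If you open the page in Firefox, you can use the Network tab in the developer tools to see all of the separate web resources that get downloaded. The table is filled in by JavaScript after the page loads, which is why the raw HTML fetched by urlopen only contains the "No Data Recorded" placeholder. The data you are interested in is actually served by this JSON file, which can be retrieved and parsed with Python's json library.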
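As a rough sketch of that approach (not the exact code for this site): the helper names `fetch_json` and `observations_to_rows`, the `"observations"` key, and the `time`/`temp` fields below are all hypothetical, and the sample payload is made up for illustration. The real URL and field names have to be copied from the request shown in the Network tab.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def observations_to_rows(payload, fields):
    """Turn a list of observation dicts into rows holding the chosen fields.

    Assumes (hypothetically) a payload shaped like
    {"observations": [{...}, {...}]}; change the key to match the real response.
    Fields missing from an observation become None.
    """
    return [[obs.get(name) for name in fields]
            for obs in payload.get("observations", [])]


# Made-up sample standing in for the downloaded file, so the sketch runs offline.
sample = json.loads(
    '{"observations": [{"time": "12:53 AM", "temp": 41}, {"time": "1:53 AM"}]}'
)
rows = observations_to_rows(sample, ["time", "temp"])
print(rows)
```

With the real JSON URL in hand, `observations_to_rows(fetch_json(json_url), [...])` would replace the sample; from there the rows can be written to CSV or loaded into a DataFrame.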

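Note: I have never scraped a site that uses an API key, so I'm not sure about the ethics or best practices in this situation. As a test, I was able to download the JSON file without any problems. However, I suspect Weather Underground does not want you to reuse their key repeatedly; it looks like they no longer provide free weather API keys.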

BeautifulSoup returns "No Data Recorded" when fetching a table from the web
