Question

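I want to parse a table (or several tables) that spans multiple pages. My current approach works, but it is too manual: I would like the tables from the different pages to be parsed automatically and merged into one. The number of pages may not always be the same.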


Note that the URLs differ only in the "page=X" part. The pages themselves also contain links to, for example, the next page.

Answer 1

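from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

results = {}
for page_num in range(1, 10):  # covers pages 1-9; adjust to the last page + 1
    address = ('https://rittresultater.no/nb/sb_tid/923?page='
               + str(page_num) + '&pv2=11027&pv1=U')

    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    # `class` is a reserved word in Python, so BeautifulSoup takes `class_`
    table = soup.find(class_='table-condensed')
    # wrap the markup in StringIO; newer pandas deprecates raw HTML strings
    output = pd.read_html(StringIO(str(table)))[0]
    results[page_num] = output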

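When the loop has finished, use a list comprehension to pull out the per-page tables. If the goal is what the last line of your code does, just scaled up to every page, do this: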

df = pd.concat([v for v in results.values()], axis = 0)
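Since the number of pages may vary, the hard-coded range could be replaced by a loop that keeps going until a page has no results table. What follows is only a rough sketch under assumptions I have not checked against the site: that a page number past the end returns either an HTTP error or a page without a non-empty `table-condensed` table. The names `fetch_page`, `parse_table`, `collect_pages` and `max_pages` are made up for illustration.

```python
from io import StringIO
from urllib.error import HTTPError
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = 'https://rittresultater.no/nb/sb_tid/923?page={}&pv2=11027&pv1=U'


def fetch_page(page_num):
    """Download one results page; return None if the server reports an error."""
    try:
        with urlopen(BASE_URL.format(page_num)) as response:
            return response.read()
    except HTTPError:
        return None


def parse_table(html):
    """Return the results table as a DataFrame, or None if the page has none."""
    if html is None:
        return None
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find(class_='table-condensed')
    if table is None:
        return None
    try:
        frame = pd.read_html(StringIO(str(table)))[0]
    except ValueError:  # pandas could not find a parsable table
        return None
    return None if frame.empty else frame


def collect_pages(fetch=fetch_page, parse=parse_table, max_pages=500):
    """Walk pages 1, 2, 3, ... until one yields no table (assumed end marker)."""
    frames = []
    for page_num in range(1, max_pages + 1):
        frame = parse(fetch(page_num))
        if frame is None:
            break
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # ignore_index gives the merged table one continuous row index
    return pd.concat(frames, axis=0, ignore_index=True)
```

Calling `df = collect_pages()` would then give one merged DataFrame. The `max_pages` cap is only a safety net: if the site answers out-of-range page numbers by repeating the last page instead of returning an empty one, this stop condition will not trigger, and you would need another check, such as comparing each page with the previous one or following the "next page" link.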

How to automatically parse tables spanning multiple pages with Python
