Question

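I'm new to Python and web scraping. I assume this is simple, but I can't get it to work.

I created a local web page with Flask that contains a table and a function that populates the table. The next step is to fetch this data from another computer. This is what I tried: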

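import requests
from bs4 import BeautifulSoup

# Keep the response so that its HTML can be handed to the parser
response = requests.get('http://127.0.0.1:5000')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)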

This is what I got:

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://www.w3schools.com/w3css/4/w3.css" rel="stylesheet"/>
<link href="https://use.fontawesome.com/releases/v5.7.0/css/all.css" rel="stylesheet"/>
<title>Title</title>
</head>
<body>
<div class="w3-container w3-teal">
<h1>SpaceWire Devices</h1>
</div>
<div>
<table class="w3-table-all w3-large" id="device_table">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tr class="w3-blue">
<th>Device ID</th>
<th colspan="2">Channel 1 (Left)</th>
<th colspan="2">Channel 2 (Left)</th>
</tr>
</table>
</div>
<script>
        //first add an event listener for page load
        document.addEventListener( "DOMContentLoaded", get_json_data, false ); // get_json_data is the function name that will fire on page load

        //this function is in the event listener and will execute on page load
        function get_json_data(){
            // Relative URL of external json file
            var json_url = '/status';

            //Build the XMLHttpRequest (aka AJAX Request)
            xmlhttp = new XMLHttpRequest();
            xmlhttp.onreadystatechange = function() {
                if (this.readyState == 4 && this.status == 200) {//when a good response is given do this

                    var data = JSON.parse(this.responseText); // convert the response to a json object
                    append_json(data);// pass the json object to the append_json function
                }
            }
            //set the request destination and type
            xmlhttp.open("get", json_url, true);
            //set required headers for the request
            // xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
            // send the request
            xmlhttp.send(); // when the request completes it will execute the code in onreadystatechange section
        }

        //this function appends the json data to the table 'gable'
        function append_json(data){
            var table = document.getElementById('device_table');
            for (var device in data) {
                var tr = document.createElement('tr');
                tr.innerHTML = '<td>' + data[device].id + '</td>' +
                '<td>' + data[device].channel_1 + '</td>' +
                '<td>' + data[device].channel_1_port + '</td>' +
                '<td>' + data[device].channel_2 + '</td>' +
                '<td>' + data[device].channel_2_port + '</td>'
                table.appendChild(tr);
            };
        }
    </script>
</body>
</html>

What I really want is the data that append_json() ends up creating. How can I get it?

Answer 1

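Jammy Dodger's comment about Selenium is correct. The HTML is generated by JavaScript, and a plain requests call does not execute that code the way a browser does. I would use Selenium to open the page and then retrieve the DOM from it. From there you can navigate and scrape the data you need. It should look something like this: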

  import time

  from bs4 import BeautifulSoup
  from selenium import webdriver
  from selenium.webdriver.firefox.options import Options

  options = Options()
  options.add_argument("--headless")
  # Path to the Firefox binary; adjust it for your system
  options.binary_location = '/usr/lib/firefox/firefox'
  browser = webdriver.Firefox(options=options)
  # URL of the page to scrape (here, the Flask page from the question)
  url = 'http://127.0.0.1:5000'
  try:
    browser.get(url)
    # Give the js a little bit of time to generate the html
    time.sleep(1)
    html = browser.page_source
  finally:
    # Close the browser even if loading the page fails
    browser.quit()
  soup = BeautifulSoup(html, 'lxml')
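
A lighter alternative, not part of the answer above: the page's script gets its rows from the relative URL '/status', so the same JSON can probably be requested directly, with no browser involved. The sketch below assumes that this endpoint is reachable from the scraping machine and returns a JSON object or array whose entries carry the fields that append_json() reads. The names fetch_status and status_to_rows and the sample payload are made up for illustration. It uses only the standard library; requests.get(url).json() would be an equivalent way to do the fetch.

```python
import json
from urllib.request import urlopen

# Field names taken from the page's append_json() function.
FIELDS = ("id", "channel_1", "channel_1_port", "channel_2", "channel_2_port")


def fetch_status(base_url):
    """Download and decode the JSON that the page's script requests."""
    with urlopen(base_url.rstrip("/") + "/status", timeout=5) as response:
        return json.load(response)


def status_to_rows(data):
    """Turn the decoded JSON into one tuple per device, in table-column order."""
    # The script loops with `for (var device in data)`, which works for both a
    # JSON object and a JSON array, so accept either shape here.
    devices = data.values() if isinstance(data, dict) else data
    return [tuple(device.get(field) for field in FIELDS) for device in devices]


# Made-up payload for illustration; real values would come from fetch_status().
sample = {
    "0": {"id": 1, "channel_1": "up", "channel_1_port": 3,
          "channel_2": "down", "channel_2_port": 4},
}
rows = status_to_rows(sample)
print(rows)

# Against the running Flask app (assumed address, as in the question):
# rows = status_to_rows(fetch_status("http://127.0.0.1:5000"))
```

If this works for your server, it avoids both the browser and the fixed one-second wait, because the data arrives already structured.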

Answer 2

You should call soup.get_text(), or get a specific tag with soup.find("a"), or even iterate with "for i in soup" and call i.get_text() on each element.
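
To illustrate those calls, here is a minimal sketch on a small, made-up HTML fragment shaped like the question's table. Note that it only helps once the rows are actually present in the HTML (for example in a browser's page source); the HTML returned by requests in the question contains just the header row. The fragment and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a table that already contains one data row.
html = """
<table id="device_table">
<tr><th>Device ID</th><th>Channel 1 (Left)</th></tr>
<tr><td>1</td><td>up</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text content.
header = soup.find("th")
print(header.get_text())

# Collect the text of every cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]
print(rows)
```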

I'm trying to scrape a web page, but I get the function instead of the actual data
