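I'm new to Python and web scraping. I expect this is simple, but I can't get it to work.
I built a local web page with Flask that contains a table and a function that fills that table. The next step is to fetch this data from another computer. Here is what I tried: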
import requests
from bs4 import BeautifulSoup
# Keep the response so its HTML can be handed to BeautifulSoup
response = requests.get('http://127.0.0.1:5000')
soup = BeautifulSoup(response.text, 'html.parser')
This is what I get:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://www.w3schools.com/w3css/4/w3.css" rel="stylesheet"/>
<link href="https://use.fontawesome.com/releases/v5.7.0/css/all.css" rel="stylesheet"/>
<title>Title</title>
</head>
<body>
<div class="w3-container w3-teal">
<h1>SpaceWire Devices</h1>
</div>
<div>
<table class="w3-table-all w3-large" id="device_table">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tr class="w3-blue">
<th>Device ID</th>
<th colspan="2">Channel 1 (Left)</th>
<th colspan="2">Channel 2 (Left)</th>
</tr>
</table>
</div>
<script>
//first add an event listener for page load
document.addEventListener( "DOMContentLoaded", get_json_data, false ); // get_json_data is the function name that will fire on page load
//this function is in the event listener and will execute on page load
function get_json_data(){
// Relative URL of external json file
var json_url = '/status';
//Build the XMLHttpRequest (aka AJAX Request)
xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
if (this.readyState == 4 && this.status == 200) {//when a good response is given do this
var data = JSON.parse(this.responseText); // convert the response to a json object
append_json(data);// pass the json object to the append_json function
}
}
//set the request destination and type
xmlhttp.open("get", json_url, true);
//set required headers for the request
// xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
// send the request
xmlhttp.send(); // when the request completes it will execute the code in onreadystatechange section
}
//this function appends the json data to the table 'gable'
function append_json(data){
var table = document.getElementById('device_table');
for (var device in data) {
var tr = document.createElement('tr');
tr.innerHTML = '<td>' + data[device].id + '</td>' +
'<td>' + data[device].channel_1 + '</td>' +
'<td>' + data[device].channel_1_port + '</td>' +
'<td>' + data[device].channel_2 + '</td>' +
'<td>' + data[device].channel_2_port + '</td>'
table.appendChild(tr);
};
}
</script>
</body>
</html>
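What I actually want is the data that append_json() ends up creating. How do I do that?
Answer 0 (score: 4)
Jammy Dodger's comment about Selenium is correct. The HTML is generated by JavaScript, and requests does not execute that code the way a browser does. I would use Selenium to open the page and retrieve the DOM that way. From there you can navigate and scrape the data you need. It should look something like this.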
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
# Path to the Firefox binary; adjust for your system
options.binary_location = '/usr/lib/firefox/firefox'
browser = webdriver.Firefox(options=options)
# URL of the page to scrape (here, the local Flask page from the question)
url = 'http://127.0.0.1:5000'
try:
    browser.get(url)
    # Give the JS a little bit of time to generate the HTML
    time.sleep(1)
    html = browser.page_source
finally:
    # Always close the browser, even if loading the page fails
    browser.quit()
soup = BeautifulSoup(html, 'lxml')
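Since the page's script loads its rows from the relative URL '/status', another option may be to request that endpoint directly and skip the browser. This is only a minimal sketch: it assumes the endpoint is served by the same Flask app at http://127.0.0.1:5000/status and returns JSON whose entries carry the id, channel_1, channel_1_port, channel_2 and channel_2_port fields that append_json() reads. The helper names fetch_status and devices_to_rows are made up for illustration.

```python
import json
from urllib.request import urlopen

# Field names taken from the page's append_json() function
FIELDS = ("id", "channel_1", "channel_1_port", "channel_2", "channel_2_port")


def fetch_status(url="http://127.0.0.1:5000/status"):
    """Download and decode the JSON that the page's script requests.

    The URL is an assumption: the page's relative '/status' joined to the
    address used in the question.
    """
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def devices_to_rows(data):
    """Turn the decoded JSON into one tuple per device.

    The JavaScript loop `for (var device in data)` works for both an object
    and an array, so both shapes are accepted here.
    """
    devices = data.values() if isinstance(data, dict) else data
    return [tuple(device.get(field) for field in FIELDS) for device in devices]
```

With the Flask app running, devices_to_rows(fetch_status()) should give one tuple per device. If you prefer to stay with requests, requests.get(url).json() decodes the same response.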
Answer 1 (score: -1)
You should call soup.get_text(), or fetch a specific tag with soup.find("a"), or even loop with for i in soup and call i.get_text() on each item.
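For the table in the question, fetching specific tags could look like the sketch below. It assumes the HTML has already been rendered (for example, taken from a browser's page_source), so the rows added by append_json() are present. The sample markup and its values are invented for illustration, and table_rows is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Invented sample standing in for the rendered page; the values are made up
RENDERED_HTML = """
<table id="device_table">
<tr class="w3-blue"><th>Device ID</th><th colspan="2">Channel 1 (Left)</th></tr>
<tr><td>1</td><td>up</td><td>3</td><td>down</td><td>4</td></tr>
<tr><td>2</td><td>down</td><td>1</td><td>up</td><td>2</td></tr>
</table>
"""


def table_rows(html, table_id="device_table"):
    """Return the text of every <td> cell, one list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows


print(table_rows(RENDERED_HTML))
```

Run against the raw response from requests, the same function would return an empty list, because the unrendered page contains only the header row.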