I want to scrape all URLs from multiple web pages. It works, but only the results from the last page are saved to the file.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
    links.append(link.get('href'))

filename = 'output.csv'
with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)
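What am I missing here?
It would be even cooler if I could use a CSV file containing all the URLs instead of a list. But nothing I have tried so far has worked...
Answer 0: (score: 1)
You are only using the soup from the last URL. You should move the second for loop inside the first one. Also, you are collecting every element that matches the regex, and some of those elements are outside the table you want to scrape.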
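As a rough illustration of moving the extraction inside the URL loop, here is a minimal sketch. To stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser`; the loop structure is the point, not the parser. The names `MovieLinkParser`, `collect_links`, `save_links`, and the `fetch_html` callable are hypothetical and not part of the original code.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the question, written as a raw string.
MOVIE_HREF = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")


class MovieLinkParser(HTMLParser):
    """Collects href values of <a> tags that match MOVIE_HREF."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and MOVIE_HREF.match(href):
            self.links.append(href)


def collect_links(urls, fetch_html):
    """Return the matching links from every URL, not just the last one.

    fetch_html is a hypothetical callable that maps a URL to its HTML text.
    """
    links = []  # created once, before the loop, so nothing is overwritten
    for url in urls:
        parser = MovieLinkParser()
        parser.feed(fetch_html(url))
        links.extend(parser.links)  # extraction happens inside the URL loop
    return links


def save_links(links, filename):
    """Write one link per line, opening the file once after the loop."""
    with open(filename, mode="w") as outfile:
        for link in links:
            outfile.write("%s\n" % link)
```

To run it against the real pages, `fetch_html` could wrap the question's `urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'})).read()` call and decode the bytes (UTF-8 is an assumption here). `save_links(collect_links(urls, fetch_html), 'output.csv')` would then write every page's links in one pass.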
(function() {
'use strict'
var users = [];
function userExist(x) {
for (var i = 0; i < users.length; i++) {
if (users[i].username == x) {
return users[i];
}
}
// Only report "not found" after every user has been checked.
return false;
}
function reg() {
var x = document.forms.first.regUser.value;
var y = document.forms.first.regPass.value;
var z = document.forms.first.regVer.value;
if (y != z) {
document.getElementById("text").innerHTML = "passwords dont match";
} else if (x == "" || y == "") {
document.getElementById("text").innerHTML = "password & username are mandatory";
} else if (userExist(x) != false) {
document.getElementById("text").innerHTML = "user already exists! try logging in";
} else {
var user = {
username: x,
password: y
};
users.push(user);
}
}
function log() {
var x = document.forms.log.logUser.value;
var y = document.forms.log.logPass.value;
var q = userExist(x);
if (q != false) {
if (q.password == y) {
document.getElementById("log").innerHTML = "Login!";
} else {
document.getElementById("log").innerHTML = "password incorrect!";
}
} else {
document.getElementById("log").innerHTML = "username doesn't exist!";
}
}
function rem() {
var x = document.forms.rem.remUser.value;
var y = document.forms.rem.remPass.value;
var q = users.indexOf(userExist(x));
if (q == -1) {
document.getElementById("rem").innerHTML = "username doesn't exist!";
} else {
if (users[q].password == y) {
users.splice(q, 1)
document.getElementById("rem").innerHTML = "user removed successfully!";
} else {
document.getElementById("rem").innerHTML = "password incorrect!";
}
}
}
document.getElementById('logBtn').addEventListener('click', log, false);
document.getElementById('remBtn').addEventListener('click', rem, false);
document.getElementById('regBtn').addEventListener('click', reg, false);
document.getElementById("man").innerHTML = "say hi";
})()
Here is the result.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<!-- <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO"
crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49"
crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/js/bootstrap.min.js" integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy"
crossorigin="anonymous"></script> -->
<link rel="stylesheet" href="website_1.css">
<title>nave's website</title>
</head>
<body>
<h1>Welcome!</h1>
<form name="first">
<h2>Register</h2>
<input type="text" name="regUser" placeholder="enter user"> <br>
<input type="text" name="regPass" placeholder="enter password"><br>
<input type="text" name="regVer" placeholder="verify password"><br>
<button id="regBtn" class="btn btn-default" type="button">Register</button><br>
</form>
<p id="text"></p><br>
<form name="log">
<h2>Log in</h2>
<input type="text" name="logUser" placeholder="enter user"><br>
<input type="text" name="logPass" placeholder="enter password"><br>
<button id="logBtn" class="btn btn-default" type="button">login</button><br>
<p id="log"></p><br>
</form>
<h2>Remove</h2>
<form name="rem">
<input type="text" name="remUser" placeholder="enter user"><br>
<input type="text" name="remPass" placeholder="enter password"><br>
<button id="remBtn" class="btn btn-default" type="button">Remove</button><br>
<p id="rem"></p><br>
</form>
<p id="man"></p>
</body>
<script>
</script>
</html>
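Answer 1: (score: 1)
Hey, this is my first answer, so I'll do my best to help.
The data gets overwritten because you iterate over the URLs in one loop and then iterate over the soup object in a separate loop.
By the time the second loop runs, only the last soup object is left, so the best approach is either to append each soup object to a list from inside the URL loop, or to query the soup object inside the URL loop itself: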
soup_obj_list = []
for url in urls:
    # Fetch each page once; urlopen alone is sufficient here.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # Keep every page's soup instead of only the last one.
    soup_obj_list.append(soup)
Hope that solves your first problem. I can't really help with the CSV issue.
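For the CSV part of the question, one possible starting point is the standard library's `csv` module. This is only a sketch under assumptions: the function name `read_urls` is hypothetical, and the file is assumed to hold one URL per row in the first column with no header row.

```python
import csv


def read_urls(csv_path, column=0):
    """Return the non-empty values from one column of a CSV file."""
    urls = []
    with open(csv_path, newline="") as infile:
        for row in csv.reader(infile):
            # Skip blank lines and rows too short to contain the column.
            if len(row) > column and row[column].strip():
                urls.append(row[column].strip())
    return urls
```

With a file such as `urls.csv` (a hypothetical name), `urls = read_urls("urls.csv")` could replace the hard-coded list, and the rest of the loop would stay the same.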