Scraping multiple web pages, but the results are overwritten by the last URL

Time: 2018-12-24 11:53:17

Tags: python python-3.x web-scraping beautifulsoup urllib

I want to scrape all the URLs from multiple web pages. It works, but only the results from the last web page end up in the file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
    links.append(link.get('href'))

filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)

What am I missing here?

It would be even cooler if I could use a CSV file containing all the URLs instead of a list. But nothing I have tried so far has worked...

2 Answers:

Answer 0: (score: 1)

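You are only using the soup from the last URL. You should move your second for loop inside the first one, so the links are extracted once per page. Also, you are getting every element that matches your regex, and there are matching elements outside the table you are trying to scrape.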
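
A minimal sketch of that restructuring, assuming the same regex and User-Agent header as in the question. The names `MOVIE_HREF`, `extract_movie_links` and `scrape_all` are made up for illustration. It does not narrow the search to the table, since the table's selector is not given here.

```python
import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Same pattern as in the question: relative links such as /movie/some-title
MOVIE_HREF = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")


def extract_movie_links(html):
    """Return the href of every <a> in `html` whose href matches MOVIE_HREF."""
    soup = BeautifulSoup(html, features="html.parser")
    return [a.get("href") for a in soup.find_all("a", attrs={"href": MOVIE_HREF})]


def scrape_all(urls, filename="output.csv"):
    """Fetch each URL, gather the links from every page, then write the file once."""
    links = []  # created once, before the loop, so earlier pages are not lost
    for url in urls:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html_page = urlopen(req).read()
        # Extraction happens inside the loop, once per page
        links.extend(extract_movie_links(html_page))
    with open(filename, mode="w") as outfile:
        for link in links:
            outfile.write("%s\n" % link)
    return links
```

Calling `scrape_all(urls)` with the list from the question should then write the links from every page, not just the last one.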


Answer 1: (score: 1)

Hey, this is my first answer, so I'll do my best to help.

The data gets overwritten because you iterate over the URLs in one loop and then query the soup object in a separate loop.

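By the time the second loop runs, `soup` only holds the last soup object from the first loop. The best fix is either to append each soup object to a list from inside the url loop, or to query the soup object inside the url loop itself: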

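from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

soup_obj_list = []
for url in urls:  # urls is the list defined in the question
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # keep every page's soup instead of overwriting the previous one
    soup_obj_list.append(soup)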
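
The second half, pulling the links out of every stored soup, might look roughly like this. It is only a sketch: `links_from_soups` is a made-up name, the regex is the one from the question, and two tiny hand-written pages stand in for the fetched ones so it runs without network access.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in the question
MOVIE_HREF = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")


def links_from_soups(soup_obj_list):
    """Collect matching hrefs from every soup in the list, in page order."""
    links = []
    for soup in soup_obj_list:
        for link in soup.find_all("a", attrs={"href": MOVIE_HREF}):
            links.append(link.get("href"))
    return links


# Stand-in pages (assumed content, not real Metacritic markup)
pages = [
    '<a href="/movie/first-film">A</a>',
    '<a href="/movie/second-film">B</a>',
]
soup_obj_list = [BeautifulSoup(p, features="html.parser") for p in pages]
print(links_from_soups(soup_obj_list))
# ['/movie/first-film', '/movie/second-film']
```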

Hope that solves your first problem. I can't really help with the CSV question.
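
For what it's worth, one possible starting point for the CSV part is the standard-library `csv` module. This sketch assumes a file with one URL per row in the first column and no header row. The file name `urls.csv` and the function name `read_urls` are made up for illustration.

```python
import csv


def read_urls(csv_path):
    """Return the first column of every non-empty row as a list of URLs."""
    with open(csv_path, newline="") as infile:
        return [
            row[0].strip()
            for row in csv.reader(infile)
            if row and row[0].strip()  # skip blank lines and empty cells
        ]
```

Something like `urls = read_urls("urls.csv")` could then replace the hard-coded list.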