Question

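I managed to scrape a list of URLs from a CSV file, but I have a problem: the scraping stops when it hits a link that fails. It also prints a lot of None lines; is it possible to get rid of them?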

I would appreciate some help here. Thank you in advance!

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup #required to parse html
import requests #required to make request

#read file
with open('urls.csv','r') as f:
    csv_raw_cont=f.read()

#split by line
split_csv=csv_raw_cont.split('\n')

#specify separator
separator=";"

#iterate over each line
for each in split_csv:

    #specify the row index
    url_row_index=0 #in our csv example file the url is the first row so we set 0

    #get the url
    url = each.split(separator)[url_row_index]

    #fetch content from server
    html = requests.get(url).content

    #soup fetched content
    soup = BeautifulSoup(html,'lxml')

    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")

    for tag in tags:
        print(tag.get('href'))

The faulty output looks like this:

https://www.tennis-point.com/asics-gel-resolution-7-all-court-shoe-men-white-silver-02013802720000.html
None
https://www.tennis-point.com/cep-ultralight-run-sports-socks-men-black-light-green-12143000063000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-clay-court-shoe-men-white-grey-02013802634000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-all-court-shoe-men-white-silver-02013802723000.html
None
https://www.tennis-point.com/asics-gel-challenger-9-indoor-carpet-shoe-men-white-grey-02012401735000.html
None
https://www.tennis-point.com/asics-gel-court-speed-clay-court-shoe-men-dark-blue-yellow-02014202833000.html
None
https://www.tennis-point.com/asics-gel-court-speed-all-court-shoe-men-white-silver-02014202832000.html
None
Traceback (most recent call last):
  File "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py", line 33, in <module>
    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")
AttributeError: 'NoneType' object has no attribute 'findAll'
[Finished in 3.7s with exit code 1]
[shell_cmd: python -u "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py"]
[dir: /Users/imaging-adrian/Desktop/Python Scripts]
[path: /Users/imaging-adrian/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/munki]

The links in my CSV file look like this:

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E701Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-4907;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E600N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E326Y-0174;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E801N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-9093;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E804N-9095;

Answer 1

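Here is a working version. It checks that the productsPicture div exists before calling findAll on it, skips links whose href is None, and writes the results to results.csv: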

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            tags = divTag.findAll("a")
        else:
            continue

        for tag in tags:
            res = tag.get('href')
            if res is not None:
                writer.writerow([res])
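
The version above still stops if requests.get itself raises, for example on a blank row, a malformed URL, or a timeout. The following is only a minimal sketch of one way to guard against that, not a definitive implementation. The helper names read_urls, extract_links and scrape and the 10-second timeout are assumptions made for illustration; they are not part of the original code.

```python
import csv

import requests
from bs4 import BeautifulSoup


def read_urls(csv_path, delimiter=";"):
    """Yield the first column of every non-empty row in the CSV file."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                yield row[0].strip()


def extract_links(html):
    """Return the href of each <a> inside the first div.productsPicture."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "productsPicture"})
    if div is None:
        return []  # the page has no product picture block
    # href=True matches only anchors that carry an href attribute,
    # so no None values end up in the result.
    return [a["href"] for a in div.find_all("a", href=True)]


def scrape(csv_path, out_path, timeout=10):
    """Fetch every URL from csv_path and write the found links to out_path."""
    with open(out_path, "w", newline="") as results:
        writer = csv.writer(results)
        for url in read_urls(csv_path):
            try:
                response = requests.get(url, timeout=timeout)
                response.raise_for_status()
            except requests.RequestException as exc:
                # Malformed URLs, connection errors, timeouts and HTTP
                # error statuses all land here; report and move on.
                print(f"skipping {url}: {exc}")
                continue
            for href in extract_links(response.content):
                writer.writerow([href])
```

Calling scrape('urls.csv', 'results.csv') would then process the whole list, printing a message for each URL it has to skip instead of stopping at the first failure.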

Skipping errors when scraping a list of URLs from a CSV file
