在抓取网址列表形成csv时跳过错误

时间:2018-05-14 10:43:28

标签: python csv screen-scraping

我设法从CSV文件中删除了一个网址列表,但是我遇到了一个问题,当点击时,抓取会停止。它还打印了很多行,是否有可能摆脱它们?

在此感谢一些帮助。先感谢您 !

以下是代码:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup #required to parse html
import requests #required to make request

#read file
with open('urls.csv','r') as f:
    csv_raw_cont=f.read()

#split by line
split_csv=csv_raw_cont.split('\n')

#specify separator
separator=";"

#iterate over each line
for each in split_csv:

    #specify the row index
    url_row_index=0 #in our csv example file the url is the first row so we set 0

    #get the url
    url = each.split(separator)[url_row_index] 

    #fetch content from server
    html = requests.get(url).content

    #soup fetched content
    soup = BeautifulSoup(html,'lxml')

    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")

    for tag in tags:
       print(tag.get('href'))

错误的结果如下所示:

https://www.tennis-point.com/asics-gel-resolution-7-all-court-shoe-men-white-silver-02013802720000.html
None
https://www.tennis-point.com/cep-ultralight-run-sports-socks-men-black-light-green-12143000063000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-clay-court-shoe-men-white-grey-02013802634000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-all-court-shoe-men-white-silver-02013802723000.html
None
https://www.tennis-point.com/asics-gel-challenger-9-indoor-carpet-shoe-men-white-grey-02012401735000.html
None
https://www.tennis-point.com/asics-gel-court-speed-clay-court-shoe-men-dark-blue-yellow-02014202833000.html
None
https://www.tennis-point.com/asics-gel-court-speed-all-court-shoe-men-white-silver-02014202832000.html
None
Traceback (most recent call last):
File "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py", line 33, in <module>
tags = soup.find("div", {"class": "productsPicture"}).findAll("a")
AttributeError: 'NoneType' object has no attribute 'findAll'
[Finished in 3.7s with exit code 1]
[shell_cmd: python -u "/Users/imaging-adrian/Desktop/Python 
Scripts/close_to_work.py"]
[dir: /Users/imaging-adrian/Desktop/Python Scripts]
[path: /Users/imaging-adrian/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/munki]

我的CSV文件中的链接如下所示:

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E701Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-4907;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E600N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E326Y-0174;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E801N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-9093;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E804N-9095;

1 个答案:

答案 0 :(得分:1)

这是工作版,

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            tags = divTag.findAll("a")
        else:
            continue

        for tag in tags:
            res = tag.get('href')
            if res != None:
                writer.writerow([res])