Question

My goal is to scrape a list of URLs stored in a CSV file. A sample URL looks like this:

http://mashable.com/2013/01/07/amazon-instant-video-browser/

When I try to pass the URLs from the list to BeautifulSoup, I currently get the following error:

URLError: <urlopen error unknown url type: http>

Does anyone know how to fix this? I suspect it is easy to solve, but I have not been able to work it out. This is the code I am currently using:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

contents = []
with open('url.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents

for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
print(soup)

Answer 1

Use try and except inside your for loops so that an error on one URL does not stop the script. For example:

for url in urls:
    try:
        contents.append(url)  # Add each url to list contents
    except Exception:
        pass
for url in contents:  # Parse through each url in the list.
    try:
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, "html.parser")
    except Exception as exc:
        # Report the URL that failed instead of silently ignoring it.
        print(f"Skipping {url[0]!r}: {exc}")
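
Catching the exception skips the failing URL but does not explain why a URL that visibly starts with http is rejected. One common cause, which is only an assumption about this particular file, is an invisible character in front of the first URL, such as the UTF-8 byte-order mark that some spreadsheet programs write when saving a CSV. urlopen then sees a scheme that is not exactly "http", yet the error message looks normal because the extra character does not print. The sketch below uses a hypothetical helper, load_urls, that reads the file with the "utf-8-sig" encoding (which drops a leading byte-order mark), strips surrounding whitespace, and keeps only http and https URLs.

```python
import csv
from urllib.parse import urlparse


def load_urls(path):
    """Return the first column of each non-empty CSV row as a cleaned URL.

    load_urls is a hypothetical helper, not part of csv or urllib.
    """
    urls = []
    # "utf-8-sig" removes a leading byte-order mark if one is present;
    # files without one are read exactly like plain UTF-8.
    with open(path, "r", encoding="utf-8-sig", newline="") as csvf:
        for row in csv.reader(csvf):
            if not row:
                continue  # blank line in the file
            url = row[0].strip()  # drop stray spaces around the URL
            # Keep only values that parse with an http or https scheme.
            if urlparse(url).scheme in ("http", "https"):
                urls.append(url)
    return urls
```

With the question's file this would be called as urls = load_urls('url.csv'), and each entry can then be passed to urlopen(url) directly. Printing repr(url[0]) in the original loop is a quick way to check whether a hidden character is really there.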
