I have a CSV file named 'df' with a single column: one header followed by 10 URLs.
Col
"http://www.cnn.com"
"http://www.fark.com"
etc
etc
Here is my failing code:
import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new,"lxml")
    for text in soup.find_all('a',href = True):
        text.append((text.get('href')))
I get this error message:
ValueError: unknown url type: C
I have also gotten other variants of this error.
The problem is that execution never gets past this line:
x = urllib2.urlopen(link[0])
On the other hand, this code works:
url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new,"lxml")
links = []  # not defined in the original snippet; needed for the append below
for link in soup.find_all('a',href = True):
    links.append((link.get('href')))
Answer 0 (score: 0)
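I didn't realize you were using pandas, so what I said before wasn't very helpful.
The way to do this with pandas is to iterate over the rows and pull the URL out of each one. The following should work without having to remove the header: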
import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
links = []
# iterrows() yields (index, row) pairs; the row is a Series keyed by column name
for index, row in df_link.iterrows():
    url = row['Col']
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new, "lxml")
    # Collect the href of every anchor tag on the page
    for a in soup.find_all('a', href=True):
        links.append(a.get('href'))
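As a possible variant of the loop above, the column can also be read directly and filtered before any request is made. This is only a sketch under a few assumptions: it uses Python 3 (`urllib.parse` instead of the Python 2 `urllib2` module), the helper name `iter_urls` is hypothetical, and the CSV content is an in-memory stand-in for `df.csv` with one deliberately bad row added for illustration.

```python
import io
from urllib.parse import urlparse

import pandas as pd


def iter_urls(df, column="Col"):
    """Yield the values of `column` that look like http(s) URLs.

    Hypothetical helper, not part of pandas or urllib.
    """
    for value in df[column].dropna():
        url = str(value).strip()
        # urlparse gives an empty scheme for strings such as "not a url"
        if urlparse(url).scheme in ("http", "https"):
            yield url


# Assumed stand-in for df.csv; the "not a url" row is invented for the demo.
csv_text = 'Col\n"http://www.cnn.com"\n"not a url"\n"http://www.fark.com"\n'
df_link = pd.read_csv(io.StringIO(csv_text))

valid_urls = list(iter_urls(df_link))
```

Each entry of `valid_urls` could then be passed to `urllib2.urlopen` (or `urllib.request.urlopen` on Python 3) in place of `row['Col']`, which avoids the `unknown url type` error for malformed cells.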
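It looks like your loop is picking up the CSV header rather than the data: iterating directly over df_link yields the column labels, so on the first iteration link is "Col" and link[0] is "C", which is not a valid URL (hence unknown url type: C).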
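To make the failure mode concrete, here is a small sketch of what iterating over a DataFrame produces compared with iterating over one of its columns. It assumes an in-memory CSV with the same layout as the one in the question (a `Col` header and two of the URLs); the variable names `labels`, `first_chars`, and `urls` are only for illustration.

```python
import io

import pandas as pd

# Assumed stand-in for df.csv: one header line plus two URLs.
csv_text = 'Col\n"http://www.cnn.com"\n"http://www.fark.com"\n'
df_link = pd.read_csv(io.StringIO(csv_text))

# Iterating over the DataFrame itself yields column labels, not rows.
labels = [link for link in df_link]

# Indexing each label with [0] takes its first character.
first_chars = [link[0] for link in df_link]

# Iterating over the column yields the cell values, i.e. the URLs.
urls = list(df_link['Col'])

print(labels)
print(first_chars)
print(urls)
```

The second list is what ends up being passed to `urlopen` in the failing loop, which matches the `C` in the reported error message.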