Looping over a pandas dataframe of 10 URLs and extracting content from each (BeautifulSoup)

Date: 2017-02-15 13:13:57

Tags: python beautifulsoup urllib2

I have a CSV named 'df' with one column: a header followed by 10 URLs.

Col
"http://www.cnn.com"
"http://www.fark.com"
etc 
etc

Here is my code that fails:

import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append(text.get('href'))

I get the error message

ValueError: unknown url type: C

I also get other variants of this error. The problem is that it never gets past

x = urllib2.urlopen(link[0])

On the other hand, this code works...

url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new, "lxml")
links = []
for link in soup.find_all('a', href=True):
    links.append(link.get('href'))
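The href-collecting loop in the working code can also be written as a list comprehension. A minimal sketch using a tiny in-memory page instead of a live fetch, and the stdlib "html.parser" instead of "lxml" so no extra dependency is needed:

```python
import bs4 as bs

# a small stand-in for the HTML that urlopen(...).read() would return
html = '<a href="/world">World</a><a href="/us">US</a><p>no link</p>'
soup = bs.BeautifulSoup(html, "html.parser")

# equivalent to the append loop: keep only <a> tags that have an href
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links)  # ['/world', '/us']
```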

1 Answer:

Answer 0 (score: 0)

Corrected answer

I didn't realize you were using pandas, so my original answer wasn't very helpful.

The way to do this with pandas is to iterate over the rows and extract the URL from each one. The following should work without having to delete the header:

import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')

links = []
for _, row in df_link.iterrows():
    url = row['Col']
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new, "lxml")
    # collect the hrefs into a list instead of appending to the tag itself
    for a in soup.find_all('a', href=True):
        links.append(a.get('href'))
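If you only need the URLs themselves, a simpler alternative is to pull the column's values directly, which skips the header automatically. A minimal sketch, assuming the column is named 'Col' as in the question and using an in-memory CSV in place of df.csv:

```python
import pandas as pd
from io import StringIO

# stand-in for df.csv: a 'Col' header followed by the URLs
csv_text = "Col\nhttp://www.cnn.com\nhttp://www.fark.com\n"
df_link = pd.read_csv(StringIO(csv_text))

# the column's values are just the URLs; the header is never among them
urls = df_link['Col'].tolist()
print(urls)  # ['http://www.cnn.com', 'http://www.fark.com']
```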

Original, misleading answer below

It looks like the problem is that iterating over a DataFrame yields its column labels, not its rows. So on the first iteration of your loop, link is the string "Col" and link[0] is the single character "C", which is not a valid URL. That is exactly where the "unknown url type: C" error comes from.
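The mix-up is easy to reproduce. A small sketch, assuming the same 'Col' header:

```python
import pandas as pd

df = pd.DataFrame({'Col': ['http://www.cnn.com', 'http://www.fark.com']})

# iterating a DataFrame directly yields column labels, not rows
labels = [item for item in df]
print(labels)        # ['Col']
print(labels[0][0])  # 'C', the "unknown url type: C" from the traceback
```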