Keeping the index of items returned by pandas iteration in a for loop

Time: 2018-05-08 14:22:27

Tags: python pandas

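I have a dataset from which I want to extract some URLs. The problem is that when I add the extracted values back to the DataFrame, the row indexes do not line up, so the extracted values end up on the wrong rows.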

my_data

   username  date              text                            extracted_url
0  sports    2018-05-08 13:20  something google.com            [google.com]
1  sports    2018-05-08 12:34  two links google.com yahoo.com  [google.com, yahoo.com]
2  sports    2018-05-08 12:34  some text without links
3  sports    2018-05-08 12:34  google.com                      [google.com]

Code

import pandas as pd
import requests
import urllib, urlparse
from urlparse import urlsplit




my_file = pd.read_csv('my_file.csv', sep=';',  engine='python', error_bad_lines=False)
df = pd.DataFrame(my_file)

text = my_file['text'].str.extract('(https?://[^>]+)' , expand=False).dropna()

print my_file
sep = ' :|\spic|#'

r = text.str.split(pat=sep, expand=False)

se = pd.Series(r)



links = []
item_ids = []
my_file['extracted_links'] = r


for index, row in r.iteritems():
    link = row[0].replace(" ", "")
    response = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url=="http://www.google.com/":
        item_id = response.rsplit('/', 1)
        links.append(response)
        item_ids.append(item_id[-1])
    else:
        links.append('nan')
        item_ids.append('nan')



df['links'] = pd.Series(links)
df['item_ids'] = pd.Series(item_ids)


df.to_csv('example.csv')

The output I get:

    extracted_url           links
0   [google.com]            google.com
1   [google.com, yahoo.com] google.com
2                           google.com              
3   [google.com]

Expected output:

         extracted_url           links
    0   [google.com]            google.com
    1   [google.com, yahoo.com] google.com
    2    nan                     nan                
    3   [google.com]            google.com
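
The shift between the two tables can be reproduced offline. The sketch below is only an illustration under assumptions: the sample rows, the simplified regex, and the column names `misaligned` and `aligned` are made up, and the `requests` step is left out. It suggests that the likely cause is `dropna()` removing row 2 while `pd.Series(links)` numbers the results 0, 1, 2 again, and pandas aligns on index labels when assigning a Series to a column.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: row 2 contains no link.
df = pd.DataFrame({"text": [
    "a http://google.com",
    "b http://google.com",
    "no link",
    "c http://google.com",
]})

# dropna() removes row 2, so the surviving index labels are 0, 1, 3.
urls = df["text"].str.extract(r"(https?://\S+)", expand=False).dropna()

# Collecting results in a plain list discards those labels.
links = [u for u in urls]

# pd.Series(links) gets a fresh 0, 1, 2 index, so the third link lands on row 2
# and row 3 is left empty.
df["misaligned"] = pd.Series(links)

# Passing the original index keeps each value on the row it came from.
df["aligned"] = pd.Series(links, index=urls.index)

print(df)
```

With the index passed explicitly, the row without a link stays empty, which matches the shape of the expected output above.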

1 answer:

Answer 0 (score: 0)

It now works with the following code, although I am not sure this is the most elegant solution.

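for index, row in r.iteritems():
    # Assumes each item of r is a single URL string at this point.
    link = row.replace(" ", "")
    # Follow redirects and keep the final URL.
    response = requests.get(link).url

    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url == "http://www.sxc.com/":
        decoded = urllib.unquote(response.encode("ascii"))
        item_id = decoded.rsplit('/', 1)
        # Write by index label so each value lands on the row it came from.
        df.loc[index, 'links'] = decoded
        df.loc[index, 'item_ids'] = item_id[-1]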
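
A possibly tidier variant is to let pandas carry the index instead of writing cell by cell. The following is a minimal sketch, not the answer author's method: `item_id_for` is a hypothetical helper, the sample rows are invented, it uses the Python 3 module `urllib.parse` rather than the Python 2 imports above, and it skips the `requests.get` redirect step so that it runs offline.

```python
import pandas as pd
from urllib.parse import urlsplit, unquote  # Python 3 names (assumption)


def item_id_for(url, wanted_host="www.sxc.com"):
    """Hypothetical helper: last path segment if the host matches, else None."""
    parts = urlsplit(url.replace(" ", ""))
    if parts.netloc != wanted_host:
        return None
    return unquote(parts.path).rsplit("/", 1)[-1]


# Invented sample data; row 1 has no link, row 2 points at another host.
df = pd.DataFrame({"text": [
    "see http://www.sxc.com/photo/123",
    "no link here",
    "see http://example.com/x",
]})

urls = df["text"].str.extract(r"(https?://\S+)", expand=False).dropna()

# Series.map returns a Series with the same index as `urls`,
# so the assignment aligns each result with its source row.
df["item_ids"] = urls.map(item_id_for)

print(df)
```

Rows that had no URL, or whose host does not match, simply end up as missing values, so no `'nan'` placeholder strings are needed.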