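I have a dataset from which I want to extract some URLs. The problem is that when I add the extracted values back to the DataFrame, the row index is wrong, so the extracted values do not line up with the rows they came from.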
my_data
   username  date              text                            extracted_url
0  sports    2018-05-08 13:20  something google.com            [google.com]
1  sports    2018-05-08 12:34  two links google.com yahoo.com  [google.com, yahoo.com]
2  sports    2018-05-08 12:34  some text without links
3  sports    2018-05-08 12:34  google.com                      [google.com]
Code:
import pandas as pd
import requests
import urllib, urlparse
from urlparse import urlsplit
my_file = pd.read_csv('my_file.csv', sep=';', engine='python', error_bad_lines=False)
df = pd.DataFrame(my_file)
text = my_file['text'].str.extract('(https?://[^>]+)' , expand=False).dropna()
print my_file
sep = ' :|\spic|#'
r = text.str.split(pat=sep, expand=False)
se = pd.Series(r)
links = []
item_ids = []
my_file['extracted_links'] = r
for index, row in r.iteritems():
    link = row[0].replace(" ", "")
    response = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url=="http://www.google.com/":
        item_id = response.rsplit('/', 1)
        links.append(response)
        item_ids.append(item_id[-1])
    else:
        links.append('nan')
        item_ids.append('nan')
df['links'] = pd.Series(links)
df['item_ids'] = pd.Series(item_ids)
df.to_csv('example.csv')
The output I get:
   extracted_url            links
0  [google.com]             google.com
1  [google.com, yahoo.com]  google.com
2                           google.com
3  [google.com]
Expected output:
   extracted_url            links
0  [google.com]             google.com
1  [google.com, yahoo.com]  google.com
2  nan                      nan
3  [google.com]             google.com
Answer 0 (score: 0)
It now works correctly with the following code, although I am not sure it is the most elegant solution:
for index, row in r.iteritems():
    link = row.replace(" ", "")
    response = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url=="http://www.sxc.com/":
        re = urllib.unquote(response.encode("ascii"))
        item_id = re.rsplit('/', 1)
        # Write by the original row label so each value lands on its own row;
        # a single .loc call avoids chained assignment.
        df.loc[index, 'links'] = re
        df.loc[index, 'item_ids'] = item_id[-1]
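
For anyone wondering why the original version shifted the rows: as far as I can tell, dropna() leaves the filtered Series with the index 0, 1, 3, but pd.Series(links) builds a fresh 0, 1, 2 index, and pandas aligns on index labels when you assign a Series to a column. Below is a minimal sketch of that effect in Python 3 with a current pandas. The sample rows, the simplified regex, and the column names misaligned, aligned and via_map are made up for illustration, and urlsplit(...).netloc is only a stand-in for the requests call in the real code.

```python
import pandas as pd
from urllib.parse import urlsplit

# Illustrative data only: row 2 has no URL.
df = pd.DataFrame({
    "text": [
        "something http://google.com",
        "two links http://google.com http://yahoo.com",
        "some text without links",
        "http://google.com",
    ]
})

# Rows without a match are dropped, so the surviving index is 0, 1, 3.
urls = df["text"].str.extract(r"(https?://\S+)", expand=False).dropna()

# Stand-in for the per-URL work (the original code resolves redirects here).
hosts = [urlsplit(u).netloc for u in urls]

# Misaligned: the new Series is indexed 0, 1, 2, so the value that belongs
# to row 3 lands on row 2 and row 3 is left empty.
df["misaligned"] = pd.Series(hosts)

# Aligned: reuse the index of the filtered Series; row 2 stays NaN.
df["aligned"] = pd.Series(hosts, index=urls.index)

# Alternative: map() keeps the index of `urls`, so no list is needed.
df["via_map"] = urls.map(lambda u: urlsplit(u).netloc)

print(df[["misaligned", "aligned", "via_map"]])
```

Writing with df.loc[index, ...] inside the loop, as in the answer, works for the same reason: the value is stored under the original row label instead of a position in a list.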