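I want to add a column to a Pandas DataFrame that holds a URL built by concatenating other columns. I also want to include a condition.
At the moment, it looks like this: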
matches['url'] = ('http://www.example.org' +
                  matches['column1'] +
                  '/' +
                  (matches['id'].str[-3:] if matches['id'].str.contains('M|-0') else matches['id'].str[-4:]) +
                  '/xyz.pdf')
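The condition I'm having trouble with is: (matches['id'].str[-3:] if matches['id'].str.contains('M|-0') else matches['id'].str[-4:])
It should do the following: if matches['id'] contains the string M or -0, then matches['id'].str[-3:] should apply (i.e., take the last 3 characters of the id column); otherwise matches['id'].str[-4:] should apply.
However, I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know I could create an intermediate column and encode the condition there. But I'd like to do this in a neat one-liner, and I don't think I'm far from a solution. Thanks for your help.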
Answer 0: (score: 2)
I think you need numpy.where with a boolean Series as the mask:
mask = matches['id'].str.contains('M|-0')
matches['url'] = 'http://www.example.org' + matches['column1'] + '/' + \
np.where(mask, matches['id'].str[-3:], matches['id'].str[-4:]) + '/xyz.pdf'
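For context on why the original attempt fails: Python's `x if cond else y` asks for a single truth value from `cond`, and a boolean Series with several rows has none, which is what the ValueError reports. `np.where` instead picks a value row by row. A minimal sketch of both behaviours follows; the two-row `demo` frame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only for illustration
demo = pd.DataFrame({'id': ['2010-M012', '2010-1234']})
mask = demo['id'].str.contains('M|-0')  # boolean Series, one value per row

error_message = None
try:
    # The conditional expression calls bool() on the whole Series
    picked = demo['id'].str[-3:] if mask else demo['id'].str[-4:]
except ValueError as err:
    error_message = str(err)

# np.where selects element-wise instead of asking for one truth value
suffix = np.where(mask, demo['id'].str[-3:], demo['id'].str[-4:])
print(error_message)
print(suffix)
```

Note that `np.where` returns a NumPy array rather than a Series; it still concatenates with the other string columns as in the answer's code.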
Sample:
matches = pd.DataFrame({'id':['2010-M012','2010-1234','2010-1234'],
'column1':['s','d','m']})
print (matches)
column1 id
0 s 2010-M012
1 d 2010-1234
2 m 2010-1234
mask = matches['id'].str.contains('M|-0')
matches['url'] = 'http://www.example.org' + matches['column1'] + '/' + \
np.where(mask, matches['id'].str[-3:], matches['id'].str[-4:]) + '/xyz.pdf'
matches['url1'] = 'http://www.example.org' + matches['column1'] + '/' + \
matches['id'].map(lambda x : x[-3:] if (('M' in x) or ('-0' in x)) else x[-4:]) + '/xyz.pdf'
matches['url2'] = matches.apply(lambda x: 'http://www.example.org{}/{}/xyz.pdf'.format(x['column1'], x['id'][-3:] if (('M' in x['id']) or ('-0' in x['id'])) else x['id'][-4:]), axis=1)
print (matches)
column1 id url \
0 s 2010-M012 http://www.example.orgs/012/xyz.pdf
1 d 2010-1234 http://www.example.orgd/1234/xyz.pdf
2 m 2010-1234 http://www.example.orgm/1234/xyz.pdf
url1 url2
0 http://www.example.orgs/012/xyz.pdf http://www.example.orgs/012/xyz.pdf
1 http://www.example.orgd/1234/xyz.pdf http://www.example.orgd/1234/xyz.pdf
2 http://www.example.orgm/1234/xyz.pdf http://www.example.orgm/1234/xyz.pdf
Timings:
matches = pd.DataFrame({'id':['2010-M012','2010-1234','2010-1234'],
'column1':['s','d','m']})
#[30000 rows x 2 columns]
matches = pd.concat([matches]*10000).reset_index(drop=True)
In [168]: %timeit matches['url'] = 'http://www.example.org' + matches['column1'] + '/' + np.where(matches['id'].str.contains('M|-0'), matches['id'].str[-3:], matches['id'].str[-4:]) + '/xyz.pdf'
10 loops, best of 3: 50.9 ms per loop
In [169]: %timeit matches['url1'] = 'http://www.example.org' + matches['column1'] + '/' + matches['id'].map(lambda x : x[-3:] if (('M' in x) or ('-0' in x)) else x[-4:]) + '/xyz.pdf'
10 loops, best of 3: 22.1 ms per loop
In [170]: %timeit matches['url2'] = matches.apply(lambda x: 'http://www.example.org{}/{}/xyz.pdf'.format(x['column1'], x['id'][-3:] if (('M' in x['id']) or ('-0' in x['id'])) else x['id'][-4:]), axis=1)
1 loop, best of 3: 1.07 s per loop
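The `%timeit` lines above only work inside IPython. As a rough sketch of how the comparison could be repeated in a plain script, the standard `timeit` module can be used; the helper names `with_where` and `with_map` are hypothetical, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'id': ['2010-M012', '2010-1234', '2010-1234'],
                     'column1': ['s', 'd', 'm']})
big = pd.concat([base] * 10000).reset_index(drop=True)  # 30000 rows, as above


def with_where(df):
    # Vectorized mask plus np.where, as in the first variant
    mask = df['id'].str.contains('M|-0')
    return ('http://www.example.org' + df['column1'] + '/' +
            np.where(mask, df['id'].str[-3:], df['id'].str[-4:]) + '/xyz.pdf')


def with_map(df):
    # Per-element lambda through Series.map, as in the second variant
    picked = df['id'].map(lambda x: x[-3:] if ('M' in x or '-0' in x) else x[-4:])
    return 'http://www.example.org' + df['column1'] + '/' + picked + '/xyz.pdf'


# Check that both variants build the same URLs before comparing speed
same_result = with_where(big).tolist() == with_map(big).tolist()

for func in (with_where, with_map):
    seconds = timeit.timeit(lambda: func(big), number=5) / 5
    print('{}: {:.1f} ms per call'.format(func.__name__, seconds * 1000))
```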
Answer 1: (score: 1)
Change:
(matches['id'].str[-3:] if matches['id'].str.contains('M|-0') else matches['id'].str[-4:])
to:
np.where(matches['id'].str.contains('M|-0'), matches['id'].str[-3:],matches['id'].str[-4:])
and see if it works.
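As an alternative not mentioned in either answer, pandas' own `Series.where` can express the same condition without NumPy and keeps the result as a Series. A minimal sketch, reusing the sample data from the first answer:

```python
import pandas as pd

matches = pd.DataFrame({'id': ['2010-M012', '2010-1234', '2010-1234'],
                        'column1': ['s', 'd', 'm']})

mask = matches['id'].str.contains('M|-0')
# Series.where keeps values where mask is True and takes `other` elsewhere
suffix = matches['id'].str[-3:].where(mask, matches['id'].str[-4:])
matches['url'] = 'http://www.example.org' + matches['column1'] + '/' + suffix + '/xyz.pdf'
print(matches['url'].tolist())
```

The mask can also be inlined to get the single expression the question asked for, at the cost of a longer line.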