pandas.apply function returns a random substring instead of the full string
Here is what I have tried:
def extract_ticker(title):
    for word in title:
        word_str = word.encode('utf-8')
        if word_str in constituents['Symbol'].values:
            return word_str

sp500news3['tickers'] = sp500news3['title'].apply(extract_ticker)
This returns
sp500news3['tickers']
79944 M
181781 M
213175 C
93554 C
257327 T
instead of the expected output
79944 MSFT
181781 WMB
213175 CSX
93554 C
257327 TWX
Sample data to reproduce the problem:
import pandas as pd

constituents = pd.DataFrame({"Symbol": ["TWX", "C", "MSFT", "WMB"]})
sp500news3 = pd.DataFrame({"title": ["MSFT Vista corporate sales go very well", "WMB No Anglican consensus on Episcopal Church", "CSX quarterly profit rises", "C says 30 bln capital helps exceed target", "TWX plans cable spinoff"]})
Answer 0 (score: 0)
Why not use a regex extract for the tickers?
tickers = ('TWX', 'C', 'MSFT', 'WMB')
regex = '({})'.format('|'.join(tickers))
sp500news3['tickers'] = sp500news3['title'].str.extract(regex)
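One thing to watch for with this pattern, shown here as a small sketch using only Python's re module: the alternation is not anchored, so the single-letter symbol C also matches the first letter of a longer word such as CSX. The names unanchored, anchored, loose_match and strict_match are illustrative and not part of the original answer.

```python
import re

tickers = ('TWX', 'C', 'MSFT', 'WMB')

# Same pattern as above: a plain alternation with no word boundaries
unanchored = re.compile('({})'.format('|'.join(tickers)))
# Variant that only accepts a symbol standing as a whole word
anchored = re.compile(r'\b({})\b'.format('|'.join(tickers)))

title = 'CSX quarterly profit rises'
loose_match = unanchored.search(title)
strict_match = anchored.search(title)

print(loose_match.group(1))  # C (the first letter of CSX)
print(strict_match)          # None (no whole-word symbol in this title)
```

Whether a partial match like this matters depends on the real symbol list; with the sample data, CSX is not a listed symbol, so a missing value is arguably the more honest result for that row.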
Answer 1 (score: 0)
Use Series.str.extract with the values joined by | and each one wrapped in word boundaries (\b):
pat = '|'.join(r"\b{}\b".format(x) for x in constituents['Symbol'])
sp500news3['tickers'] = sp500news3['title'].str.extract('('+ pat + ')', expand=False)
print(sp500news3)
                                           title  tickers
0        MSFT Vista corporate sales go very well     MSFT
1  WMB No Anglican consensus on Episcopal Church      WMB
2                     CSX quarterly profit rises      NaN
3      C says 30 bln capital helps exceed target        C
4                        TWX plans cable spinoff      TWX
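If some symbols contain regex metacharacters (a dot, for instance), a slightly more defensive variant is to escape each symbol with re.escape before joining. This is only a sketch: the symbol BRK.B and the last two titles below are made up for illustration and are not part of the question's data.

```python
import re
import pandas as pd

# BRK.B is a hypothetical addition to show why escaping can matter
symbols = pd.Series(["TWX", "C", "MSFT", "WMB", "BRK.B"])
titles = pd.Series([
    "MSFT Vista corporate sales go very well",
    "CSX quarterly profit rises",
    "BRK.B shares climb",        # hypothetical title
    "BRKXB is not a ticker",     # hypothetical title; must not match BRK.B
])

# Escape each symbol so that "." is matched literally, then wrap the
# whole alternation in word boundaries and a single capture group
pat = r"\b(" + "|".join(re.escape(s) for s in symbols) + r")\b"
found = titles.str.extract(pat, expand=False)
print(found.tolist())
```

Without the escaping step, the dot in BRK.B would match any character, so the last title would be reported as a hit.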
Your own solution should work if you split the title on whitespace with split; the encode call probably has to be removed as well:
def extract_ticker(title):
    for word in title.split():
        word_str = word
        if word_str in constituents['Symbol'].values:
            return word_str

sp500news3['tickers'] = sp500news3['title'].apply(extract_ticker)
print(sp500news3)
                                           title  tickers
0        MSFT Vista corporate sales go very well     MSFT
1  WMB No Anglican consensus on Episcopal Church      WMB
2                     CSX quarterly profit rises     None
3      C says 30 bln capital helps exceed target        C
4                        TWX plans cable spinoff      TWX
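For reference, the single letters in the question's output come from iterating over the title string directly: a for loop over a str yields one character at a time, so only one-letter symbols can ever match (presumably M, C and T are themselves symbols in the full constituents list, which is an assumption, since the sample data only contains C). A minimal sketch in plain Python, with illustrative variable names:

```python
title = "MSFT Vista corporate sales go very well"

chars = list(title)[:4]    # iterating a str yields single characters
words = title.split()[:2]  # split() yields whitespace-separated words

print(chars)  # ['M', 'S', 'F', 'T']
print(words)  # ['MSFT', 'Vista']

# In Python 3, encode() returns bytes, which never compare equal to str,
# so an encoded word could not match the string values in the Symbol column
encoded_equal = "MSFT".encode("utf-8") == "MSFT"
print(encoded_equal)  # False
```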