Question

pandas apply function returns a random substring instead of the full string

I have tried:

def extract_ticker(title):
    for word in title:
        word_str = word.encode('utf-8')
        if word_str in constituents['Symbol'].values:
            return word_str
sp500news3['tickers'] = sp500news3['title'].apply(extract_ticker)

which returns

sp500news3['tickers'] 

79944        M
181781       M
213175       C
93554        C
257327       T

instead of the expected output

79944        MSFT
181781       WMB
213175       CSX
93554        C
257327       TWX

A sample can be created from the following:

constituents =  pd.DataFrame({"Symbol":["TWX","C","MSFT","WMB"]})

sp500news3 = pd.DataFrame({"title":["MSFT Vista corporate sales go very well","WMB No Anglican consensus on Episcopal Church","CSX quarterly profit rises",'C says 30 bln capital helps exceed target','TWX plans cable spinoff']})

Answer 1

Why not extract the ticker with a regular expression instead?

tickers = ('TWX', 'C', 'MSFT', 'WMB')
regex = '({})'.format('|'.join(tickers))

sp500news3['tickers'] = sp500news3['title'].str.extract(regex)
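
One caveat worth checking before relying on this pattern: it has no word boundaries, so a one-letter symbol such as C can match the first letter of a longer word. The snippet below is only a small sketch of that behaviour; the two titles are taken from the question and the variable name `extracted` is mine.

```python
import pandas as pd

tickers = ('TWX', 'C', 'MSFT', 'WMB')
regex = '({})'.format('|'.join(tickers))

titles = pd.Series(["CSX quarterly profit rises", "TWX plans cable spinoff"])

# expand=False returns a Series when the pattern has a single group.
extracted = titles.str.extract(regex, expand=False)

# "CSX" is not in the tickers tuple, yet the alternative "C" matches its first letter.
print(extracted.tolist())  # ['C', 'TWX']
```

Wrapping each symbol in `\b` avoids this kind of partial match.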

Answer 2

Use Series.str.extract with the symbols wrapped in word boundaries and joined by |:

pat = '|'.join(r"\b{}\b".format(x) for x in constituents['Symbol'])

sp500news3['tickers'] = sp500news3['title'].str.extract('('+ pat + ')', expand=False)
print (sp500news3)
                                           title tickers
0        MSFT Vista corporate sales go very well    MSFT
1  WMB No Anglican consensus on Episcopal Church     WMB
2                     CSX quarterly profit rises     NaN
3      C says 30 bln capital helps exceed target       C
4                        TWX plans cable spinoff     TWX
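
If some symbols contain regex metacharacters (a dot, for example), it may be safer to escape them before joining. This is a minimal sketch under that assumption; "BRK.B" and the extra titles are hypothetical and not part of the question's data.

```python
import re
import pandas as pd

# "BRK.B" is a hypothetical extra symbol added to show why escaping matters.
symbols = ["TWX", "C", "MSFT", "WMB", "BRK.B"]

# re.escape makes "." match a literal dot instead of any character.
pat = r"\b(" + "|".join(re.escape(s) for s in symbols) + r")\b"

titles = pd.Series([
    "BRK.B shares rise",
    "C says 30 bln capital helps exceed target",
    "BRKXB is not a listed symbol here",   # would match an unescaped "BRK.B"
    "CSX quarterly profit rises",          # "C" alone must not match inside "CSX"
])

result = titles.str.extract(pat, expand=False)
print(result.tolist())
```

The first two titles yield BRK.B and C, and the last two yield NaN because nothing matches as a whole word.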

Your solution should work if you split the title on whitespace with split; the encode call probably has to be removed as well:

def extract_ticker(title):
    for word in title.split():
        word_str = word
        if word_str in constituents['Symbol'].values:
            return word_str

sp500news3['tickers'] = sp500news3['title'].apply(extract_ticker)
print (sp500news3)
                                           title tickers
0        MSFT Vista corporate sales go very well    MSFT
1  WMB No Anglican consensus on Episcopal Church     WMB
2                     CSX quarterly profit rises    None
3      C says 30 bln capital helps exceed target       C
4                        TWX plans cable spinoff     TWX
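
For what it is worth, the single letters in the question come from iterating over the string itself, which yields one character at a time, so any one-letter symbol can match. Also note that in Python 3, encode returns bytes, which never compare equal to str values. The sketch below shows the difference and a variant of the function that looks words up in a set; the names `symbols` and `tickers` are introduced here for illustration.

```python
import pandas as pd

constituents = pd.DataFrame({"Symbol": ["TWX", "C", "MSFT", "WMB"]})
sp500news3 = pd.DataFrame({"title": [
    "MSFT Vista corporate sales go very well",
    "WMB No Anglican consensus on Episcopal Church",
    "CSX quarterly profit rises",
    "C says 30 bln capital helps exceed target",
    "TWX plans cable spinoff",
]})

title = sp500news3["title"].iloc[0]
print(list(title)[:4])    # iterating a str gives characters: ['M', 'S', 'F', 'T']
print(title.split()[:2])  # split() gives whole words: ['MSFT', 'Vista']

# Build the set once so each membership test is a cheap hash lookup.
symbols = set(constituents["Symbol"])

def extract_ticker(title):
    # Return the first word that is a known symbol, or None if there is none.
    return next((word for word in title.split() if word in symbols), None)

tickers = sp500news3["title"].apply(extract_ticker)
print(tickers)
```

As in the output above, the CSX row has no ticker, because CSX is not among the sample symbols.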
