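I am building entity-based sentiment classification for forex news analysis. Each news article may mention several currencies. However, I am struggling with how to split a single row (for example {'USD':1, "JPY":-1}, based on existing human labels) into separate rows.
The sample dataframe currently looks like this: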
   sentiment      text
0  USD:1,CNY:-1   US economy is improving while China is struggling
1  USD:-1, JPY:1  Unemployment is high for US while low for Japan
and I would like to convert it into multiple rows like this:
   currency  sentiment  text
0  USD               1  US economy is improving while China is struggling
1  CNY              -1  US economy is improving while China is struggling
2  USD              -1  Unemployment is high for US while low for Japan
3  JPY               1  Unemployment is high for US while low for Japan
Thanks a lot for your help.
Answer 0: (score: 1)
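You can split the sentiment column on ,|: with expand=True and then stack the result. Then use DataFrame.reindex together with Index.repeat to repeat the text column once per piece, based on the len of the comma split.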
# Strip spaces, split the column on both ',' and ':', then stack.
s = df['sentiment'].str.replace(' ', '').str.split(',|:', expand=True).stack()
# Repeat each row once per 'CUR:value' pair, then reset the index.
df1 = df.reindex(df.index.repeat(df['sentiment'].fillna("").str.split(',').apply(len)))
df1 = df1.reset_index(drop=True)
# Even positions of s hold the currency codes, odd positions the labels.
df1['currency'] = s.iloc[::2].reset_index(drop=True)
df1['sentiment'] = s.iloc[1::2].reset_index(drop=True)
print(df1.sort_index(axis=1))
   currency  sentiment  text
0  USD               1  US economy is improving while China is struggling
1  CNY              -1  US economy is improving while China is struggling
2  USD              -1  Unemployment is high for US while low for Japan
3  JPY               1  Unemployment is high for US while low for Japan
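To see why the even/odd slicing works, it may help to look at the stacked tokens on their own. The sketch below rebuilds the sample frame (the construction of df is an assumption, since the question only shows its printed form) and strips spaces first so that the second pair in 'USD:-1, JPY:1' does not keep a leading blank.

```python
import pandas as pd

# Assumed reconstruction of the sample frame from the question.
df = pd.DataFrame({
    'sentiment': ['USD:1,CNY:-1', 'USD:-1, JPY:1'],
    'text': ['US economy is improving while China is struggling',
             'Unemployment is high for US while low for Japan'],
})

# After splitting on ',' or ':' and stacking, the tokens alternate:
# currency, value, currency, value, ...
tokens = (df['sentiment']
          .str.replace(' ', '', regex=False)
          .str.split(r'[,:]', expand=True)
          .stack()
          .tolist())

currencies = tokens[::2]   # even positions hold the currency codes
values = tokens[1::2]      # odd positions hold the sentiment labels (still strings)
print(currencies)
print(values)
```

Note that the labels come out as strings; converting them with int (or astype(int) on a column) is a separate step.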
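Answer 1: (score: 1)
You can also expand the sentiment column by splitting on ',' and then use melt to turn the resulting columns into rows.
# Split on ',' into one column per piece and attach the pieces to the original frame.
df1 = df.merge(df['sentiment'].str.split(',', expand=True), left_index=True, right_index=True, how='outer')
df1.drop('sentiment', axis=1, inplace=True)
# Melt the piece columns into rows, keeping text as the identifier.
df1 = df1.melt('text')
# Split each 'CUR:value' piece into currency and sentiment.
df1[['currency', 'sentiment']] = df1['value'].str.strip().str.split(':', expand=True)
df1.drop(['variable', 'value'], axis=1, inplace=True)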
Output:
   text                                               currency  sentiment
0  US economy is improving while China is struggling  CNY              -1
1  Unemployment is high for US while low for Japan    JPY               1
2  US economy is improving while China is struggling  USD               1
3  Unemployment is high for US while low for Japan    USD              -1
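The role of melt here is to turn the wide columns produced by the split into one row per piece. A minimal sketch on a hand-built wide frame follows; the frame and its column names first and second are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Hypothetical wide frame: one column per split piece.
wide = pd.DataFrame({
    'text': ['article A', 'article B'],
    'first': ['USD:1', 'USD:-1'],
    'second': ['CNY:-1', 'JPY:1'],
})

# melt keeps 'text' as the identifier and stacks the remaining
# columns into (variable, value) pairs, one row per pair.
long_df = wide.melt(id_vars='text')

# Split each 'CUR:value' piece into its two parts.
long_df[['currency', 'sentiment']] = long_df['value'].str.split(':', expand=True)
long_df = long_df.drop(columns=['variable', 'value'])
long_df['sentiment'] = long_df['sentiment'].astype(int)
print(long_df)
```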
Answer 2: (score: 1)
You can construct a new dataframe, chaining and repeating values as needed.
import numpy as np
import pandas as pd
from itertools import chain
df = pd.DataFrame({'sentiment': ['USD:1,CNY:-1', 'USD:-1, JPY:1'],
                   'text': ['US economy is improving while China is struggling',
                            'Unemployment is high for US while low for Japan']})
# remove whitespace and split by ','
df['sentiment'] = df['sentiment'].str.replace(' ', '').str.split(',')
# construct expanded dataframe
res = pd.DataFrame({'sentiment': list(chain.from_iterable(df['sentiment'])),
                    'text': np.repeat(df['text'], df['sentiment'].map(len))})
# split sentiment series into currency and value components
res[['currency', 'sentiment']] = res.pop('sentiment').str.split(':', expand=True)
res['sentiment'] = res['sentiment'].astype(int)
Result:
print(res)
   text                                               currency  sentiment
0  US economy is improving while China is struggling  USD               1
0  US economy is improving while China is struggling  CNY              -1
1  Unemployment is high for US while low for Japan    USD              -1
1  Unemployment is high for US while low for Japan    JPY               1
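The two building blocks of this answer, chain.from_iterable and np.repeat, can be tried in isolation on plain lists. In the sketch below, pieces and texts are made-up stand-ins for the split sentiment column and the text column.

```python
from itertools import chain

import numpy as np

# Hypothetical stand-ins for the split sentiment lists and the texts.
pieces = [['USD:1', 'CNY:-1'], ['USD:-1', 'JPY:1']]
texts = ['article A', 'article B']

# Flatten the nested lists into one list of 'CUR:value' strings.
flat = list(chain.from_iterable(pieces))

# Repeat each text once per piece so both lists line up row by row.
repeated = np.repeat(texts, [len(p) for p in pieces]).tolist()

print(flat)
print(repeated)
```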
Answer 3: (score: 1)
This should work:
# Split on ',' and stack so each 'CUR:value' piece gets its own row.
s = df['sentiment'].str.split(',', expand=True).stack()
s.index = s.index.droplevel(-1)
s.name = 'sentiment'
del df['sentiment']
df = df.join(s)
df['currency'] = df.sentiment.apply(lambda x: x.split(':')[0].strip())
df['sentiment'] = df.sentiment.apply(lambda x: int(x.split(':')[-1]))
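On pandas 0.25 or later, DataFrame.explode offers a shorter route to the same split-then-join idea. This is a different technique from the stack-based code in this answer and is shown only as a sketch; the construction of df is assumed from the question.

```python
import pandas as pd

# Assumed reconstruction of the sample frame from the question.
df = pd.DataFrame({
    'sentiment': ['USD:1,CNY:-1', 'USD:-1, JPY:1'],
    'text': ['US economy is improving while China is struggling',
             'Unemployment is high for US while low for Japan'],
})

# Turn each 'sentiment' cell into a list, then give every list item its own row.
out = (df.assign(sentiment=df['sentiment'].str.split(','))
         .explode('sentiment')
         .reset_index(drop=True))

# Split 'CUR:value' into two columns, dropping stray spaces first.
parts = out['sentiment'].str.strip().str.split(':', expand=True)
out['currency'] = parts[0]
out['sentiment'] = parts[1].astype(int)
out = out[['currency', 'sentiment', 'text']]
print(out)
```

Because assign returns a new frame, the original df is left unchanged.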
Answer 4: (score: -1)
Try this (it does not modify the original DataFrame):
import re
import pandas as pd

def parse_sentiment(sentiment):
    currencies = sentiment.split(',')
    result = dict()
    # remove spaces from currencies
    for c in currencies:
        temp = re.sub(r'\s+', '', c).split(':')
        result[temp[0]] = int(temp[1])
    return result

i = 0
rows = []
for s in df.loc[:, 'sentiment']:
    temp = parse_sentiment(s)
    # one output row per currency found in this article
    for t in temp:
        temp_row = [t, temp[t], df.iloc[i]['text']]
        rows.append(temp_row)
    i += 1
df_new = pd.DataFrame(rows, columns=['currency', 'sentiment', 'text'])