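I have two DataFrames. The first has four columns: 'Sample_Artists', 'Sample_Songs', 'Sampled_Songs' and 'Sampled_Artists'. The second has two columns: 'Artist' and 'Song'. The second DataFrame contains all of the same artist and song names as the first, but the first DataFrame holds relational data that I want to preserve (in other words, every artist and song pair that appears in the first DataFrame is a unique row in the second DataFrame).
Basically, I want to create two more columns in my first DataFrame that use the index of my second DataFrame as an ID, so that for each unique Artist and Song pair I have the matching index from my second DataFrame.
Here is a simple example of what I am trying to do:
Say I have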
df =
Sample_Artist  Sample_Song  Sampled_Artist  Sampled_Song
A+             foo          B+              bar
A+             foobar       C+              barfoo
B+             5            A+              foobar
Then I have another DataFrame
df1 =
index  Artist  Song
0      A+      foo
1      A+      foobar
2      B+      bar
3      B+      5
4      C+      barfoo
Now I want to add two columns to my first DataFrame:
df =
Sample_Artist  Sample_Song  Sampled_Artist  Sampled_Song  Sample_ID  Sampled_ID
A+             foo          B+              bar           0          2
A+             foobar       C+              barfoo        1          4
B+             5            A+              foobar        3          1
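This seems very simple, but I can't figure out where to start. I did something similar using groupby, but couldn't get my indices to match those of my second DataFrame (df1 in the example).
Edit: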
import io
import pandas as pd

# Parse the whitespace-separated sample data from in-memory strings
df = pd.read_table(io.StringIO('''\
Sample_Artist Sample_Song Sampled_Artist Sampled_Song
A+ foo B+ bar
A+ foobar C+ barfoo
B+ 5 A+ foobar
A+ foo B+ 5'''), sep=r'\s+')
df1 = pd.read_table(io.StringIO('''\
Artist Song
A+ foo
A+ foobar
B+ bar
B+ 5
C+ barfoo'''), sep=r'\s+')

# Name each index, then move it into an ordinary column
df.index.names = ['Sample_ID']
df1.index.names = ['Sampled_ID']
df = df.reset_index()
df1 = df1.reset_index()

# Look up Sampled_ID by matching the sampled (artist, song) pair against df1
result = pd.merge(df, df1, left_on=['Sampled_Artist', 'Sampled_Song'],
                  right_on=['Artist', 'Song'],
                  how='left')
result = result[['Sample_Artist',
                 'Sample_Song',
                 'Sampled_Artist',
                 'Sampled_Song',
                 'Sample_ID',
                 'Sampled_ID']]
print(result)
  Sample_Artist Sample_Song Sampled_Artist Sampled_Song  Sample_ID  Sampled_ID
0            A+         foo             B+          bar          0           2
1            A+      foobar             C+       barfoo          1           4
2            B+           5             A+       foobar          2           1
3            A+         foo             B+            5          3           3
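So your code gives me Sample_ID and Sampled_ID both equal to 3 at index 3, when it should give Sample_ID = 0 and Sampled_ID = 3. The whole Sample_ID column is off (Sampled_ID is fine), but I can't work out why.
I would like to see: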
  Sample_Artist Sample_Song Sampled_Artist Sampled_Song  Sample_ID  Sampled_ID
0            A+         foo             B+          bar          0           2
1            A+      foobar             C+       barfoo          1           4
2            B+           5             A+       foobar          3           1
3            A+         foo             B+            5          0           3
Answer 0 (score: 0)
import io
import pandas as pd

# Parse the whitespace-separated sample data from in-memory strings
df = pd.read_table(io.StringIO('''\
Sample_Artist Sample_Song Sampled_Artist Sampled_Song
A+ foo B+ bar
A+ foobar C+ barfoo
B+ 5 A+ foobar
A+ foo B+ 5'''), sep=r'\s+')
df1 = pd.read_table(io.StringIO('''\
Artist Song
A+ foo
A+ foobar
B+ bar
B+ 5
C+ barfoo'''), sep=r'\s+')

# Expose df1's index as a Sampled_ID column
df1.index.names = ['Sampled_ID']
df1 = df1.reset_index()

# Sample_ID = index of the first row in df holding the same
# (Sample_Artist, Sample_Song) pair
grouped = df.groupby(['Sample_Artist', 'Sample_Song'])
df['Sample_ID'] = grouped['Sample_Artist'].transform(
    lambda grp: grp.index.get_level_values(0)[0])

# Look up Sampled_ID by matching the sampled (artist, song) pair against df1
result = pd.merge(df, df1, left_on=['Sampled_Artist', 'Sampled_Song'],
                  right_on=['Artist', 'Song'],
                  how='left')
result = result[['Sample_Artist',
                 'Sample_Song',
                 'Sampled_Artist',
                 'Sampled_Song',
                 'Sample_ID',
                 'Sampled_ID']]
print(result)
yields
  Sample_Artist Sample_Song Sampled_Artist Sampled_Song  Sample_ID  Sampled_ID
0            A+         foo             B+          bar          0           2
1            A+      foobar             C+       barfoo          1           4
2            B+           5             A+       foobar          2           1
3            A+         foo             B+            5          0           3
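Note that in the output above Sample_ID is the position of the first matching row within df, not the index from df1, which is why row 2 shows 2 rather than 3. If both IDs should come from df1, one possible approach is to merge against df1 twice, once per (artist, song) pair. The following is only a sketch under that assumption; the names `lookup` and `add_id` are made up for illustration, and `dtype=str` is used so that the song title 5 is compared as text in both frames.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''\
Sample_Artist Sample_Song Sampled_Artist Sampled_Song
A+ foo B+ bar
A+ foobar C+ barfoo
B+ 5 A+ foobar
A+ foo B+ 5'''), sep=r'\s+', dtype=str)
df1 = pd.read_csv(io.StringIO('''\
Artist Song
A+ foo
A+ foobar
B+ bar
B+ 5
C+ barfoo'''), sep=r'\s+', dtype=str)

# Turn df1's index into an ordinary column so it survives a merge.
lookup = df1.rename_axis('ID').reset_index()


def add_id(frame, artist_col, song_col, id_col):
    """Hypothetical helper: left-merge `lookup` on one (artist, song) pair."""
    keyed = lookup.rename(
        columns={'Artist': artist_col, 'Song': song_col, 'ID': id_col})
    # how='left' keeps every row of `frame` in its original order.
    return frame.merge(keyed, on=[artist_col, song_col], how='left')


result = add_id(df, 'Sample_Artist', 'Sample_Song', 'Sample_ID')
result = add_id(result, 'Sampled_Artist', 'Sampled_Song', 'Sampled_ID')
print(result)
```

With the sample data this should give Sample_ID = 0, 1, 3, 0 and Sampled_ID = 2, 4, 1, 3, matching the output requested in the question. It relies on every (Artist, Song) pair being unique in df1; duplicated pairs would multiply rows in the merge.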