Join two CSV files (1-N relationship)

Date: 2017-10-19 09:45:42

Tags: python bash csv join awk

The CSV files are tab-separated.

file1.csv:

id_album   name        date
001        Nevermind   24/09/1991
...

file2.csv:

id_song   id_album   name  
001       001        Smells Like Teen Spirit
002       001        In Bloom
...

I want to get this output.csv:

id_album   name        date         songs
001        Nevermind   24/09/1991   001,Smells Like Teen Spirit,002,In Bloom,...

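Is there a way to do this in Bash (preferably) or Python?

My CSV files contain a lot of records (millions of rows).

Edit

I tried join / sed / awk, but I couldn't handle the 1-to-N relationship.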

2 answers:

Answer 0 (score: 2)

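An awk solution. The field separator below matches runs of two or more whitespace characters, as in the sample shown; for strictly tab-separated files, use -F'\t' instead: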

awk -F'[[:space:]][[:space:]]+' 'NR==FNR{ if(NR>1) a[$2]=($2 in a? a[$2]",":"")$1","$3; next}
       FNR==1{ print $0,"songs" }
       $1 in a{ print $0,a[$1] }' file2.csv OFS='\t' file1.csv > output.csv

Contents of output.csv:

id_album   name        date         songs
001        Nevermind   24/09/1991   001,Smells Like Teen Spirit,002,In Bloom
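For comparison, here is a minimal Python sketch of the same two-pass idea using only the standard `csv` module. It assumes the files really are tab-separated, as the question states, and, like the awk script, it keeps all songs in memory and skips albums that have no songs. The function name `join_albums_songs` and the in-memory sample data are hypothetical and only serve the illustration.

```python
import csv
import io
from collections import defaultdict


def join_albums_songs(albums_file, songs_file, out_file):
    """Append a comma-joined 'songs' column to each album row."""
    # Pass 1: collect id_song and name per album id (held in memory).
    songs_by_album = defaultdict(list)
    songs = csv.reader(songs_file, delimiter="\t")
    next(songs)  # skip the header row
    for row in songs:
        if len(row) < 3:  # ignore blank or malformed lines
            continue
        id_song, id_album, name = row[:3]
        songs_by_album[id_album].extend([id_song, name])

    # Pass 2: stream the albums and write each one with its songs.
    albums = csv.reader(albums_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(next(albums) + ["songs"])
    for row in albums:
        # Like the awk script, albums without songs are not written.
        if row and row[0] in songs_by_album:
            writer.writerow(row + [",".join(songs_by_album[row[0]])])


# Small in-memory demo built from the sample rows in the question.
albums_tsv = "id_album\tname\tdate\n001\tNevermind\t24/09/1991\n"
songs_tsv = (
    "id_song\tid_album\tname\n"
    "001\t001\tSmells Like Teen Spirit\n"
    "002\t001\tIn Bloom\n"
)
out = io.StringIO()
join_albums_songs(io.StringIO(albums_tsv), io.StringIO(songs_tsv), out)
result = out.getvalue()
print(result)
```

With real files, the three arguments would be file objects opened with `newline=""`, as the `csv` module documentation recommends.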

Answer 1 (score: 1)

TL;DR

import pandas as pd
from io import StringIO
file1 = """id_album,name,date
001,Nevermind,24/09/1991"""

file2 = """id_song,id_album,name
001,001,Smells Like Teen Spirit
002,001,In Bloom"""

df1 = pd.read_csv(StringIO(file1))
df1 = df1.rename(columns={'name':'album_name'})

df2 = pd.read_csv(StringIO(file2))
df2 = df2.rename(columns={'name':'song_name'})


df3 = df1.merge(df2, on='id_album')
df4 = pd.DataFrame(list({album['id_album'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}.items()), columns=['id_album', 'song_id_name'])

df_want = df1.merge(df4)

[OUT]:

>>> df_want
   id_album album_name        date                          song_id_name
0         1  Nevermind  24/09/1991  1,Smells Like Teen Spirit,2,In Bloom
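Note that `read_csv` parsed the ids as integers in the output above, so `001` became `1`. The following is a minimal sketch of a variant that keeps the ids as strings and reads tab-separated input, as the question describes. It is not the code from this answer: it swaps the dict comprehension for a plain `groupby(...).agg(",".join)`, and the variable names (`albums_tsv`, `songs_per_album`, and so on) are made up for the example.

```python
from io import StringIO

import pandas as pd

# Sample data from the question, written with real tab separators.
albums_tsv = "id_album\tname\tdate\n001\tNevermind\t24/09/1991\n"
songs_tsv = (
    "id_song\tid_album\tname\n"
    "001\t001\tSmells Like Teen Spirit\n"
    "002\t001\tIn Bloom\n"
)

# dtype=str keeps ids such as "001" from being parsed as the integer 1.
albums = pd.read_csv(StringIO(albums_tsv), sep="\t", dtype=str)
songs = pd.read_csv(StringIO(songs_tsv), sep="\t", dtype=str)

# Build one "id_song,name" string per song, then join them per album.
pairs = songs["id_song"] + "," + songs["name"]
songs_per_album = (
    pairs.groupby(songs["id_album"])
    .agg(",".join)
    .rename("songs")
    .reset_index()
)

# Inner merge: albums without any song are dropped.
result = albums.merge(songs_per_album, on="id_album")
print(result.to_csv(sep="\t", index=False))
```

For the real files, `StringIO(...)` would be replaced by the file paths, and `result.to_csv("output.csv", sep="\t", index=False)` would write the output.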

The long version

Given:

>>> import pandas as pd
>>> from io import StringIO
>>> file1 = """id_album,name,date
... 001,Nevermind,24/09/1991"""

>>> file2 = """id_song,id_album,name
... 001,001,Smells Like Teen Spirit
... 002,001,In Bloom"""

>>> df1 = pd.read_csv(StringIO(file1))
>>> df1 = df1.rename(columns={'name':'album_name'})

>>> df2 = pd.read_csv(StringIO(file2))
>>> df2 = df2.rename(columns={'name':'song_name'})

>>> df1
   id_album album_name        date
0         1  Nevermind  24/09/1991

>>> df2
   id_song  id_album                song_name
0        1         1  Smells Like Teen Spirit
1        2         1                 In Bloom

First, merge the two DataFrames on the id_album column:

>>> df3 = df1.merge(df2, on='id_album')
>>> df3
   id_album album_name        date  id_song                song_name
0         1  Nevermind  24/09/1991        1  Smells Like Teen Spirit
1         1  Nevermind  24/09/1991        2                 In Bloom

Now for some pandas tricks:

1. First, group the rows by the `id_album` column.
2. In each group, take the `id_song` and `song_name` columns, stack them, and join the values with commas:

>>> [','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])]
['1,Smells Like Teen Spirit,2,In Bloom']
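To see why stacking gives the interleaved order, here is a small sketch for a single group. The `album` frame is a hand-built stand-in for one group from `groupby`. `stack()` walks the frame row by row, so each song id is immediately followed by its song name.

```python
import pandas as pd

# Hand-built stand-in for one album group.
album = pd.DataFrame(
    {"id_song": [1, 2], "song_name": ["Smells Like Teen Spirit", "In Bloom"]}
)

# stack() turns the 2x2 frame into a Series indexed by (row, column),
# ordered row by row: id of row 0, name of row 0, id of row 1, ...
stacked = album[["id_song", "song_name"]].astype(str).stack()
joined = ",".join(stacked)

print(list(stacked.index))
print(joined)
```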

In a similar way, get the album_name from .groupby():

>>> [album['album_name'].unique()[0] for idx, album in df3.groupby(['id_album'])]
['Nevermind']

Let's combine the two groupby operations:

>>> {album['album_name'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}
{'Nevermind': '1,Smells Like Teen Spirit,2,In Bloom'}

>>> album2songs = {album['album_name'].unique()[0]:','.join(list(album[['id_song', 'song_name']].astype(str).stack())) for idx, album in df3.groupby(['id_album'])}

Put album2songs into a DataFrame:

>>> df4 = pd.DataFrame(list(album2songs.items()), columns=['album_name', 'song_id_name'])
>>> df4
  album_name                          song_id_name
0  Nevermind  1,Smells Like Teen Spirit,2,In Bloom

Now join df1 and df4:

>>> df1.merge(df4)
   id_album album_name        date                          song_id_name
0         1  Nevermind  24/09/1991  1,Smells Like Teen Spirit,2,In Bloom

BTW, @RomanPerekhrest's awk solution is much cooler!