I'm new to Python, and I'm trying to join two CSV files (delimited by ";").
CSV1
Sender;Recipient
Adam;123
Alex;234
John;123
Adam;888
CSV2
Name;Phone
Winnie;123,234,456
Celeste;777,888,999
Expected output:
Sender;Recipient;RecipientName
Adam;123;Winnie
Alex;234;Winnie
John;123;Winnie
Adam;888;Celeste
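The Phone column in CSV2 is comma-separated, so when matching I need to do some kind of search, or a %LIKE%.
I know I can use join to do a vlookup-style lookup, but how do I implement the %LIKE%?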
Answer 0 (score: 3)
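str.split converts the Phone column into lists.
str.len() finds the length of each list. We'll use it to expand the 'Name' column.
repeat explodes 'Name'.
df is a copy of d1 to which we add a new column, using map and the new dictionary we made.

import numpy as np
p = d2.Phone.str.split(',')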
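# keep only the rows whose list is non-empty
p = p[p.astype(bool)]
# number of phone numbers per name
l = p.str.len()
# flatten all the lists into one integer array
p2 = np.concatenate(p.values).astype(int)
# repeat each name once per phone number
nm = d2.Name.repeat(l)
# dictionary: phone number -> name
m = dict(zip(p2, nm))
# copy of d1 with the looked-up names as a new column
df = d1.assign(RecipientName=d1.Recipient.map(m))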
print(df)
  Sender  Recipient RecipientName
0   Adam        123        Winnie
1   Alex        234        Winnie
2   John        123        Winnie
3   Adam        888       Celeste
# index=False drops the row index; the header row is written by default
df.to_csv('out.csv', sep=';', index=False)
Sender;Recipient;RecipientName
Adam;123;Winnie
Alex;234;Winnie
John;123;Winnie
Adam;888;Celeste
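The same split-and-expand idea can also be sketched with DataFrame.explode (available in pandas 0.25 and later) followed by an ordinary left merge. This is an alternative to the method above, not a restatement of it, and it assumes each phone number belongs to only one name. The frame names d1 and d2 follow the answer; lookup and out are names made up for this illustration.

```python
from io import StringIO

import pandas as pd

# Sample data from the question, read from strings instead of files.
d1 = pd.read_csv(StringIO("Sender;Recipient\nAdam;123\nAlex;234\nJohn;123\nAdam;888"), sep=";")
d2 = pd.read_csv(StringIO("Name;Phone\nWinnie;123,234,456\nCeleste;777,888,999"), sep=";")

# One row per (name, phone number): split the string, then explode the lists.
# `lookup` is a hypothetical name used only in this sketch.
lookup = (
    d2.dropna(subset=["Phone"])
      .assign(Phone=lambda d: d["Phone"].str.split(","))
      .explode("Phone")
      .astype({"Phone": "int64"})
      .rename(columns={"Name": "RecipientName", "Phone": "Recipient"})
)

# A left merge keeps every row of d1, whether or not a name was found.
out = d1.merge(lookup, on="Recipient", how="left")
print(out.to_csv(sep=";", index=False))
```

If a phone number appeared under two names, the merge would return one output row per match, so it may be worth checking lookup["Recipient"].is_unique first.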
Answer 1 (score: 1)
A solution using map with a Series:
import numpy as np
import pandas as pd
from itertools import chain
# drop rows that have no phone numbers
df2 = df2.dropna(subset=['Phone'])
# split values by `,` into lists
phones = df2['Phone'].str.split(',')
# number of phone numbers per name
lens = phones.str.len()
# repeat each Name once per phone number; flatten the lists with chain.from_iterable
s = pd.Series(np.repeat(df2.Name.values, lens),
              index=list(chain.from_iterable(phones)))
#convert index to int for match
s.index = s.index.astype(int)
print (s)
123     Winnie
234     Winnie
456     Winnie
777    Celeste
888    Celeste
999    Celeste
dtype: object
#map values to new column
df1['RecipientName'] = df1['Recipient'].map(s)
print(df1)
  Sender  Recipient RecipientName
0   Adam        123        Winnie
1   Alex        234        Winnie
2   John        123        Winnie
3   Adam        888       Celeste
#write to csv (index=False drops the row index; the header row is written by default)
df1.to_csv('out.csv', sep=';', index=False)
Sender;Recipient;RecipientName
Adam;123;Winnie
Alex;234;Winnie
John;123;Winnie
Adam;888;Celeste
A similar solution, using join in place of map:
df2['Phone'] = df2['Phone'].str.split(',')
df2 = df2.dropna(subset=['Phone'])
s = pd.Series(np.repeat(df2.Name.values, df2.Phone.str.len()),
index= list(chain.from_iterable(df2.Phone.values)))
s.index = s.index.astype(int)
s.name = 'RecipientName'
print (s)
df1 = df1.join(s, on='Recipient')
print(df1)
  Sender  Recipient RecipientName
0   Adam        123        Winnie
1   Alex        234        Winnie
2   John        123        Winnie
3   Adam        888       Celeste
Edit:
My sample data:
import pandas as pd
from io import StringIO
temp=u"""
Sender;Recipient
Adam;123
Alex;234
John;123
Adam;888"""
#after testing, replace StringIO(temp) with 'filename.csv'
df1 = pd.read_csv(StringIO(temp), sep=";")
print (df1)
  Sender  Recipient
0   Adam        123
1   Alex        234
2   John        123
3   Adam        888
temp=u"""
Name;Phone
Winnie;123,234,456
Celeste;777,888,999"""
#after testing, replace StringIO(temp) with 'filename.csv'
df2 = pd.read_csv(StringIO(temp), sep=";")
print (df2)
      Name        Phone
0   Winnie  123,234,456
1  Celeste  777,888,999
Answer 2 (score: 0)
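Here is a rough sketch and some thoughts on how to do this.
I would parse the CSV2 file first. Skip the first line, then parse the name and phone numbers from each of the following lines, maintaining a dictionary that maps each phone number to its name.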
numbers_to_names = {}
with open("csv2", "r") as f:
    next(f)  # skip the header line
    for line in f:
        name, phone_numbers = line.strip().split(";")
        for phone_number in phone_numbers.split(","):
            numbers_to_names[phone_number] = name
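Then go through CSV1: again skip the first line, parse the sender and recipient from each line, and combine them with the dictionary built earlier.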
with open("csv1", "r") as f:
    next(f)  # skip the header line
    for line in f:
        sender, recipient = line.strip().split(";")
        print("%s;%s;%s" % (sender, recipient, numbers_to_names[recipient]))
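The two passes above could also be written with the standard library's csv module, which takes care of the delimiter and the header row. The following is only a sketch under some assumptions: io.StringIO objects stand in for the real files so that it runs as-is, the names csv1, csv2 and result are placeholders, and an unknown recipient gets an empty name instead of raising KeyError.

```python
import csv
import io

# Sample data from the question; io.StringIO stands in for the real files.
csv1 = io.StringIO("Sender;Recipient\nAdam;123\nAlex;234\nJohn;123\nAdam;888\n")
csv2 = io.StringIO("Name;Phone\nWinnie;123,234,456\nCeleste;777,888,999\n")

# Build the phone number -> name dictionary from CSV2.
numbers_to_names = {}
for row in csv.DictReader(csv2, delimiter=";"):
    for phone_number in row["Phone"].split(","):
        numbers_to_names[phone_number.strip()] = row["Name"]

# Write CSV1 back out with the extra RecipientName column.
result = io.StringIO()
writer = csv.writer(result, delimiter=";", lineterminator="\n")
writer.writerow(["Sender", "Recipient", "RecipientName"])
for row in csv.DictReader(csv1, delimiter=";"):
    # .get() leaves the name empty when a number is not listed in CSV2.
    name = numbers_to_names.get(row["Recipient"], "")
    writer.writerow([row["Sender"], row["Recipient"], name])

print(result.getvalue())
```

With real files, the StringIO objects would be replaced by open(..., newline="") handles, as the csv module documentation recommends.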