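I have a list of the files submitted by each user. For example, user arjun001 has 5 documents, but they are spread across 2 different columns, and entries can be repeated. For example: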
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
myst="""
arjun001 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/approval.png
arjun001 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/download.png
arjun001 /doc/Repo/a/Documents/Occuation.pdf /doc/app/b/Documents/Income.jpg
sandip.123 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/Domicile.jpg
sandip.123 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/Bank.jpg
"""
import pandas as pd
u_cols = ['user_id', 'document_path', 'doc_path']
# The sample has no header row, so the column names are supplied explicitly.
df = pd.read_csv(StringIO(myst), sep=' ', names=u_cols)
How can I find the unique documents for each user? The expected output looks like this:
user_id, documents
arjun001 /doc/Repo/a/Documents/PanCard.pdf
arjun001 /doc/app/b/Documents/approval.png
arjun001 /doc/app/b/Documents/download.png
arjun001 /doc/Repo/a/Documents/Occuation.pdf
arjun001 /doc/app/b/Documents/Income.jpg
sandip.123 /doc/Repo/a/Documents/PanCard.pdf
sandip.123 /doc/app/b/Documents/Domicile.jpg
sandip.123 /doc/app/b/Documents/Bank.jpg
Answer 0 (score: 2)
Use melt together with drop_duplicates:
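# melt stacks both path columns into a single 'documents' column
df = (df.melt('user_id', value_name='documents')
        .sort_values('user_id')
        .drop_duplicates(['user_id', 'documents'])
        .drop(columns='variable')  # positional axis (.drop('variable', 1)) fails on pandas 2.x
        .reset_index(drop=True))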
# Alternative that gives the same result: set_index + unstack in place of melt
df = (df.set_index('user_id')
        .unstack()
        .reset_index(level=0, drop=True)
        .reset_index(name='documents')
        .sort_values('user_id')
        .drop_duplicates(['user_id', 'documents'])
        .reset_index(drop=True))
print(df)
user_id documents
0 arjun001 /doc/Repo/a/Documents/PanCard.pdf
1 arjun001 /doc/Repo/a/Documents/Occuation.pdf
2 arjun001 /doc/app/b/Documents/approval.png
3 arjun001 /doc/app/b/Documents/download.png
4 arjun001 /doc/app/b/Documents/Income.jpg
5 sandip.123 /doc/Repo/a/Documents/PanCard.pdf
6 sandip.123 /doc/app/b/Documents/Domicile.jpg
7 sandip.123 /doc/app/b/Documents/Bank.jpg
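For anyone who wants to run the melt approach end to end, here is a minimal self-contained sketch built on the sample data from the question. The helper name `unique_documents` and the variable names are my own, not from the answer above. I also assume a stable sort (`kind='mergesort'`) is wanted so that each user's documents stay in first-seen order; the default sort algorithm does not guarantee that.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question (space-separated, no header row).
raw = """\
arjun001 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/approval.png
arjun001 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/download.png
arjun001 /doc/Repo/a/Documents/Occuation.pdf /doc/app/b/Documents/Income.jpg
sandip.123 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/Domicile.jpg
sandip.123 /doc/Repo/a/Documents/PanCard.pdf /doc/app/b/Documents/Bank.jpg
"""

df = pd.read_csv(
    StringIO(raw), sep=" ", names=["user_id", "document_path", "doc_path"]
)


def unique_documents(frame):
    """Hypothetical helper: one row per distinct (user_id, document) pair."""
    # Stack both path columns into a single 'documents' column.
    long = frame.melt(id_vars="user_id", value_name="documents")
    return (
        long.drop(columns="variable")  # the source column name is not needed
        .drop_duplicates(["user_id", "documents"])
        # mergesort is stable, so rows keep their first-seen order per user
        .sort_values("user_id", kind="mergesort")
        .reset_index(drop=True)
    )


result = unique_documents(df)
# Number of distinct documents per user, as a quick sanity check.
counts = result.groupby("user_id")["documents"].size()

print(result)
print(counts)
```

On this sample the per-user counts come out as 5 for arjun001 and 3 for sandip.123, which matches the "5 documents" mentioned in the question.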