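I am still learning and exploring pandas and am very new to it (I have only been using it for a few days). My csv file is large, with several hundred thousand rows. My goal is to join multiple rows into a single row based on multiple columns. Most importantly, the rows have to be matched with reference to the date/time, which I also need to include in the output later. My csv file looks like this: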
Body                 UDH           Original Sender ID  Received Date/Time
Hi John, Can You     ABC0010101    GGQMS               01/02/2001 01:03:19
Wait A moment?       ABC0010102    GGQMS               01/02/2001 01:03:20
Whats is             050004000111  112233445566        01/03/2001 11:16:01
Carrine Doing        050004000112  112233445566        01/03/2001 11:16:01
Over There?          050004000113  112233445566        01/03/2001 11:16:02
Where is             CD10F1011     zwerty              01/03/2001 15:22:10
Your Homework?       CD10F1012     zwerty              01/03/2001 15:22:11
Order for Pizza      AACCDD55001   112233445566        01/04/2001 19:20:21
Now for cheap $.     AACCDD55002   112233445566        01/04/2001 19:20:22
John, you know       G0500781      GGQMS               01/04/2001 10:21:21
Where can I get it?  G0500782      GGQMS               01/04/2001 10:21:21
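The above is my csv file. UDH acts as the key here: its characters from the first through the second-to-last identify which message a Body part belongs to. The other relevant column is 'Received Date/Time': the second part of a Body arrives 1 second later than the first, or possibly more than 1 second later.
I managed to concatenate the Body parts, but some messages consist of three parts, and for those the Body is not fully concatenated.
Below is my current code: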
def problem3():
    filep2 = pd.read_csv(r'/Users/John/Downloads/Practice1/my_r.csv')
    #data cleaning
    filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
    filep2['UDH']=filep2['UDH'].astype(object)
    filep2['Original Sender ID']=filep2['Original Sender ID'].astype(object)
    filep2['Account User Name']=filep2['Account User Name'].astype(object)
    filep2['Body']=filep2['Body'].astype(str)
    filep2['UDH']=filep2['UDH'].str.strip()
    df = pd.DataFrame(filep2)
    #Filter null row in UDH column
    df=df[df['UDH'].notnull()]
    df=df.sort_values(by ='UDH')
    df['Body'] = df.apply(multiple_condition, axis=1)
    df.to_csv(r'/Users/John/Downloads/Practice1/my_c.csv', index=False, header=True)

def multiple_condition (df):
    if (df['UDH'].str.len() == 8):
        df=df.groupby(df[['UDH'].str[:7],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
        return df
    elif (df['UDH'].str.len() == 9):
        df= df.groupby(df[['UDH'].str[:8],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
        return df
    elif (df['UDH'].str.len() == 10):
        df= df.groupby(df[['UDH'].str[:9],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
        return df
    elif (df['UDH'].str.len() == 11):
        df=df.groupby(df[['UDH'].str[:10],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
        return df
    elif (df['UDH'].str.len() == 12):
        df=df.groupby(df[['UDH'].str[:11],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
        return df
The code above raises the error named in the title of this question. The error message is as follows.
Updated error message:
Traceback (most recent call last):
  File "<ipython-input-85-8ca58b5f49ad>", line 1, in <module>
    runfile('/Users/syafiq/Downloads/RoutingPractice01.py', wdir='/Users/syafiq/Downloads')
  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/John/Downloads/RoutingPractice01.py", line 79, in <module>
    problem3()
  File "/Users/John/Downloads/RoutingPractice01.py", line 35, in problem3
    filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Received Date/Time'
Below is the desired output:
Body                                Original Sender ID  Received Date/Time
Hi John, Can You Wait A Moment?     GGQMS               01/02/2001 01:03:20
What Is carbine doing over there?   112233445566        01/03/2001 11:16:02
Where is your homework?             zwerty              01/03/2001 15:22:11
Order for Pizza Now for cheap $     112233445566        01/04/2001 19:20:22
John, you know where can I get it?  GGQMS               01/04/2001 10:21:21
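Note: I have tried several approaches to get the desired output above, but I still cannot solve it and keep running into errors. I have tried countless times with different methods, but still no dice; I keep hitting a wall. UDH is the identifier used to group the Body parts.
I am still new to pandas and have not touched Python for a while. I would be grateful if someone could point out where I went wrong, and any help getting the desired output is much appreciated.
Thank you very much! :)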
Answer 0 (score: 1)
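A note on the traceback first: the KeyError is raised by `filep2['Received Date/Time']` itself, so at that point the frame has no column with exactly that label, and the grouping code is never reached. Stray whitespace in the header is one common cause, but that is only a guess here because the raw file is not shown. A minimal sketch for checking, using a made-up sample whose header cells are padded with spaces:

```python
import io
import pandas as pd

# Hypothetical sample: header cells padded with spaces. This is an
# assumption about the real file, which is not shown in the question.
raw = "Body , UDH , Received Date/Time \nHi,ABC0010101,01/02/2001 01:03:19\n"

df = pd.read_csv(io.StringIO(raw))
print(df.columns.tolist())   # inspect the labels pandas actually parsed

has_before = 'Received Date/Time' in df.columns

# Strip surrounding whitespace from every column label.
df.columns = df.columns.str.strip()
has_after = 'Received Date/Time' in df.columns
```

If the printed list instead shows one long label containing the whole header line, the file probably uses a delimiter other than a comma, and `sep` would need to be passed to `read_csv`.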
Without apply(), using groupby() directly gives (more or less) the expected result:
groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max}).reset_index()
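The slice `str[:-1]` drops only the last character, so it yields the message prefix whatever the length of the UDH, and the chain of length checks from the question is not needed. A small sketch on a few UDH values from the sample, assuming the part number is always a single trailing character (true for the sample, not confirmed for the full file):

```python
import pandas as pd

# UDH values of different lengths, taken from the sample data.
udh = pd.Series(['ABC0010101', 'ABC0010102', 'CD10F1011', 'CD10F1012',
                 '050004000111', '050004000112', '050004000113'])

# Everything except the last character identifies the message.
key = udh.str[:-1]
print(key.unique().tolist())

# Number of parts found for each message prefix.
parts = udh.groupby(key).size().to_dict()
print(parts)
```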
I use io.StringIO() only to simulate a file.
import io
import pandas as pd

# Columns are separated by two or more spaces; single spaces occur inside fields.
text = '''Body                 UDH           Original Sender ID  Received Date/Time
Hi John, Can You     ABC0010101    GGQMS               01/02/2001 01:03:19
Wait A moment?       ABC0010102    GGQMS               01/02/2001 01:03:20
Whats is             050004000111  112233445566        01/03/2001 11:16:01
Carrine Doing        050004000112  112233445566        01/03/2001 11:16:01
Over There?          050004000113  112233445566        01/03/2001 11:16:02
Where is             CD10F1011     zwerty              01/03/2001 15:22:10
Your Homework?       CD10F1012     zwerty              01/03/2001 15:22:11
Order for Pizza      AACCDD55001   112233445566        01/04/2001 19:20:21
Now for cheap $.     AACCDD55002   112233445566        01/04/2001 19:20:22
John, you know       G0500781      GGQMS               01/04/2001 10:21:21
Where can I get it?  G0500782      GGQMS               01/04/2001 10:21:21'''

# A regex separator (two or more whitespace characters) needs the python engine.
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')

# Cleaning steps from the question, not needed for this sample:
#df['Received Date/Time'] = df['Received Date/Time'].astype('datetime64[ns]')
#df['UDH'] = df['UDH'].astype(object)
#df['Original Sender ID'] = df['Original Sender ID'].astype(object)
#df['Account User Name'] = df['Account User Name'].astype(object)
#df['Body'] = df['Body'].astype(str)
#df['UDH'] = df['UDH'].str.strip()

#Filter null row in UDH column
#df = df[df['UDH'].notnull()]
#df = df.sort_values(by ='UDH')

# Debugging loop to inspect each group:
#groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
#for name, data in groups:
#    print(name)
#    data['Received Date/Time'] = data['Received Date/Time'].min()
#    print(data)

# Group by the UDH prefix (all characters except the last) and the sender,
# join the Body parts and keep the latest Received Date/Time.
groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max}).reset_index()

# Alternative: group by the UDH prefix only.
#groups = df.groupby([df['UDH'].str[:-1]])
#df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max, 'Original Sender ID':min}).reset_index()

df2 = df2.sort_values('Received Date/Time')

pd.options.display.width = 200
print(df2)
Result
   UDH          Original Sender ID  Body                                Received Date/Time
2  ABC001010    GGQMS               Hi John, Can You Wait A moment?     01/02/2001 01:03:20
0  05000400011  112233445566        Whats is Carrine Doing Over There?  01/03/2001 11:16:02
3  CD10F101     zwerty              Where is Your Homework?             01/03/2001 15:22:11
4  G050078      GGQMS               John, you know Where can I get it?  01/04/2001 10:21:21
1  AACCDD5500   112233445566        Order for Pizza Now for cheap $.    01/04/2001 19:20:22
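Two caveats about this result. The dates are still strings, so `max` and `sort_values` compare them character by character; that happens to work for this sample but is not a chronological comparison in general. Also, `' '.join` concatenates the parts in row order, so the Body only comes out right if the rows are already in part order. A sketch that parses the dates and sorts by UDH first, on three deliberately shuffled rows from the sample; the format string assumes month-first dates, which the sample does not settle:

```python
import pandas as pd

# Three rows from the sample, deliberately shuffled.
df = pd.DataFrame({
    'Body': ['Over There?', 'Whats is', 'Carrine Doing'],
    'UDH': ['050004000113', '050004000111', '050004000112'],
    'Original Sender ID': ['112233445566'] * 3,
    'Received Date/Time': ['01/03/2001 11:16:02', '01/03/2001 11:16:01',
                           '01/03/2001 11:16:01'],
})

# Assumption: dates are month/day/year.
df['Received Date/Time'] = pd.to_datetime(df['Received Date/Time'],
                                          format='%m/%d/%Y %H:%M:%S')

# Sort so that the parts are joined in UDH order
# (string order, which is enough for a single-character part number).
df = df.sort_values('UDH')

key = df['UDH'].str[:-1]
df2 = (df.groupby([key, 'Original Sender ID'])
         .agg({'Body': ' '.join, 'Received Date/Time': 'max'})
         .reset_index())
print(df2)
```

With real datetimes, `'max'` returns the time the last part arrived, which matches the 'Received Date/Time' values in the desired output.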