Question

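I am fairly new to pandas (I have only been using it for a few days) and am still learning and exploring it. I have a large CSV file with hundreds of thousands of rows. My goal is to concatenate multiple rows into a single row based on multiple columns, most importantly by referring to the date/time, keeping the later date/time in the result. My CSV file is illustrated below.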

       Body                      UDH               Original Sender ID           Received Date/Time
Hi John, Can You            ABC0010101                  GGQMS                   01/02/2001 01:03:19
Wait A moment?              ABC0010102                  GGQMS                   01/02/2001 01:03:20
Whats is                    050004000111              112233445566              01/03/2001 11:16:01
Carrine Doing               050004000112              112233445566              01/03/2001 11:16:01
Over There?                 050004000113              112233445566              01/03/2001 11:16:02
Where is                    CD10F1011                   zwerty                  01/03/2001 15:22:10
Your Homework?              CD10F1012                   zwerty                  01/03/2001 15:22:11
Order for Pizza             AACCDD55001               112233445566              01/04/2001 19:20:21
Now for cheap $.            AACCDD55002               112233445566              01/04/2001 19:20:22
John, you know              G0500781                    GGQMS                   01/04/2001 10:21:21
Where can I get it?         G0500782                    GGQMS                   01/04/2001 10:21:21

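Shown above is my CSV file. The UDH acts as the primary key here: its characters from the first to the second-to-last identify which message a Body belongs to. The other part is the Received Date/Time, where the second part of a body arrives 1 second, or possibly more than 1 second, later.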
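
The key described above can be sketched as follows. This is only an illustration: it assumes the last UDH character is a single-digit part number, and the column names `message_key` and `part_no` are hypothetical, not part of the original file.

```python
import pandas as pd

# Sample rows taken from the table above (one three-part message, shuffled).
df = pd.DataFrame({
    'Body': ['Over There?', 'Whats is', 'Carrine Doing'],
    'UDH': ['050004000113', '050004000111', '050004000112'],
})

# Assumption: the final character is a single-digit part number.
df['message_key'] = df['UDH'].str[:-1]          # hypothetical column name
df['part_no'] = df['UDH'].str[-1].astype(int)   # hypothetical column name

# Sort so the parts are in order before joining them.
ordered = df.sort_values(['message_key', 'part_no'])
joined = ordered.groupby('message_key')['Body'].agg(' '.join)
print(joined)
```

Sorting by the part number first keeps the joined text in order even when the rows of the file are not.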

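I managed to concatenate the bodies; however, some bodies consist of three parts, and those I could not concatenate completely.

Below is my current code: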

 def problem3():
    filep2 = pd.read_csv(r'/Users/John/Downloads/Practice1/my_r.csv')

    #data cleaning
    filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')
    filep2['UDH']=filep2['UDH'].astype(object)
    filep2['Original Sender ID']=filep2['Original Sender ID'].astype(object)
    filep2['Account User Name']=filep2['Account User Name'].astype(object)
    filep2['Body']=filep2['Body'].astype(str)
    filep2['UDH']=filep2['UDH'].str.strip()
    df = pd.DataFrame(filep2)

    #Filter null row in UDH column
    df=df[df['UDH'].notnull()]
    df=df.sort_values(by ='UDH')

    df['Body'] = df.apply(multiple_condition, axis=1)    
    df.to_csv(r'/Users/John/Downloads/Practice1/my_c.csv', index=False, header=True) 

def multiple_condition (df):
    if (df['UDH'].str.len() == 8):
         df=df.groupby(df[['UDH'].str[:7],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
         return df
    elif (df['UDH'].str.len() == 9):
         df= df.groupby(df[['UDH'].str[:8],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index() 
         return df
    elif (df['UDH'].str.len() == 10):
         df= df.groupby(df[['UDH'].str[:9],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index()
         return df
    elif (df['UDH'].str.len() == 11):
         df=df.groupby(df[['UDH'].str[:10],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index() 
         return df
    elif (df['UDH'].str.len() == 12):
         df=df.groupby(df[['UDH'].str[:11],'Original Sender ID','Received Date/Time'])['Body'].apply(' '.join).reset_index() 
         return df

The code above raises the error stated in this question's title. The error message is shown below:

Updated error message

  Traceback (most recent call last):

  File "<ipython-input-85-8ca58b5f49ad>", line 1, in <module>
    runfile('/Users/syafiq/Downloads/RoutingPractice01.py', wdir='/Users/syafiq/Downloads')

  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/John/Downloads/RoutingPractice01.py", line 79, in <module>
    problem3()

  File "/Users/John/Downloads/RoutingPractice01.py", line 35, in problem3
    filep2['Received Date/Time']= filep2['Received Date/Time'].astype('datetime64[ns]')

  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
    indexer = self.columns.get_loc(key)

  File "/Users/John/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc

  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc

  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'Received Date/Time'
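
A KeyError like the one above means the label was not found among the DataFrame's columns. One possible cause is stray whitespace in the header row, or a delimiter other than the one `read_csv` expects. A minimal sketch for checking this follows. The CSV text in it is hypothetical, and it assumes (unconfirmed) that the real header cells carry extra spaces and that dates are month/day/year.

```python
import io
import pandas as pd

# Hypothetical CSV whose header cells carry stray spaces
# (an assumption, not the asker's real file).
raw = "Body, UDH , Received Date/Time \nHi,ABC0010101,01/02/2001 01:03:19\n"
df = pd.read_csv(io.StringIO(raw))

# Inspect the exact labels, including any whitespace.
print(df.columns.tolist())

# Normalise the labels, after which the lookup succeeds.
df.columns = df.columns.str.strip()

# Assumption: dates are month/day/year.
received = pd.to_datetime(df['Received Date/Time'],
                          format='%m/%d/%Y %H:%M:%S')
print(received)
```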

Below is the desired output:

                Body                   Original Sender ID              Received Date/Time
Hi John, Can You Wait A Moment?             GGQMS                      01/02/2001 01:03:20
What Is carbine doing over there?        112233445566                  01/03/2001 11:16:02
Where is your homework?                    zwerty                      01/03/2001 15:22:11
Order for Pizza Now for cheap $          112233445566                  01/04/2001 19:20:22
John, you know where can I get it?          GGQMS                      01/04/2001 10:21:21

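Note: I have tried multiple ways to get the desired output above, but I still cannot solve it and errors keep occurring. I have tried countless times with different approaches, but still no dice; I keep hitting a wall. The UDH is the identifier used to group the bodies.

I am still new to pandas and have not touched Python for a while. I would appreciate it if anyone could point out where I went wrong. Any help in getting my desired output would also be greatly appreciated.

Thank you very much! :)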

Answer 1

Without apply(), using groupby() directly gives (more or less) the expected result:

groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])

df2 = groups.agg({'Body': ' '.join, 'Received Date/Time': 'max'}).reset_index()
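
A likely reason the row-wise apply() in the question fails: with axis=1 the function receives one row at a time, so `row['UDH']` is a plain Python string, which has no `.str` accessor. A small sketch on made-up two-row data (the helper name `inspect_row` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'UDH': ['G0500781', 'G0500782'],
    'Body': ['John, you know', 'Where can I get it?'],
})

def inspect_row(row):
    # With axis=1, `row` is a Series holding one record,
    # so row['UDH'] is a plain str, not a Series.
    return type(row['UDH']).__name__

kinds = df.apply(inspect_row, axis=1)
print(kinds.tolist())

# A plain str has no `.str` accessor; the column (a Series) does.
has_accessor = hasattr(df.loc[0, 'UDH'], 'str')
keys = df['UDH'].str[:-1]
print(has_accessor, keys.tolist())
```

This is why the grouping key is built from the whole column, `df['UDH'].str[:-1]`, rather than inside a per-row function.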

I use io.StringIO() only to simulate a file.

text = '''       Body                      UDH               Original Sender ID           Received Date/Time
Hi John, Can You            ABC0010101                  GGQMS                   01/02/2001 01:03:19
Wait A moment?              ABC0010102                  GGQMS                   01/02/2001 01:03:20
Whats is                    050004000111              112233445566              01/03/2001 11:16:01
Carrine Doing               050004000112              112233445566              01/03/2001 11:16:01
Over There?                 050004000113              112233445566              01/03/2001 11:16:02
Where is                    CD10F1011                   zwerty                  01/03/2001 15:22:10
Your Homework?              CD10F1012                   zwerty                  01/03/2001 15:22:11
Order for Pizza             AACCDD55001               112233445566              01/04/2001 19:20:21
Now for cheap $.            AACCDD55002               112233445566              01/04/2001 19:20:22
John, you know              G0500781                    GGQMS                   01/04/2001 10:21:21
Where can I get it?         G0500782                    GGQMS                   01/04/2001 10:21:21'''

import pandas as pd
import io

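# A regex separator needs the python parsing engine; the raw string avoids an invalid-escape warning.
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')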

#df['Received Date/Time'] = df['Received Date/Time'].astype('datetime64[ns]')
#df['UDH'] = df['UDH'].astype(object)
#df['Original Sender ID'] = df['Original Sender ID'].astype(object)
#df['Account User Name'] = df['Account User Name'].astype(object)
#df['Body'] = df['Body'].astype(str)
#df['UDH'] = df['UDH'].str.strip()

#Filter null row in UDH column
#df = df[df['UDH'].notnull()]
#df = df.sort_values(by ='UDH')

#groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
#for name, data in groups:
    #print(name)
#    data['Received Date/Time'] = data['Received Date/Time'].min()
    #print(data)

groups = df.groupby([df['UDH'].str[:-1], 'Original Sender ID'])
df2 = groups.agg({'Body': ' '.join, 'Received Date/Time': 'max'}).reset_index()

#groups = df.groupby([df['UDH'].str[:-1]])
#df2 = groups.agg({'Body':' '.join, 'Received Date/Time':max, 'Original Sender ID':min}).reset_index()

df2 = df2.sort_values('Received Date/Time')

pd.options.display.width = 200
print(df2)

Result

           UDH Original Sender ID                                Body   Received Date/Time
2    ABC001010              GGQMS     Hi John, Can You Wait A moment?  01/02/2001 01:03:20
0  05000400011       112233445566  Whats is Carrine Doing Over There?  01/03/2001 11:16:02
3     CD10F101             zwerty             Where is Your Homework?  01/03/2001 15:22:11
4      G050078              GGQMS  John, you know Where can I get it?  01/04/2001 10:21:21
1   AACCDD5500       112233445566    Order for Pizza Now for cheap $.  01/04/2001 19:20:22
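
In the code above the dates are still strings, so the maximum is taken lexicographically. That gives the right answer for this sample, but it can go wrong across a year boundary. A hedged variant that parses the column first is sketched below. It assumes the format is month/day/year, and the two sample rows are invented for illustration.

```python
import pandas as pd

# Invented two-part message that straddles a year boundary.
df = pd.DataFrame({
    'Body': ['Part one', 'part two'],
    'UDH': ['XY0011', 'XY0012'],
    'Original Sender ID': ['GGQMS', 'GGQMS'],
    'Received Date/Time': ['12/31/2001 23:59:59', '01/01/2002 00:00:00'],
})

# As strings, '12/31/2001 ...' compares greater than '01/01/2002 ...'.
string_max = df['Received Date/Time'].max()

# Assumption: month/day/year. Parse before aggregating.
df['Received Date/Time'] = pd.to_datetime(
    df['Received Date/Time'], format='%m/%d/%Y %H:%M:%S')

out = (df.groupby([df['UDH'].str[:-1], 'Original Sender ID'], sort=False)
         .agg({'Body': ' '.join, 'Received Date/Time': 'max'})
         .reset_index())
print(out)
```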

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index 31978')
