How to split a pandas DataFrame at specific indices where a condition is met

Time: 2017-11-23 03:20:12

Tags: python pandas dataframe

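I am trying to import a CSV file of a chat log and split it first by date, then split each day's chat into chunks wherever a condition between two consecutive rows is met.

Then I want to put all of this into a dictionary where each key is a date and each value is the list of chunks for that date.

This is what I have so far.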

import pandas as pd
from datetime import datetime

# import csv of chatlog
ktlk_csv = pd.read_csv(r'''C:\Users\Jaepil\PycharmProjects\test_pycharm/5years.csv''', encoding="utf-8")

df = pd.DataFrame(ktlk_csv)

# Date column is str type. Change it into timestamp so I can later calculate diff between two rows. 
df["Date"] = pd.to_datetime(df["Date"])

# criteria to separate chunks. 
chunk_tolerance = 900 # chat stopped more than 900 seconds
chunk_min = 5 # chat less than 5 lines is not a chunk. 

# First split the entire chat by day and put it in a list. 
df_byDate = []
for group in df.groupby(lambda x: df["Date"][x].day):
    df_byDate.append(group)

df["time_diff"] = df["Date"].diff()

What I have in mind (pseudocode):

chatChunks_byDate = {}

for table_by_day in df_byDate:
  list_of_chunks = table_by_day.split?(condition: table_by_day["time_diff"] <= 900(i.e, chunk_tolerance):)

  list_of_chunks = [ x for x in list_of_chunks if not len(x.index) < 5(i.e, chunk_min) ] 

  chatChunks_byDate[ (date of table_by_day) ] = list_of_chunks
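One way the pseudocode above could be made concrete is to mark every row that starts a new chunk (a gap longer than the tolerance, or a change of calendar date) and take a cumulative sum of those marks as a chunk id. This is only a sketch under assumptions: the function name `split_into_chunks` and its parameters `tolerance_s` and `min_rows` are hypothetical, and it assumes the "Date" column has already been converted with `pd.to_datetime`.

```python
import pandas as pd

def split_into_chunks(df, tolerance_s=900, min_rows=5):
    """Hypothetical helper: return {date: [chunk DataFrame, ...]}."""
    df = df.sort_values("Date").reset_index(drop=True)

    # True where the pause since the previous row exceeds the tolerance.
    # The first row has a NaT difference, which compares as False.
    gap = df["Date"].diff() > pd.Timedelta(seconds=tolerance_s)

    # True where the calendar date differs from the previous row
    # (also True for the very first row).
    day = df["Date"].dt.normalize()
    new_day = day != day.shift()

    # Each True starts a new chunk, so the running sum is a chunk id.
    chunk_id = (gap | new_day).cumsum()

    result = {}
    for _, chunk in df.groupby(chunk_id):
        if len(chunk) >= min_rows:  # drop chunks that are too short
            key = chunk["Date"].iloc[0].date()
            result.setdefault(key, []).append(chunk)
    return result
```

With this sketch, a date whose chunks are all shorter than `min_rows` does not appear in the dictionary at all; if an empty list is preferred for such dates, the keys would have to be created before filtering.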

So the result would look like

  

> chatChunks_byDate = {"Dec 12": [list of chunks for that day], "Dec 14": [list of chunks for that day], ...}

I tried printing some of the above to figure it out, but:

print(df.columns)
  

> Index(['Date', 'User', 'Message', 'time_diff'], dtype='object')

I can see that the 'time_diff' column was created successfully.
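A note on what that column holds: calling `diff()` on a datetime column produces timedelta values, with a missing value in the first row, so it cannot be compared directly with a plain number of seconds such as `chunk_tolerance`. A minimal sketch of converting it to seconds first, using the three timestamps from the printed output further down as demo data (the variable names here are illustrative only):

```python
import pandas as pd

# Demo data: the first three timestamps shown in the question's output.
dates = pd.Series(pd.to_datetime([
    "2017-09-05 19:25:46",
    "2017-09-05 19:25:47",
    "2017-09-05 19:29:16",
]))

time_diff = dates.diff()                 # timedelta values, first row is NaT
seconds = time_diff.dt.total_seconds()   # floats, first row is NaN

chunk_tolerance = 900
# NaN compares as False, so the first row is not flagged as a gap.
starts_new_chunk = seconds > chunk_tolerance
```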

print(type(df_byDate[0]))
  

> <class 'tuple'>

But why is it a tuple? I expected it to be a DataFrame.

print(df_byDate[0])

>> 
(5,                    Date User                                   Message
0   2017-09-05 19:25:46  권문광                      권문광 invited 전은영 and.
1   2017-09-05 19:25:47  권문광                                   졸사찍자 졸사
2   2017-09-05 19:29:16  전은영                            ㅌㅌㅌㅌㅌㅌㅋㅋㅋ
.
.
.

What is that 5 at index [0] of the tuple? [1] seems to be the DataFrame I am looking for, but what is the value at [0]?
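For what it is worth, iterating over a pandas GroupBy object yields `(key, sub-DataFrame)` pairs, which would explain the tuple: the 5 is the group key, here the value returned by `.day`, the day of the month of 2017-09-05. Grouping by `.day` therefore also puts the 5th of every other month into the same group. A small sketch with made-up demo rows (only the first two timestamps come from the output above) that unpacks the pairs and compares grouping by day of month with grouping by calendar date:

```python
import pandas as pd

# Demo data; the third row is invented to show the month collision.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-09-05 19:25:46",
        "2017-09-05 19:25:47",
        "2017-10-05 08:00:00",
    ]),
    "Message": ["a", "b", "c"],
})

# Grouping by day of month merges 5 September and 5 October under key 5.
by_day_of_month = {key: group for key, group in df.groupby(df["Date"].dt.day)}

# Grouping by calendar date keeps them apart; keys are datetime.date objects.
by_date = {key: group for key, group in df.groupby(df["Date"].dt.date)}
```

The printed group also lacks the 'time_diff' column, which is consistent with the column being added to `df` only after the groups had been collected.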

A lot of this confuses me.

0 answers:

No answers