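I am trying to import a CSV file of a chat log, split it by date first, and then split each day's chat into multiple chunks whenever a condition between two consecutive rows is met.
Then I want to put all of this into a dictionary where the key is a date and the value is the list of chunks for that date.
This is what I have done so far.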
import pandas as pd
from datetime import datetime
# import csv of chatlog
ktlk_csv = pd.read_csv(r'''C:\Users\Jaepil\PycharmProjects\test_pycharm/5years.csv''', encoding="utf-8")
df = pd.DataFrame(ktlk_csv)
# Date column is str type. Change it into timestamp so I can later calculate diff between two rows.
df["Date"] = pd.to_datetime(df["Date"])
# criteria to separate chunks.
chunk_tolerance = 900 # chat stopped more than 900 seconds
chunk_min = 5 # chat less than 5 lines is not a chunk.
# First split the entire chat by day and put it in a list.
df_byDate = []
for group in df.groupby(lambda x: df["Date"][x].day):
    df_byDate.append(group)
df["time_diff"] = df["Date"].diff()
What I have in mind (pseudocode):
chatChunks_byDate = {}
for table_by_day in df_byDate:
    list_of_chunks = table_by_day.split?(condition: table_by_day["time_diff"] <= 900 (i.e., chunk_tolerance))
    list_of_chunks = [ x for x in list_of_chunks if not len(x.index) < 5 (i.e., chunk_min) ]
    chatChunks_byDate[ (date of table_by_day) ] = list_of_chunks
So the result would look like:
> chatChunks_byDate = {"Dec 12": [list of chunks for that day], "Dec 14": [list of chunks for that day], ...}
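To make the target concrete, here is a minimal runnable sketch of what the pseudocode seems to describe. It is only one possible approach, not a confirmed solution: the function name `chunk_chat_by_date` and the tiny `demo` frame are made up for illustration, it groups on the full calendar date rather than the day of the month, and it assumes a new chunk should start whenever the gap to the previous message of the same day exceeds `chunk_tolerance` seconds.

```python
import pandas as pd

def chunk_chat_by_date(df, chunk_tolerance=900, chunk_min=5):
    """Return {date: [DataFrame chunk, ...]} for a chat log with a datetime "Date" column."""
    df = df.sort_values("Date").reset_index(drop=True)
    chunks_by_date = {}
    # Group on the full calendar date so the same day number in different months stays separate.
    for day, day_df in df.groupby(df["Date"].dt.date):
        # Gap in seconds to the previous message of the same day (NaN for the first row).
        gap = day_df["Date"].diff().dt.total_seconds()
        # A new chunk starts wherever the gap exceeds the tolerance; cumsum turns that into chunk ids.
        chunk_id = (gap > chunk_tolerance).cumsum()
        chunks = [chunk for _, chunk in day_df.groupby(chunk_id)]
        # Keep only chunks with at least chunk_min rows.
        chunks_by_date[day] = [c for c in chunks if len(c) >= chunk_min]
    return chunks_by_date

# Hypothetical sample data, only to show the shape of the result.
demo = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-09-05 19:25:46", "2017-09-05 19:25:47", "2017-09-05 19:29:16",
        "2017-09-05 21:00:00", "2017-09-06 08:00:00", "2017-09-06 08:01:00",
    ]),
    "User": list("ababab"),
    "Message": ["m1", "m2", "m3", "m4", "m5", "m6"],
})
result = chunk_chat_by_date(demo, chunk_tolerance=900, chunk_min=2)
sizes = {str(day): [len(c) for c in chunks] for day, chunks in result.items()}
print(sizes)
```

In this sketch the keys are `datetime.date` objects rather than strings like "Dec 12"; they could be formatted with `strftime` if string keys are preferred.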
I tried printing some of the above to figure it out, but:
print(df.columns)
> Index(['Date', 'User', 'Message', 'time_diff'], dtype='object')
I can see that the 'time_diff' column was created successfully.
print(type(df_byDate[0]))
> <class 'tuple'>
But why is it a tuple? I expected it to be a DataFrame.
print(df_byDate[0])
>>
(5, Date User Message
0 2017-09-05 19:25:46 권문광 권문광 invited 전은영 and.
1 2017-09-05 19:25:47 권문광 졸사찍자 졸사
2 2017-09-05 19:29:16 전은영 ㅌㅌㅌㅌㅌㅌㅋㅋㅋ
.
.
.
What is that 5 at tuple[0]? [1] seems to be the DataFrame I am looking for, but what is the value at [0]?
A lot of this confuses me.
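For reference, the tuple behavior can be reproduced on a toy frame. This is a small sketch with made-up data (`toy` is hypothetical, not the real chat log): iterating over a pandas `groupby` yields `(key, sub-DataFrame)` pairs, and with the day-of-month lambda used above the key is the integer day.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the chat log.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-05 19:25:46", "2017-09-05 19:25:47", "2017-10-06 08:00:00"]),
    "Message": ["a", "b", "c"],
})

# Iterating a groupby yields (key, sub-DataFrame) pairs, so appending the loop variable stores tuples.
pairs = list(toy.groupby(lambda i: toy["Date"][i].day))
first_key, first_group = pairs[0]
print(type(pairs[0]))  # <class 'tuple'>
print(first_key)       # 5, the day of the month the lambda returned for this group

# Unpacking in the loop keeps only the DataFrame part of each pair.
frames = [group for _, group in toy.groupby(toy["Date"].dt.date)]
```

Note that grouping on `.day` alone would merge, say, September 5 and October 5 into one group; the last line groups on the full date instead, which is an assumption about what is wanted here.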