I keep running into a simple pandas DataFrame problem; maybe someone has hit this before...
Thanks in advance :)
Hello, I have two DataFrames, df1 and df2:
df1
unique_id timestamp
1 2019-01-21
2 2019-02-01
3 2019-04-05
4 2019-05-01
5 2019-05-12
... ...
df2
classification from to
A 2019-01-05 2019-02-02
B 2019-02-03 2019-02-28
C 2019-03-01 2019-04-05
D 2019-04-06 2019-05-03
E 2019-05-04 2019-05-31
... ... ...
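My goal is to compare each timestamp in df1 against every from/to date interval in df2, so that each unique_id in df1 gets the corresponding classification from df2.
I am trying something like this:
df1.loc[(df1['timestamp'] > df2['from']) & (df1['timestamp'] < df2['to']), 'class'] = df2['classification']
This always raises ValueError: Can only compare identically-labeled Series objects, even though both datetime dtypes are exactly the same, datetime64[ns] ...
Expected output: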
unique_id timestamp classification
1 2019-01-21 A
2 2019-02-01 A
3 2019-04-05 C
4 2019-05-01 D
5 2019-05-12 E
... ... ...
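Answer 0: (score: 0)
What I would personally do is convert the timestamps to Unix timestamps.
from time import mktime
# Convert each Timestamp to an integer Unix timestamp (local time) and store it back
df1['timestamp'] = df1['timestamp'].apply(lambda ts: int(mktime(ts.timetuple())))
Do the same for df2 to get your start and end timestamps, so that you can use the df1.loc[(df1['timestamp'] > df2['from']) & (df1['timestamp'] < df2['to']), 'class'] = df2['classification'] you wrote
without getting the error message.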
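As a minimal sketch of the conversion described above, the whole column can be turned into Unix seconds at once instead of row by row. The names `timestamps`, `epoch`, and `unix_seconds` are illustrative and not from the original post. This version treats the naive timestamps as UTC, whereas `mktime` interprets them in local time, so the two can differ by the local UTC offset. Converting to integers does not by itself change how pandas aligns two Series on their index labels, so it may be worth testing the comparison on a small example first.

```python
import pandas as pd

# Hypothetical sample column; in the question this would be df1["timestamp"].
timestamps = pd.Series(pd.to_datetime(["2019-01-21", "2019-02-01"]))

# Seconds since 1970-01-01 00:00:00, computed for the whole column at once.
# Naive timestamps are treated as UTC here (an assumption of this sketch).
epoch = pd.Timestamp("1970-01-01")
unix_seconds = (timestamps - epoch) // pd.Timedelta("1s")

print(unix_seconds.tolist())
```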
Answer 1: (score: 0)
Try:
import numpy as np
Now, instead of
df1['timestamp'] > df2['from']
try:
np.greater(df1['timestamp'], df2['from'])
It looks like you are trying to get a True/False answer.
You might want to take a look here: https://docs.scipy.org/doc/numpy/reference/routines.logic.html
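Answer 2: (score: 0)
You are mixing the indexes of the two DataFrames. The syntax you propose compares them row by row. You can see this if we reduce the problem to the following DataFrames (of different sizes):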
import pandas as pd
df1 = pd.DataFrame(
[[1, "2019-01-21"],
[2, "2019-02-01"],
[3, "2019-04-05"],
[4, "2019-04-05"],
[5, "2019-04-05"],
[6, "2019-04-05"],
[7, "2019-05-01"],
[8, "2019-05-12"]],
columns=["unique_id", "timestamp"])
df2 = pd.DataFrame([
["A", "2019-01-05", "2019-02-02"],
["D", "2019-04-06", "2019-05-03"],
["C", "2019-03-01", "2019-04-05"],
["B", "2019-02-03", "2019-02-28"],
["E", "2019-05-04", "2019-05-31"],],
columns=["classification", "from", "to"])
# Comparison of the two different DataFrames
print(df1['timestamp'] > df2['from'])
raises the error:
ValueError: Can only compare identically-labeled Series objects
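To illustrate that the error comes from the index labels rather than from the dtype, here is a small sketch using two integer Series of different lengths. The names `left`, `right`, `error_message`, and `mask` are made up for this example.

```python
import pandas as pd

# Same dtype (int64), but different index labels: 0..2 versus 0..1.
left = pd.Series([1, 2, 3])
right = pd.Series([0, 5])

try:
    left > right  # element-wise comparison requires identical labels
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Comparing against a single scalar value broadcasts and works fine.
mask = left > right.iloc[0]
print(mask.tolist())
```

This is why comparing one df1 timestamp at a time against the df2 columns works, while comparing the two columns directly does not.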
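Here, you want to compare based on the matching datetime interval, so you need to handle the two DataFrames separately. To convert the string data to dates, pandas.to_datetime does the job (doc).
Here is one way to do it: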
# import modules
import pandas as pd
df1 = pd.DataFrame(
[[1, "2019-01-21"],
[2, "2019-02-01"],
[3, "2019-04-05"],
[4, "2019-04-05"],
[5, "2019-04-05"],
[6, "2019-04-05"],
[7, "2019-05-01"],
[8, "2019-05-12"]],
columns=["unique_id", "timestamp"])
df2 = pd.DataFrame([
["A", "2019-01-05", "2019-02-02"],
["D", "2019-04-06", "2019-05-03"],
["C", "2019-03-01", "2019-04-05"],
["B", "2019-02-03", "2019-02-28"],
["E", "2019-05-04", "2019-05-31"],],
columns=["classification", "from", "to"])
# convert to datetime
df1["timestamp"] = pd.to_datetime(df1["timestamp"], format="%Y-%m-%d")
df2[["from", "to"]] = df2[["from", "to"]].apply(pd.to_datetime, format="%Y-%m-%d")
# Try to compare 2 different dataframes
# print((df1['timestamp'] > df2['from']))
# For each df1 row, take the classification whose [from, to] interval contains the timestamp
class_column = []
for index, row in df1.iterrows():
    class_fd2 = df2[(df2["from"] <= row["timestamp"]) & (df2["to"] >= row["timestamp"])]["classification"].values[0]
    class_column.append(class_fd2)
df1["class1"] = class_column
print(df1)
# unique_id timestamp class1
# 0 1 2019-01-21 A
# 1 2 2019-02-01 A
# 2 3 2019-04-05 C
# 3 4 2019-04-05 C
# 4 5 2019-04-05 C
# 5 6 2019-04-05 C
# 6 7 2019-05-01 D
# 7 8 2019-05-12 E
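You can also put this in a function and apply it to df1:
def set_class(row):
    # Return the classification of the df2 interval that contains this row's timestamp
    return df2[(df2["from"] <= row["timestamp"]) & (
        df2["to"] >= row["timestamp"])]["classification"].values[0]
# Apply the function row by row
df1["class2"] = df1.apply(set_class, axis=1)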
df1["class2"] = df1.apply(set_class, axis=1)
print(df1)
# unique_id timestamp class1 class2
# 0 1 2019-01-21 A A
# 1 2 2019-02-01 A A
# 2 3 2019-04-05 C C
# 3 4 2019-04-05 C C
# 4 5 2019-04-05 C C
# 5 6 2019-04-05 C C
# 6 7 2019-05-01 D D
# 7 8 2019-05-12 E E
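Both versions above use `.values[0]`, which raises an IndexError when a timestamp falls in no interval. As a possible alternative, here is a sketch using `pd.merge_asof`, which avoids the Python-level loop and leaves unmatched rows empty. It assumes the intervals in df2 do not overlap. The names `matched`, `inside`, and `result` are illustrative, and the 2019-06-15 timestamp is made up to show the unmatched case.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "timestamp": pd.to_datetime(
        ["2019-01-21", "2019-02-01", "2019-04-05", "2019-05-01", "2019-06-15"]),
})
df2 = pd.DataFrame({
    "classification": ["A", "D", "C", "B", "E"],
    "from": pd.to_datetime(
        ["2019-01-05", "2019-04-06", "2019-03-01", "2019-02-03", "2019-05-04"]),
    "to": pd.to_datetime(
        ["2019-02-02", "2019-05-03", "2019-04-05", "2019-02-28", "2019-05-31"]),
})

# merge_asof requires both frames to be sorted on their join keys.
# The default direction="backward" picks the last df2 row with from <= timestamp.
matched = pd.merge_asof(
    df1.sort_values("timestamp"),
    df2.sort_values("from"),
    left_on="timestamp",
    right_on="from",
)

# Keep the classification only if the timestamp is also on or before "to".
inside = matched["timestamp"] <= matched["to"]
matched["classification"] = matched["classification"].where(inside)

result = matched[["unique_id", "timestamp", "classification"]]
print(result)
```

The last row (2019-06-15) lies after every interval, so its classification is left missing instead of raising an error.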