Question

I keep running into a simple pandas DataFrame problem. Maybe someone has come across this before...

Thanks in advance :)

Hello, I have two DataFrames, df1 and df2:

df1

unique_id    timestamp
1            2019-01-21
2            2019-02-01
3            2019-04-05
4            2019-05-01
5            2019-05-12
...          ...

df2

classification     from            to
A                  2019-01-05      2019-02-02
B                  2019-02-03      2019-02-28
C                  2019-03-01      2019-04-05
D                  2019-04-06      2019-05-03
E                  2019-05-04      2019-05-31
...                ...             ...

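My goal is to compare each timestamp in df1 against every from/to date interval in df2, so that each unique_id in df1 gets the corresponding classification from df2.

I am trying something like this:

df1.loc[(df1['timestamp'] > df2['from']) & (df1['timestamp'] < df2['to']), 'class'] = df2['classification']

This always raises ValueError: Can only compare identically-labeled Series objects, even though both datetime dtypes are exactly the same, datetime64[ns] ...

Expected output: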

unique_id         timestamp        classification
1                 2019-01-21       A
2                 2019-02-01       A
3                 2019-04-05       C
4                 2019-05-01       D
5                 2019-05-12       E
...               ...              ...

Answer 1

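What I would personally do is convert the timestamps to Unix timestamps.

from time import mktime

# Replace each Timestamp with its Unix time in whole seconds (local time zone)
df1['timestamp'] = [int(mktime(ts.timetuple())) for ts in df1['timestamp']]

Do the same for df2 to get its start and end timestamps, the intent being that the expression you wrote, df1.loc[(df1['timestamp'] > df2['from']) & (df1['timestamp'] < df2['to']), 'class'] = df2['classification'], then runs without the error message.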

Answer 2

Try:
import numpy as np
Now, instead of
df1['timestamp'] > df2['from']
try:
np.greater(df1['timestamp'], df2['from'])
It looks like you are trying to get a True/False answer.
You may want to take a look here: https://docs.scipy.org/doc/numpy/reference/routines.logic.html

Answer 3

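You are mixing up the indexes of the two DataFrames. The syntax you propose compares the two columns row by row, label against label. We can see this with the following reduced DataFrames (of different sizes):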

import pandas as pd

df1 = pd.DataFrame(
    [[1, "2019-01-21"],
    [2, "2019-02-01"],
    [3, "2019-04-05"],
    [4, "2019-04-05"],
    [5, "2019-04-05"],
    [6, "2019-04-05"],
    [7, "2019-05-01"],
    [8, "2019-05-12"]],
    columns=["unique_id", "timestamp"])

df2 = pd.DataFrame([
    ["A", "2019-01-05", "2019-02-02"],
    ["D", "2019-04-06", "2019-05-03"],
    ["C", "2019-03-01", "2019-04-05"],
    ["B", "2019-02-03", "2019-02-28"],
    ["E", "2019-05-04", "2019-05-31"],],
    columns=["classification", "from", "to"])

# Comparison of columns from two different dataframes
print((df1['timestamp'] > df2['from']))

This raises the error:

ValueError: Can only compare identically-labeled Series objects
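
To make the cause of that error concrete, here is a minimal sketch with two tiny Series that are not taken from the question (the names ts and start and their values are made up for illustration). It suggests that the error comes from the index labels not matching, not from the dtype, and that comparing against one scalar bound at a time is allowed.

```python
import pandas as pd

# Two datetime64 Series with different index labels (0..2 versus 0..1)
ts = pd.Series(pd.to_datetime(["2019-01-21", "2019-02-01", "2019-04-05"]))
start = pd.Series(pd.to_datetime(["2019-01-05", "2019-02-03"]))

try:
    ts > start  # Series-to-Series comparison needs identical labels
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)

# Comparing the whole Series against a single scalar bound works
one_start_mask = (ts > start.iloc[0]).tolist()
print(one_start_mask)
```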

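Here, you want to compare based on the matching datetime interval, so the two DataFrames have to be handled separately: each df1 timestamp is looked up against the df2 intervals. To convert string data to dates, pandas.to_datetime does the job (doc).

Here is one way to do it: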

# import modules
import pandas as pd

df1 = pd.DataFrame(
    [[1, "2019-01-21"],
    [2, "2019-02-01"],
    [3, "2019-04-05"],
    [4, "2019-04-05"],
    [5, "2019-04-05"],
    [6, "2019-04-05"],
    [7, "2019-05-01"],
    [8, "2019-05-12"]],
    columns=["unique_id", "timestamp"])

df2 = pd.DataFrame([
    ["A", "2019-01-05", "2019-02-02"],
    ["D", "2019-04-06", "2019-05-03"],
    ["C", "2019-03-01", "2019-04-05"],
    ["B", "2019-02-03", "2019-02-28"],
    ["E", "2019-05-04", "2019-05-31"],],
    columns=["classification", "from", "to"])

# convert to datetime
df1["timestamp"] = pd.to_datetime(df1["timestamp"], format="%Y-%m-%d")
df2[["from", "to"]] = df2[["from", "to"]].apply(pd.to_datetime, format="%Y-%m-%d")

# Try to compare 2 different dataframes
# print((df1['timestamp'] > df2['from']))

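# For each df1 row, select the df2 period that contains its timestamp
# (both bounds inclusive) and keep that period's classification
class_column = []
for index, row in df1.iterrows():
    in_period = (df2["from"] <= row["timestamp"]) & (df2["to"] >= row["timestamp"])
    class_column.append(df2.loc[in_period, "classification"].values[0])
df1["class1"] = class_column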
print(df1)
#    unique_id  timestamp class1
# 0          1 2019-01-21      A
# 1          2 2019-02-01      A
# 2          3 2019-04-05      C
# 3          4 2019-04-05      C
# 4          5 2019-04-05      C
# 5          6 2019-04-05      C
# 6          7 2019-05-01      D
# 7          8 2019-05-12      E
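
One caveat about the loop above: .values[0] raises an IndexError when a timestamp falls inside no period. The following is only a sketch of a more defensive lookup; lookup_class and the small periods frame are hypothetical names introduced for illustration, and returning None for "no match" is an assumption about what you want in that case.

```python
import pandas as pd

def lookup_class(ts, periods):
    """Hypothetical helper: classification of the period containing ts, else None."""
    mask = (periods["from"] <= ts) & (periods["to"] >= ts)
    matches = periods.loc[mask, "classification"]
    # Take the first matching period, or None when nothing matches
    return matches.iloc[0] if not matches.empty else None

periods = pd.DataFrame({
    "classification": ["A", "B"],
    "from": pd.to_datetime(["2019-01-05", "2019-02-03"]),
    "to": pd.to_datetime(["2019-02-02", "2019-02-28"]),
})

inside = lookup_class(pd.Timestamp("2019-01-21"), periods)
outside = lookup_class(pd.Timestamp("2020-01-01"), periods)
print(inside, outside)
```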

You can also do this in a function that is applied to df1:

def set_class(row):
    return df2[(df2["from"] <= row["timestamp"]) & (
        df2["to"] >= row["timestamp"])]["classification"].values[0]
# Process
df1["class2"] = df1.apply(set_class, axis=1)
print(df1)
#    unique_id  timestamp class1 class2
# 0          1 2019-01-21      A      A
# 1          2 2019-02-01      A      A
# 2          3 2019-04-05      C      C
# 3          4 2019-04-05      C      C
# 4          5 2019-04-05      C      C
# 5          6 2019-04-05      C      C
# 6          7 2019-05-01      D      D
# 7          8 2019-05-12      E      E
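
Both variants above scan df2 once per row of df1. As a possible alternative, and not something proposed in this answer, the lookup can be sketched with pandas.IntervalIndex. The sketch assumes the periods in df2 never overlap (get_indexer raises an error otherwise) and that both bounds are inclusive; the column name class3 is made up, and the data is a shortened copy of the question's tables.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "timestamp": pd.to_datetime(
        ["2019-01-21", "2019-02-01", "2019-04-05", "2019-05-01", "2019-05-12"]),
})
df2 = pd.DataFrame({
    "classification": ["A", "D", "C", "B", "E"],
    "from": pd.to_datetime(
        ["2019-01-05", "2019-04-06", "2019-03-01", "2019-02-03", "2019-05-04"]),
    "to": pd.to_datetime(
        ["2019-02-02", "2019-05-03", "2019-04-05", "2019-02-28", "2019-05-31"]),
})

# Assumption: periods do not overlap; closed="both" makes both bounds inclusive
intervals = pd.IntervalIndex.from_arrays(df2["from"], df2["to"], closed="both")

# Position of the matching period for each timestamp, or -1 when none matches
pos = intervals.get_indexer(pd.Index(df1["timestamp"]))

# Map positions back to labels; unmatched timestamps get None
labels = df2["classification"].to_numpy()
df1["class3"] = np.where(pos >= 0, labels[pos], None)
print(df1)
```

If the periods can overlap, this sketch does not apply as written and the row-wise lookups above remain the safer choice.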

Comparing a datetime DataFrame against a DataFrame of periods