Question

I have a pandas DataFrame df:

import pandas as pd

df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
              "type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
              "F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})

It looks like this:

      F_ID  ID type
0        0   2    A
1  [7 8 9]   3    B
2     [10]   4    B
3        0   5    A
4      [2]   6    A
5        0   7    B
6        0   8    A
7        0   9    A
8        0  10    A

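Here, F_ID is a column that, based on some calculation, tells which records match the record in that row. It gives the matching ID values. So ID 3 matches IDs 7, 8 and 9.

I want a list of all type B IDs together with their related records. The matching IDs mentioned in the F_ID column should go into separate columns, and the number of those columns can vary with the values, as shown below: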

ID  type  F_ID_1  F_ID_2
3   B     8       9
4   B     10
7   B

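I do not want to list F_ID values that are themselves type B. For example, ID 3 has 7, 8 and 9 as matching IDs, but since ID 7 is type B it should not be listed as an F_ID; only 8 and 9 should be listed.

How can I do this in Python with pandas?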

Answer 1

If I understand you correctly, F_ID is a string representation of a list?

If so, convert it to an actual list:

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [2,3,4,5,6,7,8,9,10],
      "type" :["A", "B", "B", "A", "A", "B", "A", "A", "A"],
      "F_ID" :["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"]})

# convert the string representations of list structures to actual lists
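# regex=False treats "[" and "]" as literal characters rather than regex syntax
F_ID_as_series_of_lists = df["F_ID"].str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(" ")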

#type(F_ID_as_series_of_lists) is pd.Series, make it a list for pd.DataFrame.from_records
F_ID_as_records = list(F_ID_as_series_of_lists)

f_id_df = pd.DataFrame.from_records(list(F_ID_as_records)).fillna(np.nan)
f_id_df
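As an alternative to the chained replace calls, here is a minimal sketch that extracts the IDs with Series.str.findall. It assumes every ID inside F_ID is a plain run of digits, and the variable names are only illustrative.

```python
import pandas as pd

# Same F_ID strings as in the question.
f_id = pd.Series(["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"])

# Pull out every run of digits; brackets and spaces are simply ignored.
f_id_lists = f_id.str.findall(r"\d+")

# Convert the digit strings to integers so they compare cleanly with the ID column.
f_id_ints = f_id_lists.apply(lambda ids: [int(i) for i in ids])

print(f_id_ints.tolist())
```

Working with integers avoids mixing string F_ID values with the integer ID column later on.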

Now let's combine the split F_ID with the original DataFrame:

combined_df = df.merge(f_id_df, left_index = True, right_index = True, how = "inner")
combined_df = combined_df.drop("F_ID", axis = 1).sort_values(["type", "ID"])
combined_df

However, we need to ignore F_ID values that appear as an ID of the same type. For example, 7 is an ID with type == "B", so we want it excluded from F_ID even though it is in the list for ID == 3.

To do this, we create a mapping from ID / type to F_ID.

mapping_df = pd.DataFrame(combined_df.set_index(["ID", "type"]).stack()).reset_index().drop("level_2", axis = 1)
mapping_df.columns = ["ID", "type", "F_ID"]
mapping_df

Now, to filter, we could probably do some impressive join, but a query is easier to read should we ever have to come back to this example:

def is_fid_of_same_type(row, df):
    query = "ID == {row_fid} & type == '{row_type}'".format(
        row_fid = row["F_ID"], row_type = row["type"]
    )
    matches_df = df.query(query)
    row["fid_in_type_id"] = len(matches_df) > 0
    return row

Now apply this function to each row, and keep only the rows whose F_ID does not appear as an ID of the same type.

df = mapping_df.apply(lambda row: is_fid_of_same_type(row, mapping_df), axis = 1)
df = df[df["fid_in_type_id"] == False].drop("fid_in_type_id", axis = 1)
df

Then, to get F_ID as a list rather than as separate rows, use DataFrame.groupby() and aggregate the F_ID values for each ID and type.
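The grouping step can be sketched as follows. This is not the code from the answer above: it is a self-contained variant that assumes pandas 0.25 or later (for explode), builds the long format with explode instead of stack, filters with map instead of query, and then widens the lists into the F_ID_1, F_ID_2 columns the question asks for. All intermediate names (long_df, type_by_id, filtered, lists, wide, result) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "type": ["A", "B", "B", "A", "A", "B", "A", "A", "A"],
    "F_ID": ["0", "[7 8 9]", "[10]", "0", "[2]", "0", "0", "0", "0"],
})

# Long format: one row per (ID, type, F_ID) pair.
# The pattern skips a bare "0", which is assumed to mean "no match".
long_df = (
    df.assign(F_ID=df["F_ID"].str.findall(r"[1-9]\d*"))
      .explode("F_ID")
      .dropna(subset=["F_ID"])
      .astype({"F_ID": int})
      .reset_index(drop=True)
)

# Keep type B rows whose matched ID is not itself type B.
type_by_id = df.set_index("ID")["type"]
is_b_row = long_df["type"] == "B"
match_is_b = long_df["F_ID"].map(type_by_id) == "B"
filtered = long_df[is_b_row & ~match_is_b]

# Collapse back to one row per ID, gathering the remaining matches into a list.
lists = filtered.groupby(["ID", "type"])["F_ID"].agg(list)

# Spread each list over numbered columns; shorter lists are padded with NaN.
wide = pd.DataFrame(lists.tolist(), index=lists.index)
wide.columns = [f"F_ID_{i + 1}" for i in wide.columns]

# Left-merge so type B IDs with no remaining matches (ID 7 here) still appear.
all_b = df.loc[df["type"] == "B", ["ID", "type"]]
result = all_b.merge(wide.reset_index(), on=["ID", "type"], how="left")

print(result)
```

Because the padded cells are NaN, the F_ID_n columns come out as floats; whether to convert them to a nullable integer dtype afterwards is a presentation choice.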

This results in:
