Question

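I have a dataframe of research participants whose IDs are stored in the format "0000.000". The first four digits are the family ID number, and the last three digits are the person's individual index within the family. Most people have the suffix ".000", but some have ".001", ".002", and so on.

As a result of some inefficiencies, these numbers are stored as floats. I am trying to import them as strings so that I can use them in a join with another dataframe that is formatted correctly.

IDs that end in .000 are imported as "0000" instead of "0000.000". All the others are imported correctly.

I am trying to iterate through the IDs and append ".000" to the ones that are missing the suffix.

If I were using R, I could do it like this.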

df %>% mutate(StudyID = ifelse(length(StudyID)<5,
                               paste(StudyID,".000",sep=""),
                               StudyID))

I have worked out a Python solution (below), but it is pretty clunky.

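row = 0
for i in df["StudyID"]:
    # Append the missing suffix to IDs that were imported without it
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1  # advance the positional row counter (StudyID is column 3)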

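I think a list comprehension would be ideal, but I have not been able to find a solution that lets me iterate through the column and change one value at a time.

For example, this solution iterates and checks the logic correctly, but on each iteration it replaces every value that evaluates to True. I only want to change the value currently being evaluated.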

[i + ".000" if len(i)<5 else i for i in df["StudyID"]]

Is this possible?

Answer 1

As you said, your code does the job. Another way I can think of is:

# Start by creating a boolean mask that marks the rows you want to change
mask = df["StudyID"].str.len() < 5
# Change the values on the mask; a single .loc call avoids chained assignment
df.loc[mask, "StudyID"] += ".000"
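
For reference, here is a minimal self-contained sketch of the masked update, assuming StudyID is already a column of strings (the sample IDs are made up for illustration). It also suggests that the list comprehension from the question gives the same result once its output is assigned back to the column.

```python
import pandas as pd

# Hypothetical sample data: two IDs lost their ".000" suffix on import
df = pd.DataFrame({"StudyID": ["0000", "0001.001", "0002", "0002.002"]})

# Keep an untouched copy so the two approaches can be compared
by_comprehension = df.copy()

# Approach 1: boolean mask plus a single .loc assignment
mask = df["StudyID"].str.len() < 5
df.loc[mask, "StudyID"] = df.loc[mask, "StudyID"] + ".000"

# Approach 2: list comprehension, assigned back to the column
by_comprehension["StudyID"] = [
    i + ".000" if len(i) < 5 else i for i in by_comprehension["StudyID"]
]

print(df["StudyID"].tolist())
# ['0000.000', '0001.001', '0002.000', '0002.002']
```

Both approaches only touch the short IDs; the comprehension builds a whole new column rather than editing values one at a time.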

Answer 2

I assume length(StudyID) is meant to be nchar(StudyID), as @akrun pointed out.

You can do it the dplyr way in Python using datar:

>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>> 
>>> df = tibble(
...     StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
    StudyID
   <object>
0      0000
1      0001
2  0000.000
3  0001.001
>>> 
>>> df >> mutate(StudyID=if_else(
...   nchar(f.StudyID) < 5,
...   paste(f.StudyID, ".000", sep=""), 
...   f.StudyID
... ))
    StudyID
   <object>
0  0000.000
1  0001.000
2  0000.000
3  0001.001

Disclaimer: I am the author of the datar package.
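
For readers who would rather not add a dependency, roughly the same conditional can be sketched with numpy.where, which plays the role of if_else here. This is an alternative to the datar call, not part of it, and it assumes the column holds strings; the sample frame mirrors the one above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"StudyID": ["0000", "0001", "0000.000", "0001.001"]})

# np.where(condition, value_if_true, value_if_false) works element-wise
df["StudyID"] = np.where(
    df["StudyID"].str.len() < 5,  # analogous to nchar(StudyID) < 5
    df["StudyID"] + ".000",       # analogous to paste(StudyID, ".000", sep="")
    df["StudyID"],                # otherwise keep the ID unchanged
)

print(df["StudyID"].tolist())
# ['0000.000', '0001.000', '0000.000', '0001.001']
```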

Answer 3

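In the end, I needed to do this for several different dataframes, so I defined a function to solve the problem and applied it to each one.

I thought the list comprehension idea would become too complex, and possibly too hard to follow in review, so I stuck with a plain for loop.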

def create_multi_index(data, col_to_split, sep = "."):
    """
    Loop through the original ID column and split each ID into multiple
    parts (multi-IDs) on the given separator. The new multi-IDs are
    collected in lists and returned as new columns.

    By default, the function assumes the unique ID is formatted like a
    decimal number. If the original ID was formatted like an integer
    rather than a decimal, the latter half of the ID is assumed to be "000".
    """

    # Take a copy of the dataframe so the caller's object is never modified
    new_df = data.copy()

    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []

    # iterate through the IDs, split and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:

        i = i.split(sep)

        Family_ID.append(i[0])

        if len(i)==1:
            Family_Index.append("000")
        else: 
            Family_Index.append(i[1])

    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID = Family_ID,
                         Family_Index = Family_Index)

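This returns a copy of the dataframe with a new column for each part of the multi-ID.

When joining dataframes that use IDs in this form, these columns can be passed to pd.merge as follows, as long as both dataframes have the multi-index in the same format: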
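
The same split can probably also be done without an explicit loop. The sketch below is an alternative, not the function above: it uses Series.str.partition and assumes every ID is a string containing at most one separator. The function name split_study_id and the sample data are hypothetical.

```python
import pandas as pd

def split_study_id(data, col_to_split, sep="."):
    """Hypothetical vectorized variant of create_multi_index."""
    # partition() yields three columns: text before sep, sep itself, text after sep
    parts = data[col_to_split].str.partition(sep)
    # IDs without a separator leave the third column empty; default those to "000"
    family_index = parts[2].where(parts[2] != "", "000")
    # assign() returns a new dataframe and leaves the input untouched
    return data.assign(Family_ID=parts[0], Family_Index=family_index)

df = pd.DataFrame({"StudyID": ["0000", "0001.001", "0002"]})
result = split_study_id(df, "StudyID")
print(result)
```

Whether this is easier to review than the loop is a matter of taste; the loop makes the fallback to "000" more explicit.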

pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])
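
A small end-to-end sketch of that join, using made-up frames df1 and df2 whose Family_ID and Family_Index columns are already strings in the same format. The Age and Score columns are hypothetical and only there to show which rows survive.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Family_ID": ["0000", "0001", "0001"],
    "Family_Index": ["000", "000", "001"],
    "Age": [34, 41, 12],
})
df2 = pd.DataFrame({
    "Family_ID": ["0001", "0001", "0002"],
    "Family_Index": ["000", "001", "000"],
    "Score": [7.5, 9.0, 6.0],
})

# An inner join keeps only the (Family_ID, Family_Index) pairs present in both frames
merged = pd.merge(df1, df2, how="inner", on=["Family_ID", "Family_Index"])
print(merged)
```

Here only the two members of family 0001 appear in both frames, so merged has two rows. The key columns should share the same dtype in both frames (strings here), otherwise the keys will not match as intended.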

Iterate over and conditionally append a string value in a Pandas DataFrame