Question

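My goal is to group the rows of a CSV file by a column value, and also to perform the reverse operation. As an example, I would like to be able to convert back and forth between these two formats: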

uniqueId, groupId, feature_1, feature_2
1, 100, text of 1, 10
2, 100, some text of 2, 20
3, 200, text of 3, 30
4, 200, more text of 4, 40
5, 100, another text of 5, 50

Grouped on groupId:

uniqueId, groupId, feature_1, feature_2
1|2|5, 100, text of 1|some text of 2|another text of 5, 10|20|50
3|4, 200, text of 3|more text of 4, 30|40

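Assume the delimiter (here |) does not occur anywhere in the data.

I am trying to use Pandas to perform this conversion. So far my code can access the cells of the rows grouped by groupId, but I don't know how to populate the new DataFrame.

How can I complete my method so that it produces the desired new df?

And how would the reverse method convert the new df back into the original one?

If R is a better tool for this job, I am also open to suggestions in R.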

import pandas as pd

def getGroupedDataFrame(df, groupByField, delimiter):
    '''Create a df with the rows grouped on groupByField, values separated by delimiter.'''
    # unique() keeps first-appearance order; newer pandas rejects a set as an index
    groupIds = df[groupByField].unique()
    df_copy = pd.DataFrame(index=groupIds, columns=df.columns)
    # iterate over the different groupIds
    for groupId in groupIds:
        groupRows = df.loc[df[groupByField] == groupId]
        # for all rows of the groupId
        for index, row in groupRows.iterrows():
            # for all columns in the df
            for column in df.columns:
                print(row[column])
                # this prints the value of the cell
                # here append row[column] to its cell in the df_copy row of groupId, separated by delimiter

Answer 1

To perform the grouping, you can groupby on 'groupId', and then within each group join the values of each column with the given delimiter:

def group_delim(grp, delim='|'):
    """Join each column within a group by the given delimiter."""
    return grp.apply(lambda col: delim.join(col))

# Make sure the DataFrame consists of strings, then apply grouping function.
grouped = df.astype(str).groupby('groupId').apply(group_delim)

# Drop the grouped groupId column, and replace it with the index groupId.
grouped = grouped.drop('groupId', axis=1).reset_index()

Grouped output:

  groupId uniqueId                                   feature_1 feature_2
0     100    1|2|5  text of 1|some text of 2|another text of 5  10|20|50
1     200      3|4                    text of 3|more text of 4     30|40
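As a variation on the same idea, the join can also be written with groupby(...).agg, which leaves the grouping column in the index and so sidesteps the drop step (the treatment of the grouping column inside groupby.apply has varied between pandas versions). This is only a sketch: the helper name group_rows and the inline sample frame are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data mirroring the question (built inline so the sketch is self-contained).
df = pd.DataFrame({
    "uniqueId": [1, 2, 3, 4, 5],
    "groupId": [100, 100, 200, 200, 100],
    "feature_1": ["text of 1", "some text of 2", "text of 3",
                  "more text of 4", "another text of 5"],
    "feature_2": [10, 20, 30, 40, 50],
})

def group_rows(frame, key, delim="|"):
    """Hypothetical helper: collapse rows sharing `key`, joining the other columns with `delim`."""
    as_text = frame.astype(str)  # str.join needs strings
    # sort=False keeps groups in order of first appearance
    joined = as_text.groupby(key, sort=False).agg(lambda col: delim.join(col))
    return joined.reset_index()

grouped = group_rows(df, "groupId")
print(grouped)
```

Note that every column, including groupId, comes back as a string.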

The reverse process follows a similar idea, but since each row is already its own group, you can use a plain apply without a groupby:

def ungroup_delim(col, delim='|'):
    """Split elements in a column by the given delimiter, stacking columnwise"""
    return col.str.split(delim, expand=True).stack()

# Apply the ungrouping function, and forward fill elements that aren't grouped.
ungrouped = grouped.apply(ungroup_delim).ffill()

# Drop the unwieldy altered index for a new one.
ungrouped = ungrouped.reset_index(drop=True)

Ungrouping recovers the original rows, now ordered by group:

  groupId uniqueId          feature_1 feature_2
0     100        1          text of 1        10
1     100        2     some text of 2        20
2     100        5  another text of 5        50
3     200        3          text of 3        30
4     200        4     more text of 4        40
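On recent pandas versions the reverse step can also be sketched with str.split followed by a multi-column explode. The assumptions here are pandas 1.4 or later (multi-column explode arrived in 1.3, the regex argument of str.split in 1.4), a hypothetical helper name ungroup_rows, and an inline copy of the grouped frame.

```python
import pandas as pd

# Grouped frame as shown above (rebuilt inline so the sketch is self-contained).
grouped = pd.DataFrame({
    "groupId": ["100", "200"],
    "uniqueId": ["1|2|5", "3|4"],
    "feature_1": ["text of 1|some text of 2|another text of 5",
                  "text of 3|more text of 4"],
    "feature_2": ["10|20|50", "30|40"],
})

def ungroup_rows(frame, key, delim="|"):
    """Hypothetical helper: split every non-key column on `delim` and emit one row per element."""
    value_cols = [c for c in frame.columns if c != key]
    split = frame.copy()
    for col in value_cols:
        # regex=False treats the delimiter literally, even if it is longer than one character
        split[col] = split[col].str.split(delim, regex=False)
    # The lists in each row have equal length, so the columns explode in lockstep;
    # the key value is repeated for every produced row.
    return split.explode(value_cols, ignore_index=True)

ungrouped = ungroup_rows(grouped, "groupId")
print(ungrouped)
```

The result keeps every value as a string and orders rows by group, so a cast with astype and a sort on uniqueId would still be needed to match the original frame exactly.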

To use a different delimiter, just pass delim as an argument to apply:

foo.apply(group_delim, delim=';')
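To illustrate a non-default delimiter end to end, here is a rough round-trip sketch starting from the CSV text in the question. It assumes pandas 1.4 or later, reads the data through io.StringIO in place of a real file, and uses agg and explode rather than the apply-based functions above. The names csv_text, original and restored are illustrative only.

```python
import io
import pandas as pd

csv_text = """uniqueId, groupId, feature_1, feature_2
1, 100, text of 1, 10
2, 100, some text of 2, 20
3, 200, text of 3, 30
4, 200, more text of 4, 40
5, 100, another text of 5, 50
"""
# skipinitialspace drops the blank that follows each comma in this file
original = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

delim = ";"

# Forward: one row per groupId, values joined with the chosen delimiter.
grouped = (original.astype(str)
           .groupby("groupId", sort=False)
           .agg(lambda col: delim.join(col))
           .reset_index())

# Reverse: split on the same delimiter, explode, then restore dtypes and row order.
value_cols = [c for c in grouped.columns if c != "groupId"]
restored = grouped.copy()
for col in value_cols:
    restored[col] = restored[col].str.split(delim, regex=False)
restored = (restored.explode(value_cols, ignore_index=True)
            .astype(original.dtypes.to_dict())
            .sort_values("uniqueId", ignore_index=True)
            [original.columns])

# Raises an AssertionError if the round trip did not reproduce the input.
pd.testing.assert_frame_equal(restored, original)
```

Sorting on uniqueId restores the row order only because the ids in this example happen to be increasing. For data without such a column, the original row position would have to be carried along as an extra column.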

As a side note, iterating over DataFrames is generally very slow. Whenever possible, you will want to use a vectorized approach like the one above.

Answer 2

A solution in R:

I define the initial data frame (for clarity):

df <- data.frame(uniqueID = c(1,2,3,4,5),
           groupID = c(100,100,200,200,100),
           feature_1 = c("text of 1","some text of 2",
                       "text of 3", "more text of 4",
                       "another text of 5"),
           feature_2 = c(10,20,30,40,50), stringsAsFactors = F)

To get the grouped data frame:

# Group and summarise using dplyr
library(dplyr)
grouped <- df %>% group_by(groupID) %>% summarise_each(funs(paste(.,collapse = "|")))

Output:

grouped

 groupID uniqueID                                  feature_1 feature_2
    (dbl)    (chr)                                      (chr)     (chr)
1     100    1|2|5 text of 1|some text of 2|another text of 5  10|20|50
2     200      3|4                   text of 3|more text of 4     30|40

To ungroup and get back the original data frame:

library(stringr)
apply(grouped, 1, function(x)  {

        temp <- data.frame(str_split(x, '\\|'), stringsAsFactors = F)
        colnames(temp) <- names(x)
        temp

        }) %>%
        bind_rows()

Output:

  groupID uniqueID         feature_1 feature_2
    (chr)    (chr)             (chr)     (chr)
1     100        1         text of 1        10
2     100        2    some text of 2        20
3     100        5 another text of 5        50
4     200        3         text of 3        30
5     200        4    more text of 4        40