Question

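I want to convert a data frame that has duplicated values in one column into a consolidated data frame, by collapsing the values in the "value" column into a single list of unique values per id. The values in the "value" column were extracted from the "text" column, and the text for each id is split across several rows, where words from one text element can also appear in another. As a result, a value can occur in several text elements and is therefore recorded more than once.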

Here is a subset of the starting data frame (about 2 million rows):

  id                   text            value
0  a          text 123 text            [123]
1  a  text abc text foo bar  [abc, foo, bar]
2  a      text foo bar text       [foo, bar]
3  b          text xyz text            [xyz]
4  b                   text               []
5  b          text 456 text            [456]

I want to convert the data frame above into the one below. Losing information from the text field is acceptable.

  id           text                 value
0  a  text 123 text  [123, abc, foo, bar]
1  b  text xyz text            [xyz, 456]

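I am looking at a process that splits the lists into rows, merges the separated columns with the starting data frame, and then uses pd.melt. The last step takes a long time, but it may be necessary because I have another data frame containing information about each value, and I want to merge the two data frames using the "value" column as the key. However, I don't think that merge is possible while a list holds several values?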

  value   info
0   123  info1
1   456  info2
2   abc  info3
3   foo  info4
4   bar  info5
5   xyz  info6

Intermediate goal:

  id           text value
0  a  text 123 text   123
1  b  text xyz text   xyz
2  a  text 123 text   abc
3  b  text xyz text   456
4  a  text 123 text   foo
6  a  text 123 text   bar

Final goal:

  id           text value   info
0  a  text 123 text   123  info1
1  a  text 123 text   456  info2
2  a  text 123 text   abc  info3
3  a  text 123 text   foo  info4
4  b  text xyz text   bar  info5
5  b  text xyz text   xyz  info6

Answer 1

For your df I am using agg with a combination of first and list (through set, to drop the duplicates), then unnesting (the helper defined below), and then merge:

s=df.groupby('id').agg({'text':'first','value': lambda x : list(set(x.sum()))})
unnesting(s.reset_index(),['value']).merge(df1,on='value')
Out[307]: 
  value id           text   info
0   abc  a  text 123 text  info3
1   foo  a  text 123 text  info4
2   123  a  text 123 text  info1
3   bar  a  text 123 text  info5
4   456  b  text xyz text  info2
5   xyz  b  text xyz text  info6

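import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column named in `explode` into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the remaining columns by index (axis passed by keyword)
    return df1.join(df.drop(columns=explode), how='left')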
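
As an alternative to the hand-written helper, the same collapse, unnest and merge steps can be sketched with the built-in DataFrame.explode (pandas 0.25 or later; the ignore_index argument needs 1.1 or later). This is only a sketch under some assumptions: the values are treated as strings, the sample frames are retyped from the question, and the helper name unique_in_order is made up for illustration. It also swaps set for dict.fromkeys so that values keep their first-seen order, which the set-based version above does not guarantee.

```python
import pandas as pd

# Sample data retyped from the question (values assumed to be strings).
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "text": ["text 123 text", "text abc text foo bar", "text foo bar text",
             "text xyz text", "text", "text 456 text"],
    "value": [["123"], ["abc", "foo", "bar"], ["foo", "bar"],
              ["xyz"], [], ["456"]],
})
df1 = pd.DataFrame({
    "value": ["123", "456", "abc", "foo", "bar", "xyz"],
    "info": ["info1", "info2", "info3", "info4", "info5", "info6"],
})


def unique_in_order(lists):
    """Flatten a Series of lists, keeping the first occurrence of each value.

    Hypothetical helper, not part of pandas.
    """
    return list(dict.fromkeys(v for sub in lists for v in sub))


# Step 1: one row per id, with the first text and the de-duplicated values.
collapsed = (
    df.groupby("id", as_index=False)
      .agg(text=("text", "first"), value=("value", unique_in_order))
)

# Step 2: one row per (id, value), then attach the per-value info.
result = (
    collapsed.explode("value", ignore_index=True)
             .merge(df1, on="value", how="left")
)
print(result)
```

On the sample data, collapsed holds [123, abc, foo, bar] for id a and [xyz, 456] for id b, and result has one row per value with its info attached. Note that explode turns an empty list into a NaN row, so an id whose lists are all empty would need a dropna on "value" before the merge.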


Pandas - merge rows based on unique values in a list column
