Create a new column with the sum of occurrences from an existing column containing nested lists

Time: 2018-05-04 21:14:35

Tags: python python-3.x pandas dataframe lambda

I have a relatively large dataframe that looks like this:

(I have uploaded the csv file here - ufile.io/526t4)

    value
0   [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
1   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
2   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
3   [[20,79,"D"]]
...
12352   [[25,36,"S"],[37,89,"C"],[90,115,"S"]]
12353   [[1,16,"D"],[17,407,"C"],[408,416,"D"]]
12354   [[16,21,"D"],[22,108,"C"],[109,123,"D"],[124,164,"C"],[165,421,"S"]]
12355 rows × 1 columns

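I want to create a new column that holds, for each row, the sum over all "D" occurrences (second element minus first element of every sublist tagged "D").

Taking the first row as an example: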

x = [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
new_colum_D = (sum([y[1]-y[0] for y in x if y[2]=="D"])) # applied for all rows

For the first row, new_colum_D = 130, that is (92 - 1) + (393 - 354).

I tried the following:

df['Column_D']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

But I get the following message: IndexError: string index out of range

IndexError                                Traceback (most recent call last)
<ipython-input-7-f7f23d42d4e5> in <module>()
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
~\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-7-f7f23d42d4e5> in <lambda>(x)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
<ipython-input-7-f7f23d42d4e5> in <listcomp>(.0)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

IndexError: string index out of range

1 answer:

Answer 0: (score: 0)

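You are very close. You can build the calculation inside a list comprehension and then assign the resulting list to a series.

You may feel that you are vectorising the calculation by using pd.Series.apply, but you are not: apply is just a thinly veiled loop with extra overhead.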

import pandas as pd

# First three rows of the question's data, built as real nested lists
df = pd.DataFrame({'value': [[[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"], [182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]]})

# For each row, sum (end - start) over the sublists tagged "D"; this overwrites 'value'
df['value'] = [sum([y[1]-y[0] for y in x if y[2]=="D"]) for x in df['value']]

print(df)

   value
0    130
1      5
2      5
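
One more point may be worth checking. The error in the question, IndexError: string index out of range, indicates that y is a string inside the comprehension, which would happen if the cells read from the csv are still text rather than nested lists; the example above works because it builds real lists. The following is a minimal sketch under that assumption. The sample strings and the helper name sum_d_spans are made up for illustration, and json.loads is used on the assumption that each cell is valid JSON, as the printed data suggests (ast.literal_eval would be an alternative for Python-style literals).

```python
import json

import pandas as pd

# Hypothetical stand-in for the csv contents: each cell is a string, not a list.
df = pd.DataFrame({'value': [
    '[[1,92,"D"],[93,93,"C"],[354,393,"D"]]',
    '[[18,23,"D"],[24,27,"C"],[28,186,"S"]]',
    '[[20,79,"D"]]',
]})


def sum_d_spans(cell):
    """Parse one JSON-style cell and sum (end - start) over the "D" entries."""
    spans = json.loads(cell)
    return sum(end - start for start, end, kind in spans if kind == "D")


# Same list-comprehension pattern as above, but parsing each cell first
# and writing to a new column instead of overwriting 'value'.
df['Column_D'] = [sum_d_spans(cell) for cell in df['value']]

print(df['Column_D'].tolist())
```

On this sample the new column holds 130, 5 and 59. If the cells are already lists, the parsing step is unnecessary and the plain comprehension is enough.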