Question

I'm new to pandas and am looking for input on whether there is a better way to achieve the following:

I may have millions of records in the following format:

>>> import re
>>> import numpy as np
>>> import pandas as pd
>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]),
... "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
   col A      col B
0  typeA  a.b/c/d/e
1  typeB  a:b:c:d:e
2  typeC  a.b.c.d.e

I need to add a column C to the DataFrame: for typeA it should be a.b, for typeB a, and for typeC a. This is what I have now:

>>> def parseColB(s):
...     col_split=re.split('[:,/,.]',s)
...     if len(col_split) < 2:
...             return ""
...     return col_split[0]
...

I add the new column with the following apply call:

>>> s = s.assign(ColC = s["col B"].apply(parseColB))
>>> s
   col A      col B ColC
0  typeA  a.b/c/d/e    a
1  typeB  a:b:c:d:e    a
2  typeC  a.b.c.d.e    a

The problem with this approach is that for typeA I get "a" in ColC instead of "a.b". Is there a way to add ColC efficiently based on the "col A" value?
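To illustrate why this happens, here is a minimal check of the regular expression on its own. The character class in `parseColB` matches any of `:`, `,`, `/` and `.`, so the typeA value is split at the first `.` before the `/` is ever reached:

```python
import re

# '[:,/,.]' is a character class: it matches ':', ',', '/' or '.',
# so every one of these characters acts as a delimiter.
parts = re.split('[:,/,.]', 'a.b/c/d/e')
print(parts)  # ['a', 'b', 'c', 'd', 'e']

# parseColB returns the first piece, which is 'a' rather than 'a.b'.
first = parts[0]
```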

Following Henry's comment, I tried the suggestion from the duplicate question. I almost have it working:

>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]),
...  "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
   col A      col B
0  typeA  a.b/c/d/e
1  typeB  a:b:c:d:e
2  typeC  a.b.c.d.e
>>> choices = [s['col B'].str.split("/"), s['col B'].str.split(":"), s['col B'].str.split(".")]
>>> conditions = [s['col A'] == 'typeA', s['col A'] == 'typeB', s['col A'] == 'typeC']
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
   col A      col B            col C
0  typeA  a.b/c/d/e   [a.b, c, d, e]
1  typeB  a:b:c:d:e  [a, b, c, d, e]
2  typeC  a.b.c.d.e  [a, b, c, d, e]

I updated choices to use apply, and that gives the desired result. Is this the right approach, or is there a more optimized way?

>>> choices = [s['col B'].str.split("/").apply(lambda x : x[0]), s['col B'].str.split(":").apply(lambda x : x[0]), s['col B'].str.split(".").apply(lambda x : x[0])]
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
   col A      col B col C
0  typeA  a.b/c/d/e   a.b
1  typeB  a:b:c:d:e     a
2  typeC  a.b.c.d.e     a
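One possible refinement of the snippet above, offered as a sketch rather than a benchmarked recommendation: the `.str[0]` accessor can take the first element of each split list in place of `apply(lambda x: x[0])`, and `n=1` stops splitting after the first delimiter. It assumes every row's type is one of the three shown; whether it is actually faster on millions of rows would need to be measured.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame({
    "col A": pd.Categorical(["typeA", "typeB", "typeC"]),
    "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"]),
})

conditions = [s["col A"] == "typeA", s["col A"] == "typeB", s["col A"] == "typeC"]

# Split at most once on each type's delimiter, then take the first piece
# with the .str accessor instead of a Python-level lambda.
choices = [
    s["col B"].str.split("/", n=1).str[0],
    s["col B"].str.split(":", n=1).str[0],
    s["col B"].str.split(".", n=1).str[0],
]

# Rows matching none of the conditions fall back to the empty string.
s["col C"] = np.select(conditions, choices, default="")
print(s)
```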

Answer 1

You can do it like this.

s["col C"] = s["col B"].str.split('/|:').apply(lambda x: x[0]).apply(lambda x: ''.join([x.split('\.')[0][0] if (len(x)>3) else x]))

Output

     col A     col B    col C
0   typeA   a.b/c/d/e   a.b
1   typeB   a:b:c:d:e   a
2   typeC   a.b.c.d.e   a
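
Note that the one-liner above never looks at `col A`: it decides by string length (`len(x) > 3`), and `x.split('\.')` is Python's plain `str.split` on a literal backslash followed by a dot, so nothing is split and `[0][0]` simply returns the first character. That gives the right output for the sample rows but may not for longer values. A minimal alternative sketch, assuming a hypothetical `delims` mapping from each type to its delimiter (not part of the original post):

```python
import pandas as pd

s = pd.DataFrame({
    "col A": pd.Categorical(["typeA", "typeB", "typeC"]),
    "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"]),
})

# Hypothetical mapping from the "col A" type to the delimiter used in "col B".
delims = {"typeA": "/", "typeB": ":", "typeC": "."}

# Split each value once on its own type's delimiter and keep the first piece.
s["col C"] = [
    value.split(delims[kind], 1)[0]
    for kind, value in zip(s["col A"], s["col B"])
]
print(s)
```

A type missing from `delims` raises a `KeyError` here, which may be preferable to silently producing a wrong value.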
