This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but it additionally asks how to expand ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1,2,4-6 | bob@email.com |
+------+---------+----------------+
| John | NaN | john@email.com |
+------+---------+----------------+
| Mary | 1,2 | mary@email.com |
+------+---------+----------------+
| Jane | 1,3-5 | jane@email.com |
+------+---------+----------------+
I want to split the Options column on commas and also add a row for every value inside each range:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1 | bob@email.com |
+------+---------+----------------+
| Bob | 2 | bob@email.com |
+------+---------+----------------+
| Bob | 4 | bob@email.com |
+------+---------+----------------+
| Bob | 5 | bob@email.com |
+------+---------+----------------+
| Bob | 6 | bob@email.com |
+------+---------+----------------+
| John | NaN | john@email.com |
+------+---------+----------------+
| Mary | 1 | mary@email.com |
+------+---------+----------------+
| Mary | 2 | mary@email.com |
+------+---------+----------------+
| Jane | 1 | jane@email.com |
+------+---------+----------------+
| Jane | 3 | jane@email.com |
+------+---------+----------------+
| Jane | 4 | jane@email.com |
+------+---------+----------------+
| Jane | 5 | jane@email.com |
+------+---------+----------------+
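How can I go beyond the concat and split approach described in the referenced SO post to achieve this? I need a way to expand the ranges.
That post uses the following code to split comma-delimited values (1,2,3):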
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Thanks in advance for your suggestions!
Update 2/14: The sample data has been updated to match my current situation.
Answer 0 (score: 6)
If I understand what you need:
def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]
df.Options=df.Options.apply(yourfunc)
df
Out[114]:
Name Options Email
0 Bob [1, 2, 4, 5, 6] bob@email.com
1 Jane [1, 3, 4, 5] jane@email.com
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
Out[116]:
Name Email 0
0 Bob bob@email.com 1.0
1 Bob bob@email.com 2.0
2 Bob bob@email.com 4.0
3 Bob bob@email.com 5.0
4 Bob bob@email.com 6.0
5 Jane jane@email.com 1.0
6 Jane jane@email.com 3.0
7 Jane jane@email.com 4.0
8 Jane jane@email.com 5.0
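As a hedged alternative to the apply(pd.Series).stack() step above, the following sketch uses DataFrame.explode, which is available in pandas 0.25 and later and was not part of this answer. The helper name expand_options is hypothetical. It assumes every non-missing entry is a comma-separated list of integers or integer ranges, and it keeps rows with a missing Options value (such as John) by returning a one-element list for them.

```python
import numpy as np
import pandas as pd


def expand_options(s):
    """Expand '1,2,4-6' into [1, 2, 4, 5, 6]; pass missing values through."""
    if pd.isna(s):
        return [np.nan]  # one-element list so the row survives explode
    out = []
    for part in str(s).split(","):
        bounds = part.strip().split("-")
        # A single value such as '2' gives bounds[0] == bounds[-1].
        out.extend(range(int(bounds[0]), int(bounds[-1]) + 1))
    return out


df = pd.DataFrame(
    {
        "Name": ["Bob", "John", "Mary", "Jane"],
        "Options": ["1,2,4-6", np.nan, "1,2", "1,3-5"],
        "Email": ["bob@email.com", "john@email.com",
                  "mary@email.com", "jane@email.com"],
    }
)

result = (
    df.assign(Options=df["Options"].map(expand_options))
    .explode("Options")          # one row per list element
    .reset_index(drop=True)
)
print(result)
```

The exploded Options column has object dtype; cast it afterwards if a numeric dtype is needed.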
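Answer 1 (score: 5)
Start with a custom replacement function:
def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))
Store the column names somewhere; we will use them later:
c = df.columns
Next, replace the ranges in df.Options, then split on commas:
# regex=True is required for a callable replacement in recent pandas
v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
Next, reshape your data and finally load it into a new dataframe:
df = pd.DataFrame(
    df.drop(columns='Options').values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c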
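To see what the replacement callback does in isolation, here is a minimal sketch that applies the same function to a single string with the standard library's re.sub instead of the pandas string accessor. It is only an illustration of the callback mechanics, not part of the answer's pipeline.

```python
import re


def replace(match):
    # match.groups() holds the two captured endpoints, e.g. ('4', '6')
    i, j = map(int, match.groups())
    return ",".join(map(str, range(i, j + 1)))


# Each 'start-end' token is rewritten as the full comma-separated run.
expanded = re.sub(r"(\d+)-(\d+)", replace, "1,2,4-6")
print(expanded)             # 1,2,4,5,6
print(expanded.split(","))  # the pieces are still strings at this point
```

Note that after splitting, the values are strings, so a cast is needed if integers are required downstream.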
Answer 2 (score: 5)
I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.
def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))
r = df.Options.apply(parse)
new = np.concatenate(r.values)
lens = r.str.len()
df.loc[df.index.repeat(lens)].assign(Options=new)
Name Options Email
0 Bob 1 bob@email.com
0 Bob 2 bob@email.com
0 Bob 4 bob@email.com
0 Bob 5 bob@email.com
0 Bob 6 bob@email.com
2 Mary 1 mary@email.com
2 Mary 2 mary@email.com
3 Jane 1 jane@email.com
3 Jane 3 jane@email.com
3 Jane 4 jane@email.com
3 Jane 5 jane@email.com
Explanation
np.r_ takes different slicers and indexers and returns an array of the combination.
np.r_[1, 4:7]
array([1, 4, 5, 6])
or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
But if I need to pass an arbitrary collection of slices, I have to pass a tuple to the __getitem__ method of np.r_.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1, 4, 5, 6, 10, 11, 12, 13])
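So I iterate, parse, build the slices, and pass them to np.r_.__getitem__.
After applying my cool parser, I use a combination of loc, pd.Index.repeat, and pd.Series.str.len to repeat each row once per parsed value, then pd.DataFrame.assign to overwrite the existing column.
Note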
If your Options column contains bad characters, I would try filtering them out like this:
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
.replace(dict(Options={'[^0-9,\-]': ''}), regex=True) \
.query('Options != ""')
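The repeat-and-assign step described in this answer can be sketched on a tiny frame. This is a minimal illustration under stated assumptions: the parsed Series below is a hand-written stand-in for the output of the parser (plain lists rather than the arrays np.r_ returns), and the two-row frame is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "Jane"], "Options": ["1,2,4-6", "1,3-5"]})

# Stand-in for the parsed column: one list of integers per row.
parsed = pd.Series([[1, 2, 4, 5, 6], [1, 3, 4, 5]], index=df.index)

lens = parsed.str.len()                   # number of values per row: 5 and 4
repeated = df.loc[df.index.repeat(lens)]  # each row duplicated lens times
out = repeated.assign(Options=np.concatenate(parsed.tolist()))
print(out)
```

The original index labels are repeated in the result, which matches the 0, 0, 0, ... index shown in the answer's output; call reset_index(drop=True) if a fresh index is preferred.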
Answer 3 (score: 4)
Here is one solution. It is not pretty (it makes minimal use of pandas), but it is efficient.
import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable
def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                       list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                       for i in mystr.split(',')))
df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
['Jane', '1,3-5', 'jane@email.com']],
columns=['Name', 'Options', 'Email'])
df['Options'] = df['Options'].map(ranger)
lens = list(map(len, df['Options']))
df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
'Email': np.repeat(df['Email'].values, lens),
'Option': np.hstack(df['Options'].values)})
# Email Name Option
# 0 bob@email.com Bob 1
# 1 bob@email.com Bob 2
# 2 bob@email.com Bob 4
# 3 bob@email.com Bob 5
# 4 bob@email.com Bob 6
# 5 jane@email.com Jane 1
# 6 jane@email.com Jane 3
# 7 jane@email.com Jane 4
# 8 jane@email.com Jane 5
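Benchmarks for the 4 solutions are below (for interest only).
As a general rule, the repeat-based variants perform well. In addition, solutions that build a new dataframe from scratch (rather than relying on apply) do better. Dropping down to numpy yields the best results.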
import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable
def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                       list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                       for i in mystr.split(',')))
def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))
def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]
def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(mm(list(map(int, s.strip().split('-')))) for s in o.split(',')))
df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
                   ['Jane', '1,3-5', 'jane@email.com']],
                  columns=['Name', 'Options', 'Email'])
df = pd.concat([df]*1000, ignore_index=True)
def explode_jp(df):
    df['Options'] = df['Options'].map(ranger)
    lens = list(map(len, df['Options']))
    df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                           'Email': np.repeat(df['Email'].values, lens),
                           'Option': np.hstack(df['Options'].values)})
    return df_out
def explode_cs(df):
    c = df.columns
    # regex=True is required for a callable replacement in recent pandas
    v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
    df_out = pd.DataFrame(df.drop(columns='Options').values.repeat(v.str.len(), axis=0))
    df_out.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
    df_out.columns = c
    return df_out
def explode_wen(df):
    df.Options=df.Options.apply(yourfunc)
    df_out = df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
    return df_out
def explode_pir(df):
    r = df.Options.apply(parse)
    df_out = df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r))
    return df_out
%timeit explode_jp(df.copy()) # 32.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_cs(df.copy()) # 90.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_wen(df.copy()) # 675 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_pir(df.copy()) # 163 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
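The %timeit lines above are IPython magics and will not run in a plain Python script. As a rough sketch of how one of these measurements could be reproduced outside IPython, the following uses the standard library's timeit module. The function name explode_repeat is hypothetical; it is a condensed variant of the repeat-based approach, not the exact code benchmarked above, and the timings it prints will differ by machine and library version.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd


def ranger(mystr):
    """Expand '1,2,4-6' into [1, 2, 4, 5, 6]."""
    return list(chain.from_iterable(
        range(int(p.split("-")[0]), int(p.split("-")[-1]) + 1)
        for p in mystr.split(",")
    ))


def explode_repeat(df):
    """Build a new frame by repeating each row once per expanded option."""
    opts = df["Options"].map(ranger)
    lens = opts.str.len().to_numpy()
    return pd.DataFrame({
        "Name": np.repeat(df["Name"].to_numpy(), lens),
        "Email": np.repeat(df["Email"].to_numpy(), lens),
        "Option": np.concatenate(opts.tolist()),
    })


base = pd.DataFrame(
    [["Bob", "1,2,4-6", "bob@email.com"], ["Jane", "1,3-5", "jane@email.com"]],
    columns=["Name", "Options", "Email"],
)
big = pd.concat([base] * 1000, ignore_index=True)

# number=10 mirrors the "10 loops" used in the measurements above.
seconds = timeit.timeit(lambda: explode_repeat(big.copy()), number=10)
print(f"{seconds / 10 * 1000:.1f} ms per call")
```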