Here is a simplified dataset:
   Character    x0    x1
0          T   0.0   1.0
1          h   1.1   2.1
2          i   2.2   3.2
3          s   3.3   4.3
5          i   5.5   6.5
6          s   6.6   7.6
8          a   8.8   9.8
10         s  11.0  12.0
11         a  12.1  13.1
12         m  13.2  14.2
13         p  14.3  15.3
14         l  15.4  16.4
15         e  16.5  17.5
16         .  17.6  18.6
The simplified dataset was generated with the following code:
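import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    # each character starts 0.1 after the previous one ends and is 1 wide
    x0.append(round(x1[-1]+0.1,1))
    x1.append(round(x0[-1]+1,1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
# drop the spaces so that words are separated by a larger gap
df = df.drop(df.loc[df['Character'] == ' '].index)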
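x0 and x1 are the start and end positions of each character. Assume the distance between any two adjacent characters is 0.1. In other words, if the difference between a character's x0 and the previous character's x1 is 0.1, the two characters belong to the same string. If the difference is greater than 0.1, the character starts a new string, and so on. I need to produce a DataFrame of strings with their respective x0 and x1, which I currently do by looping over the DataFrame with .iterrows():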
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        # the first character always starts the first string
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1],1) == 0.1:
            # same string: append the character and extend the end position
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            # gap larger than 0.1: start a new string
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])
The result is:
    String    x0    x1
0     This   0.0   4.3
1       is   5.5   7.6
2        a   8.8   9.8
3  sample.  11.0  18.6
Is there a faster way to achieve this?
Answer 0 (score: 1)
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
  Character    x0    x1
0      This   0.0   4.3
1        is   5.5   7.6
2         a   8.8   9.8
3   sample.  11.0  18.6
The tricky part is this:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
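The idea is to convert the difference column (same) into a True/False column in which every True means that a new group has to start. cumsum then takes care of assigning the same id to every row of a group.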
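To make the mask-then-cumsum step concrete, here is a minimal sketch on made-up numbers. The `gaps` values are hypothetical and are not taken from the dataset above; the threshold mirrors the one used in the answer.

```python
import pandas as pd

# Why a tolerance is needed: floating-point subtraction is not exact.
exact_match = (1.1 - 1.0 == 0.1)  # evaluates to False

# Hypothetical gaps between each character and the previous one
# (the first entry is 0 because there is no previous character).
gaps = pd.Series([0.0, 0.1, 0.1, 1.2, 0.1, 1.2, 0.1])

# True wherever the gap is clearly larger than 0.1, i.e. a new group starts.
starts_new_group = (gaps - 0.1) > 0.00001

# Each True increases the running total by one, so every row ends up
# carrying the id of the group it belongs to.
group_id = starts_new_group.cumsum()

print(starts_new_group.tolist())
print(group_id.tolist())
```

Rows that share a value in `group_id` are the ones that `groupby` later collapses into a single string.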
As suggested by @ShubhamSharma, you could also do this:
# create boolean column: True where the gap to the previous character exceeds 0.1
# (rounding takes care of the floating point problems)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column
grouper = same.cumsum()
The rest stays the same.
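For reference, the pieces can be put together in one self-contained sketch. The helper name `merge_characters` and its `gap` and `tol` parameters are assumptions made for illustration and are not part of the original answer; the tolerance comparison is a variation on the rounding approach above.

```python
import pandas as pd

def merge_characters(df, gap=0.1, tol=1e-3):
    """Join adjacent characters into strings (hypothetical helper)."""
    # Gap between each character's start and the previous character's end;
    # the first row has no predecessor, so its gap is filled with 0.
    gaps = (df['x0'] - df['x1'].shift()).fillna(0).abs()
    # A gap noticeably larger than `gap` marks the start of a new string.
    grouper = (gaps > gap + tol).cumsum()
    return (df.groupby(grouper)
              .agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})
              .rename(columns={'Character': 'String'})
              .reset_index(drop=True))

# Rebuild the simplified dataset from the question.
ch, x0, x1 = ['T'], [0], [1]
for s in 'his is a sample.':
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

df_string = merge_characters(df)
print(df_string)
```

Comparing against `gap + tol` avoids the explicit rounding step; whether a tolerance or rounding reads better is largely a matter of taste, as long as `tol` stays well below the smallest real gap between words.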