Question

Here is a simplified dataset:

   Character    x0    x1
0          T   0.0   1.0
1          h   1.1   2.1
2          i   2.2   3.2
3          s   3.3   4.3
5          i   5.5   6.5
6          s   6.6   7.6
8          a   8.8   9.8
10         s  11.0  12.0
11         a  12.1  13.1
12         m  13.2  14.2
13         p  14.3  15.3
14         l  15.4  16.4
15         e  16.5  17.5
16         .  17.6  18.6

The simplified dataset is generated by the following code:

import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1]+0.1,1))
    x1.append(round(x0[-1]+1,1))

df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)

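x0 and x1 are the start and end positions of each character. Assume the distance between any two adjacent characters is 0.1. In other words, if the difference between a character's x0 and the previous character's x1 is 0.1, the two characters belong to the same string. If the difference is greater than 0.1, the character starts a new string, and so on. I need to produce a DataFrame of strings with their respective x0 and x1, which I currently do by looping over the DataFrame with .iterrows():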

string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1],1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])

The result is:

    String    x0    x1
0     This   0.0   4.3
1       is   5.5   7.6
2        a   8.8   9.8
3  sample.  11.0  18.6

Is there a faster way to achieve this?

Answer 1

You can use groupby + agg:

# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)

Output

  Character    x0    x1
0      This   0.0   4.3
1        is   5.5   7.6
2         a   8.8   9.8
3   sample.  11.0  18.6

The tricky part is this:

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

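The idea is to convert the difference column (same) into a column of True/False values, where each True means a new group has to start. cumsum then takes care of assigning the same id to every row in a group.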
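To make the mask-then-cumsum step concrete, here is a minimal sketch on a handful of gaps taken from the sample data. The variable names (gap, starts_new, group_id) are only illustrative and are not part of the answer above.

```python
import pandas as pd

# Gaps between each character's x0 and the previous character's x1
# (the first character has no predecessor, so its gap is set to 0)
gap = pd.Series([0.0, 1.1 - 1.0, 2.2 - 2.1, 3.3 - 3.2, 5.5 - 4.3, 6.6 - 6.5, 8.8 - 7.6])

# 1.1 - 1.0 is not exactly 0.1 in binary floating point, hence the small tolerance
print(1.1 - 1.0 == 0.1)  # False

# True marks the first character of a new string
starts_new = (gap - 0.1) > 0.00001

# cumsum counts the True values seen so far, so each break bumps the group id by one
group_id = starts_new.cumsum()
print(group_id.tolist())  # [0, 0, 0, 0, 1, 1, 2]
```

Rows that share a group id are the ones groupby later joins into a single string.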

As suggested by @ShubhamSharma, you could do this instead:

# flag rows whose gap to the previous character exceeds 0.1
# (rounding first sidesteps the floating point problem)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)

# create grouper column
grouper = same.cumsum()

Everything else stays the same.
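For reference, the pieces can be put together into one self-contained script along these lines. This is only a sketch: group_characters is a hypothetical helper name, the gap parameter and the renaming of Character to String are additions made here to match the output asked for in the question, and the sample data is rebuilt with a list comprehension rather than the question's loop.

```python
import pandas as pd

def group_characters(df, gap=0.1):
    """Hypothetical helper: merge adjacent characters into strings."""
    # Distance to the previous character's end; the first row has no
    # predecessor, so filling with its own x0 gives a distance of 0
    dist = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3)
    # A distance larger than the expected gap starts a new group
    grouper = dist.gt(gap).cumsum()
    res = df.groupby(grouper).agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})
    return res.rename(columns={'Character': 'String'}).reset_index(drop=True)

# Rebuild the sample data from the question
chars = list('This is a sample.')
x0 = [round(i * 1.1, 1) for i in range(len(chars))]
x1 = [round(v + 1, 1) for v in x0]
df = pd.DataFrame({'Character': chars, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

print(group_characters(df))
```

On the sample data this should give the same four rows as the .iterrows() loop in the question (This, is, a, sample.), without iterating row by row in Python.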
