Question

I have a CSV file with a single column, "notes". I want to merge rows of the DataFrame based on certain conditions.

The input looks like this:

import pandas as pd

Input_data={'notes':
            ['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']}

df_in = pd.DataFrame(Input_data)

I want to merge each row that contains "(number)" or "*" with the rows that follow it, so the output should look like this:

output_Data={'notes': ['aaa','bbb','*hello','**my name is xyz', '(1) this is temp name', '(2) BTW how to solve this', '(3) with python','I don’t want this to be added ', 'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data)

Rows that cannot be merged should be kept as they are. For the last marker there is no reliable way to know how far the merge should extend, so let's say only the next row is added to it. I have solved this, but my solution is very long. Is there a simpler way?

Answer 1

You can use a mask so that you don't have to loop over every row:

import pandas as pd

df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']})

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# Find the indexes of the marker rows; a marker in the last row has nothing to merge with
index_to_merge = df[df['row'].isin(special)].index.values
index_to_merge = index_to_merge[index_to_merge < len(df) - 1]
for index in index_to_merge:
    df.loc[index, 'row'] += ' ' + df.loc[index+1, 'row']

# Delete the rows that were just merged into the row above them
index_to_delete = index_to_merge + 1
df = df.drop(index_to_delete)
print(df)

Out:

    row
0   aaa
1   bbb
2   * hello
4   ** my name
6   is xyz
7   (1) this is
9   temp
10  name
11  (2) BTW
13  how to
14  solve this
15  (3) with python
17  I don’t want this to be added
18  I don’t want this to be added
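
If you want to get rid of the remaining Python loop as well, the same "marker plus next row" merge can probably be written with shift. This is only a sketch on a shortened input: the regular expression is my own assumption for "*", "**" and "(number)" (it also accepts longer runs of * and any number), and the names is_marker, is_absorbed and next_text are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({'row': ['aaa', '*', 'hello', '(1)', 'this is', 'temp']})

# True on rows that consist only of a marker (assumed pattern, see above)
is_marker = df['row'].str.fullmatch(r'\*+|\(\d+\)')

# True on the row directly below a marker: that row gets absorbed
is_absorbed = is_marker.shift(1, fill_value=False)

# Text of the following row, aligned with the current row (NaN on the last row)
next_text = df['row'].shift(-1)

# Append the next row's text to every marker that actually has a next row
can_merge = is_marker & next_text.notna()
merged = df['row'].where(~can_merge, df['row'] + ' ' + next_text)

# Keep everything except the absorbed rows
result = merged[~is_absorbed].tolist()
print(result)
```

Like the loop version, this only merges one row per marker.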

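You can also convert the column to a NumPy array and use NumPy functions to simplify the work. First, use np.where and np.isin to find the indexes that have to be merged. That way you don't need a for loop over the whole array.

Then you can do the concatenation at the corresponding indexes. Finally, you can delete the values that have been merged. It could look like this: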

import numpy as np

# dtype=object stores full Python strings, so longer merged values are never cut off
list_to_merge = np.array(['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added '], dtype=object)
special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

ix = np.isin(list_to_merge, special)
# Skip a marker in the last position: it has no following row to merge with
rows_to_merge = np.where(ix)[0]
rows_to_merge = rows_to_merge[rows_to_merge < len(list_to_merge) - 1]

# Merge each marker with the row that follows it
for index_to_merge in rows_to_merge:
    list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge+1]

# Delete the rows that have just been merged
rows_to_delete = rows_to_merge + 1
list_to_merge = np.delete(list_to_merge, rows_to_delete)

Out:

['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
       'temp', 'name', '(2) BTW', 'how to', 'solve this',
       '(3) with python', 'I don’t want this to be added ',
       'I don’t want this to be added ']
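
One caveat with string arrays in NumPy: by default they get a fixed-width dtype sized to the longest element, and assigning a longer string silently truncates it. A small illustration (the array names fixed and flexible are mine):

```python
import numpy as np

# Default string dtype: width is fixed to the longest element (7 characters here)
fixed = np.array(['(1)', 'this is'])
fixed[0] = fixed[0] + ' ' + fixed[1]   # the merged text needs 11 characters
print(repr(fixed[0]))                  # stored value is cut to 7 characters

# An object array stores ordinary Python strings, so nothing is cut off
flexible = np.array(['(1)', 'this is'], dtype=object)
flexible[0] = flexible[0] + ' ' + flexible[1]
print(repr(flexible[0]))
```

So if the merged rows can become longer than the longest original row, an object array (or a plain list) is the safer container.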

Answer 2

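The solution below identifies the special markers at the start of a string, such as *, ** and (number), and merges the rows that follow each marker into it, except for the last row.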

import pandas as pd

df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']})

# Pattern for strings starting with (number), * or **
pattern = r"^\(\d+\)|^\*+"

# Select the indexes of the rows that match the pattern
selected_index = df[df["row"].str.contains(pattern)].index.values
delete_index = []
for index in selected_index:
    i = 1
    # Merge rows until the next selected index or the last row is reached,
    # and record the merged rows in delete_index
    while index+i not in selected_index and index+i < len(df)-1:
        df.at[index, 'row'] += ' ' + df.at[index+i, 'row']
        delete_index.append(index+i)
        i += 1

df.drop(delete_index, inplace=True)
print(df)

Output:

    row
0   aaa
1   bbb
2   * hello
4   ** my name is xyz
7   (1) this is temp name
11  (2) BTW how to solve this
15  (3) with python I don’t want this to be added
18  I don’t want this to be added

You can reset the index if you need to, using df.reset_index().
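
For example, on a small frame with gaps in its index like the one left after drop (the sample data here is made up), passing drop=True renumbers the rows without keeping the old index as a column:

```python
import pandas as pd

# Index with gaps, as left behind after dropping the merged rows
df = pd.DataFrame({'row': ['aaa', '* hello', '** my name']}, index=[0, 2, 4])

# drop=True discards the old index instead of adding it as an 'index' column
renumbered = df.reset_index(drop=True)
print(renumbered)
```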

Answer 3

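I think it gets easier if you design the logic to split df_in into 3 parts: top, middle and bottom. Keep top and bottom intact while joining the middle part. Finally, concatenate the 3 parts into df_out.

First, create the masks m1 and m2 to split df_in into the 3 parts.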

m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 =  ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes
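
To see what cummax() contributes to m1, here is a tiny sketch on made-up data: once the first marker is found, the mask stays True for every row after it, so ~m1 selects exactly the rows before the first marker.

```python
import pandas as pd

notes = pd.Series(['aaa', 'bbb', '*', 'hello', '(1)', 'this is'])

# True only on the marker rows themselves
is_marker = notes.str.contains(r'^\*+|\(\d+\)$')

# Running maximum of a boolean mask: False until the first True, then True
from_first_marker = is_marker.cummax()
print(from_first_marker.tolist())
```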

Next, create groupby_mask to group the rows, then group by it and join:

groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)

Out[3110]:
notes
1                      * hello
2            ** my name is xyz
3        (1) this is temp name
4    (2) BTW how to solve this
5              (3) with python
Name: notes, dtype: object
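
The grouping trick may be easier to follow on a shortened series (sample data and the name group_id are mine): cumsum() over the marker mask increases by one at every marker, so each marker and the rows after it share one group number.

```python
import pandas as pd

middle = pd.Series(['*', 'hello', '(1)', 'this is', 'temp'])

# True on marker rows
is_marker = middle.str.contains(r'^\*+|\(\d+\)$')

# Running count of markers seen so far: one group id per marker block
group_id = is_marker.cumsum()
print(group_id.tolist())

# Join the strings inside each block with a single space
joined = middle.groupby(group_id).agg(' '.join)
print(joined.tolist())
```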

Finally, use pd.concat to concatenate top, middle_join and bottom:

df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()

Out[3114]:
                            notes
0                             aaa
1                             bbb
2                         * hello
3               ** my name is xyz
4           (1) this is temp name
5       (2) BTW how to solve this
6                 (3) with python
7  I don’t want this to be added
8  I don’t want this to be added
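
Note that m2 above matches the literal text of the bottom rows. If that text is not known in advance, one possible alternative is to apply the rule from the question directly: every marker absorbs the rows up to the next marker, and the last marker absorbs only the next row. The following is a plain-Python sketch under that assumption; merge_notes and MARKER are hypothetical names, and the pattern is my guess at what counts as a marker.

```python
import re

# Assumed marker definition: one or more '*', or a number in parentheses
MARKER = re.compile(r'\*+|\(\d+\)')

def merge_notes(notes):
    """Hypothetical helper: merge each marker with the rows that follow it."""
    positions = [i for i, text in enumerate(notes) if MARKER.fullmatch(text.strip())]
    if not positions:
        return list(notes)
    # Rows before the first marker are kept as they are
    out = list(notes[:positions[0]])
    for pos, nxt in zip(positions, positions[1:] + [None]):
        # Last marker: take only the next row; otherwise stop at the next marker
        end = min(pos + 2, len(notes)) if nxt is None else nxt
        out.append(' '.join(notes[pos:end]))
    # Rows after the last merged block are kept as they are
    out.extend(notes[end:])
    return out

notes = ['aaa', '*', 'hello', '(1)', 'this is', 'temp',
         '(2)', 'with python', 'keep', 'keep too']
merged = merge_notes(notes)
print(merged)
```

The result can be turned back into a frame with pd.DataFrame({'notes': merged}) if needed.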

Merging rows of a DataFrame based on a condition
