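How do I extract the contents of <strong> tags from a DataFrame column and append them to, or replace, that cell?

Asked: 2019-10-31 09:34:39

Tags: python pandas

I have a DataFrame whose columns contain bold text (wrapped in <strong> tags) that I need to extract. There are 53,000 rows, and 27 columns contain bold text.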

array(['Candidate initial submission',
      'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
      'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30'],
     dtype=object)

1 answer:

Answer 0 (score: 0)

Use pandas.Series.str.extractall

import pandas as pd

lst = ['Candidate initial submission',
 'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
 'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30']


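# Build a one-column DataFrame from the sample strings.
df = pd.DataFrame(data=lst, columns=['text'])

# Extract every <strong>...</strong> match; the result has one row per match.
result = df.text.str.extractall(r'<strong>(.+?)</strong>')
print(result)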

Output

                         0
  match                   
1 0           CV Submitted
  1       Feedback Pending
  2       Client CV Review
  3      Feedback Awaiting
2 0           CV Submitted
  1       Feedback Pending
  2            Interview 1
  3              Scheduled
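
The result has one row per match, indexed by the original row number and the match number, so row 0 (which has no <strong> tags) does not appear at all. To get back to one value per original row, as the question asks, the matches can be grouped on the first index level and joined. This is a minimal sketch on shortened sample data; the ', ' separator and the column names strong_text and replaced are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Shortened sample data (assumed for illustration).
lst = [
    'Candidate initial submission',
    'Updated from <strong>CV Submitted</strong> to <strong>Interview 1</strong>',
]
df = pd.DataFrame(data=lst, columns=['text'])

# Column 0 holds the single capture group of the pattern.
matches = df['text'].str.extractall(r'<strong>(.+?)</strong>')[0]

# Collapse the (row, match) MultiIndex back to one string per original row.
joined = matches.groupby(level=0).agg(', '.join)

# Append: assignment aligns on the index, so rows without a match get NaN.
df['strong_text'] = joined

# Replace: use the extracted text where it exists, else keep the original cell.
df['replaced'] = joined.reindex(df.index).fillna(df['text'])
```

With 27 such columns, the same steps could be repeated in a loop over the column names.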

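The regular expression pattern '<strong>(.+?)</strong>' matches everything between <strong> and </strong>. The quantifier .+? is non-greedy, so each match captures as little text as possible and stops at the first closing </strong>. To learn more about regular expressions, see here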
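
The effect of the non-greedy quantifier is easiest to see next to its greedy counterpart. The sketch below uses Python's built-in re module on a shortened sample string (an assumption for illustration); pandas string methods use the same pattern syntax.

```python
import re

# Shortened sample string with two <strong> elements.
s = 'from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong>'

# Non-greedy: each match stops at the first closing tag.
lazy = re.findall(r'<strong>(.+?)</strong>', s)

# Greedy: a single match runs on to the last closing tag in the string.
greedy = re.findall(r'<strong>(.+)</strong>', s)

print(lazy)    # ['CV Submitted', 'Feedback Pending']
print(greedy)  # ['CV Submitted</strong> and <strong>Feedback Pending']
```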