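I have a DataFrame whose columns contain bold text (wrapped in <strong> tags) that I need to extract. There are 53,000 rows, and 27 of the columns contain bold text. Sample values: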
array(['Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30'],
dtype=object)
Answer 0 (score: 0)
Use pandas.Series.str.extractall:
import pandas as pd
lst = ['Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30']
df = pd.DataFrame(data=lst, columns=['text'])
# extractall returns one row per match, indexed by (original row, match number)
result = df.text.str.extractall(r'<strong>(.+?)</strong>')
print(result)
Output
                         0
  match
1 0           CV Submitted
  1       Feedback Pending
  2       Client CV Review
  3      Feedback Awaiting
2 0           CV Submitted
  1       Feedback Pending
  2            Interview 1
  3              Scheduled
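Since the question mentions 27 such columns, the same pattern can be applied column by column. The following is only a minimal sketch under assumptions: the column names status_a and status_b and the helper extract_bold are hypothetical, and it uses Series.str.findall (one list per cell) instead of extractall so that the result keeps the shape of the original frame.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the 27 columns in the question.
df = pd.DataFrame({
    "status_a": [
        "Candidate initial submission",
        "from <strong>CV Submitted</strong> to <strong>Client CV Review</strong>",
    ],
    "status_b": [
        "updated to <strong>Interview 1</strong> and <strong>Scheduled</strong>",
        None,
    ],
})

PATTERN = r"<strong>(.+?)</strong>"


def extract_bold(frame, columns):
    """Return a frame holding one list of bold fragments per cell (hypothetical helper)."""
    out = pd.DataFrame(index=frame.index)
    for col in columns:
        # str.findall yields the captured groups for each row as a list;
        # cells with no match give an empty list, missing cells stay missing.
        out[col] = frame[col].str.findall(PATTERN)
    return out


bold = extract_bold(df, ["status_a", "status_b"])
print(bold)
```

Rows without any bold text (such as 'Candidate initial submission') are kept here with an empty list, whereas extractall drops them from its result.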
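The regular expression pattern '<strong>(.+?)</strong>' matches everything between <strong> and </strong>, capturing as little text as possible (the ? after + makes the match non-greedy). To learn more about regular expressions, see here.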
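To see why the non-greedy form matters, here is a small illustration using only the standard re module on a shortened sample string (the string is made up for the example). It compares (.+?) with the greedy (.+):

```python
import re

text = "from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong>"

# Non-greedy: stop at the first closing tag after each opening tag.
lazy = re.findall(r"<strong>(.+?)</strong>", text)

# Greedy: run on to the last closing tag in the string.
greedy = re.findall(r"<strong>(.+)</strong>", text)

print(lazy)    # ['CV Submitted', 'Feedback Pending']
print(greedy)  # ['CV Submitted</strong> and <strong>Feedback Pending']
```

With the greedy version, the two bold fragments are merged into a single match that still contains the inner tags, which is not what is wanted here.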