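My DataFrame has two columns, "Subject" and "Description". I am trying to clean the "Description" column by splitting it on the text from the "Subject" column, because that text is contained in all the rows of "Description".
Here is an excerpt of the "Subject" column: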
Subject
1 Question about the program
2 Technical issue with the site
And the "Description" column:
Description \
1 An HTML only email was received and a rough conversion is below.
Please refer to the Emails related list for the HTML contents of the
message. Question about the program Hello Hello I was wondering if there
is going to be a product review coming up soon?
2 An HTML only email was received and a rough conversion is below.
Please refer to the Emails related list for the HTML contents of the
message. Technical issue with the site Reviews I received emails stating
that I need to rewrite two of my reviews
For example, in row 1, I want to split the "Description" value on "Question about the program" and capture only the text after that string.
I tried
df['Description'] = df.apply(lambda x: x['Description'].split(x['Subject'], 1), axis=1)['Description']
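but with no luck, and I get the error "TypeError: ('must be str or None, not float')" on indexes where the description does not contain the subject. How can I handle rows that do not contain that exact text while still splitting the ones that do?
Any help would be greatly appreciated. Thank you.
I also tried the suggested answer, but received this error: IndexError: ('list index out of range', 'occurred at index 1')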
Answer 0 (score: 3)
You need to split the string in df['Description'] on the specific value in Subject and take the part that follows the split.
df.apply(lambda x: x['Description'].split(x['Subject'])[1], axis=1)
Output:
0 Hello Hello I was wondering if there is going...
1 Reviews I received emails stating that I need...
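Note that indexing with [1] raises the IndexError from the question whenever the subject text is not found, and the TypeError most likely comes from rows where Subject is NaN (a float). Below is a minimal sketch of one way to handle both cases, assuming the desired behavior is to leave such rows unchanged. The helper name text_after_subject and the sample rows are hypothetical, made up for illustration; str.partition is swapped in for split because it reports whether the separator was found.

```python
import pandas as pd


def text_after_subject(row):
    """Return the text following the subject, or the description unchanged."""
    description = row["Description"]
    subject = row["Subject"]
    # NaN is a float, so skip rows where either value is not a non-empty string
    if not isinstance(description, str) or not isinstance(subject, str) or not subject:
        return description
    # partition returns (before, separator, after); separator is "" when not found
    _, found, after = description.partition(subject)
    return after.strip() if found else description


# Hypothetical sample data, shortened from the question
df = pd.DataFrame({
    "Subject": [
        "Question about the program",
        "Technical issue with the site",
        "Missing subject",
        float("nan"),
    ],
    "Description": [
        "message. Question about the program Hello Hello I was wondering",
        "message. Technical issue with the site Reviews I received emails",
        "No subject text in this one",
        "Subject is NaN here",
    ],
})

df["Description"] = df.apply(text_after_subject, axis=1)
print(df["Description"])
```

Rows with a missing or unmatched subject keep their original description. If you would rather blank them out, return None in those branches instead.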