I have a DataFrame with the following column:
description
0 1221 <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch...
1 1522 <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p align="center">Ê</p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p>Ê</p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committe...
I want to clean this job description column, keeping only the text and removing the HTML tags.
To do that I wrote a mapper function, shown below:
def html_parsing(x):
    """ This function takes the input text and cleans the HTML tags from it
    """
    from bs4 import BeautifulSoup
    textcleaned=''
    #if row['desc'] is not None:
    souptext=BeautifulSoup(x)
    p_tags=souptext.find_all('p')
    for p in p_tags:
        if p.string:
            textcleaned+=p.string
    #print textcleaned
    return text_cleaned
Then I create a new column and pass this function to map:
job_description["cleaned_jd"]=map(html_parsing,job_description["description"])
But the new column ends up holding a map object instead of the cleaned text:
description cleaned_jd
0 1221 <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch... <map object at 0x1127a5c88>
Any idea what is going wrong? Also, is there a simpler way to do this without BeautifulSoup?
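Answer 0 (score: 0)
Well, it looks like there is a naming problem in your def: the function builds textcleaned but returns text_cleaned. I had to tidy up that variable name.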
description =['<p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch...',
'<p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p align="center">Ê</p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p>Ê</p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committe...']
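from bs4 import BeautifulSoup

def html_parsing(x):
    """ This function takes the input text and cleans the HTML tags from it
    """
    text_cleaned=''
    # Name the parser explicitly; BeautifulSoup(x) alone warns and uses whichever parser is installed
    souptext=BeautifulSoup(x, 'html.parser')
    p_tags=souptext.find_all('p')
    for p in p_tags:
        # p.string is None when the <p> holds more than one child node, so those paragraphs are skipped
        if p.string:
            text_cleaned+=p.string
    return text_cleaned

# map() is lazy in Python 3, so wrap it in list() to get the cleaned strings
print(list(map(html_parsing,description)))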
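The map object in your new column most likely comes from Python 3, where the built-in map returns a lazy iterator rather than a list. Here is a minimal sketch of that behaviour. The shout function and the rows list are hypothetical stand-ins for html_parsing and your column, so the snippet runs without any third-party package.

```python
def shout(text):
    # Hypothetical stand-in for html_parsing; any one-argument function behaves the same way.
    return text.upper()

rows = ["first row", "second row"]  # hypothetical stand-in for the description column

lazy = map(shout, rows)             # a map object: nothing has been computed yet
is_list = isinstance(lazy, list)    # False, it is an iterator, not a list

materialized = list(lazy)           # consuming the iterator runs shout on every element
print(materialized)
```

With a real DataFrame, either wrap the call in list(...) before assigning it, or skip the built-in map and call the Series method instead, for example job_description["description"].apply(html_parsing).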
My suggestion is to follow the earlier comment and use soup.text:
[section.text for section in map(BeautifulSoup, description)]
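As for doing this without BeautifulSoup, one option is the standard library's html.parser module. The following is only a sketch under stated assumptions: TagStripper and strip_tags are hypothetical names introduced here, the sample string is a shortened piece of your data, and malformed markup may be handled differently than BeautifulSoup would handle it.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.chunks)


# Shortened sample based on the description column in the question.
sample = ('<p align="center"><strong><u>Property and Casualty Sales Agent </u></strong></p>'
          '<p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong>? </strong></p>')
cleaned = strip_tags(sample)
print(cleaned)
```

Unlike the p.string loop, this keeps the text of nested tags as well. It should also work on a column through job_description["description"].apply(strip_tags), assuming every value is a string.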