Python map function to clean HTML tags

Time: 2017-05-22 18:09:58

Tags: python beautifulsoup

I have a data frame with the following column:
 description
0   1221    <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch...
1   1522    <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p align="center">Ê</p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p>Ê</p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committe...

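I want to clean this job description column, keeping only the text and removing the HTML tags.

To do this, I created a mapper function, as follows: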

def html_parsing(x): 
    """ This function takes the input text and cleans the HTML tags from it

    """

    from bs4 import BeautifulSoup
    textcleaned=''
    #if row['desc'] is not None: 
    souptext=BeautifulSoup(x)
    p_tags=souptext.find_all('p')
    for p in p_tags: 
        if p.string:
            textcleaned+=p.string
    #print textcleaned
    return text_cleaned

Then I create a new column and pass this map function to it.

job_description["cleaned_jd"]=map(html_parsing,job_description["description"])

But the new column ends up holding a map object instead of the cleaned text.

description cleaned_jd
0   1221    <p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch... <map object at 0x1127a5c88>

Any idea what is going wrong? Also, is there a simpler way to do this without BeautifulSoup?

1 Answer:

Answer 0 (score: 0)

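Well, it looks like there is a naming problem in your def: the function builds up textcleaned but returns text_cleaned, so I made the variable name consistent. Also, in Python 3 map returns a lazy map object rather than a list, so wrap it in list() to get the actual values.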

description =['<p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committed to extending the same unmatch...',
'<p>Ê</p><p align="center">Ê<strong><u>Property and Casualty Sales Agent </u></strong></p><p align="center">Ê</p><p><strong>WHY WORK FOR METLIFE AUTO & HOME</strong><strong><sup>¨</sup></strong><strong>? </strong></p><p>Ê</p><p><strong>If you want a career that has the reach to affect people everywhere, this is the place to be. At MetLife Auto & Home, weÕre experts in providing products and services that allow our customers to enjoy life and build safety nets they can count on. WeÕre committe...']

from bs4 import BeautifulSoup


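def html_parsing(x):
    """Return the text of the <p> tags in an HTML string, with the markup removed."""
    text_cleaned = ''
    # Name the parser explicitly so bs4 does not have to guess one
    souptext = BeautifulSoup(x, 'html.parser')
    p_tags = souptext.find_all('p')
    for p in p_tags:
        # p.string is None when a <p> has more than one child node,
        # so such paragraphs are skipped here
        if p.string:
            text_cleaned += p.string
    return text_cleaned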


print(list(map(html_parsing, description)))
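To make the map object issue concrete, here is a minimal sketch that runs without bs4 or pandas. The shout function is a hypothetical stand-in for html_parsing, and the pandas line in the final comment is an assumption about how the same idea would look on the data frame from the question, not something tested against that data.

```python
# Hypothetical stand-in for html_parsing, so the sketch needs no third-party packages.
def shout(text):
    return text.upper()


rows = ["first job", "second job"]

# In Python 3, map returns a lazy iterator; nothing has been computed yet.
lazy = map(shout, rows)
print(lazy)  # prints the map object's repr, not the values

# list() consumes the iterator and produces the actual results.
cleaned = list(map(shout, rows))
print(cleaned)

# Assumption: on a pandas DataFrame the element-wise equivalent would be
#   job_description["cleaned_jd"] = job_description["description"].map(html_parsing)
# which uses Series.map instead of the built-in map.
```

Either wrapping the built-in map in list() or using the Series method should give one cleaned string per row instead of a single map object.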

My suggestion, following the earlier comment, is to use soup.text:

[section.text for section in map(BeautifulSoup, description)]
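As for the question about doing this without BeautifulSoup, one possible approach is the standard library's html.parser module. The sketch below is only an illustration: TagStripper and strip_tags are hypothetical names introduced here, and the sample string is a shortened version of the data in the question.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = ('<p align="center"><strong><u>Property and Casualty Sales Agent </u>'
          '</strong></p><p>WHY WORK HERE?</p>')
print(strip_tags(sample))
```

This keeps every text node, including the contents of script or style elements if the input has any, so BeautifulSoup is probably still the more forgiving choice for messy markup.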