Remove extra characters from a string using Python string functions

Time: 2016-06-01 07:00:21

Tags: python string python-2.7 python-3.x beautifulsoup

Below is the HTML from which I want to extract the location information.

<div class="location">
    <div class="listing-location">Location</div>
    <div class="location-areas">
    <span class="location">Al Bayan</span>
    ‪,‪
    <span class="location">Nepal</span>
    </div>
    <div class="area-description"> 3.3 km from Mall of the Emirates </div>
    </div>

The Python BeautifulSoup4 code I am using is:

    try:
        title = soup.find('span', {'id': 'listing-title-wrap'})
        title_result = str(title.get_text().strip())
        print "Title: ", title_result
    except StandardError as e:
        title_result = "Error was {0}".format(e)
        print title_result

Output:

"Al Bayanأ¢â‚¬آھ,أ¢â‚¬آھ

                            Nepal"

How can I convert this into the following format?

['Al Bayan', 'Nepal']

What should the second line of the code be to produce this output?

3 Answers:

Answer 0 (score: 1)

You are reading the wrong element. Read only the span elements whose class is location:

from bs4 import BeautifulSoup

# html holds the markup shown in the question
soup = BeautifulSoup(html, "html.parser")
locList = [loc.text for loc in soup.find_all("span", {"class": "location"})]
print(locList)

This prints what you want:

['Al Bayan', 'Nepal']
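
For readers without BeautifulSoup installed, the same idea (collect only the text inside span elements with class location) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the method used in this answer. The class name LocationSpanParser and the function extract_locations are hypothetical, and the sample assumes that the stray characters around the comma in the question's markup are U+202A (LEFT-TO-RIGHT EMBEDDING).

```python
from html.parser import HTMLParser

# Markup adapted from the question. The comma between the spans is assumed
# to be wrapped in U+202A (LEFT-TO-RIGHT EMBEDDING) characters.
HTML = """
<div class="location">
    <div class="listing-location">Location</div>
    <div class="location-areas">
    <span class="location">Al Bayan</span>
    \u202a,\u202a
    <span class="location">Nepal</span>
    </div>
    <div class="area-description"> 3.3 km from Mall of the Emirates </div>
</div>
"""


class LocationSpanParser(HTMLParser):
    """Hypothetical helper: collects the text inside <span class="location">."""

    def __init__(self):
        super().__init__()
        self.locations = []
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        # Only spans whose class attribute is exactly "location" count;
        # the outer <div class="location"> is ignored.
        if tag == "span" and dict(attrs).get("class") == "location":
            self._in_span = True
            self.locations.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        # Text outside the target spans (including the comma) is skipped.
        if self._in_span:
            self.locations[-1] += data


def extract_locations(html):
    parser = LocationSpanParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.locations]


print(extract_locations(HTML))  # ['Al Bayan', 'Nepal']
```

Because the comma and the invisible characters sit between the spans rather than inside them, selecting the spans directly avoids any string cleanup.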

Answer 1 (score: 0)

There is a one-line solution. Let a be your string.

In [38]: [i.replace("  ","") for i in filter(None,(a.decode('unicode_escape').encode('ascii','ignore')).split('\n'))]
Out[38]: ['Al Bayan,', 'Nepal']
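
The one-liner above calls decode on a str, which only works in Python 2, and as Out[38] shows it leaves the trailing comma on the first item. A minimal Python 3 sketch of the same approach (drop non-ASCII characters, then split on line breaks) might look like the following. The function name split_locations and the sample string are assumptions for illustration, and the sample assumes the stray characters are U+202A.

```python
def split_locations(text):
    # Drop every non-ASCII character. This removes U+202A and similar marks,
    # but it would also remove legitimate accented letters.
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # Split into lines, trim whitespace and stray commas, skip empty lines.
    parts = (line.strip(" \t,") for line in ascii_only.splitlines())
    return [part for part in parts if part]


# Assumed shape of the scraped text: two names separated by a comma
# wrapped in U+202A characters, followed by blank space and a newline.
sample = "Al Bayan\u202a,\u202a\n\n                            Nepal"
print(split_locations(sample))  # ['Al Bayan', 'Nepal']
```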

Answer 2 (score: 0)

You can use a regular expression to keep only letters and spaces:

>>> import re
>>> re.findall('[A-Za-z ]+', area_result)
['Al Bayan', ' Nepal']
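
Because the character class above includes the space, a match can begin with whitespace, which is why ' Nepal' keeps its leading space. One possible variant, sketched here with a hypothetical function name find_names and an assumed sample string, requires each match to start and end with a letter and allows only single interior spaces:

```python
import re


def find_names(text):
    # Match runs of ASCII letters joined by single spaces, so leading and
    # trailing whitespace is never part of a match.
    return re.findall(r"[A-Za-z]+(?: [A-Za-z]+)*", text)


# Assumed shape of the scraped text, with U+202A around the comma.
sample = "Al Bayan\u202a,\u202a\n\n                            Nepal"
print(find_names(sample))  # ['Al Bayan', 'Nepal']
```

Like the original pattern, this only handles ASCII letters; names containing digits, hyphens, or accented characters would need a wider character class.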

Hope it helps.