Question

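I'm trying to scrape some data from a web page. The text inside the tag contains newlines and <br/> tags. I only want the phone number at the start of the tag. Can you suggest how to get that number?

Here is the HTML: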

<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>

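Is there a way in BeautifulSoup to get the text inside a tag, but only the text that is not surrounded by other tags? Second thing: how do I get rid of the newlines in the text and the HTML line breaks?

I'm using BS4.

The output should be: '+421 48/471 78 14'

Do you have any ideas? Thanks.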

Answer 1

html="""
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


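from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

# contents[0] is the first child of <td>: the text node before <br />
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14

# next_element is whatever was parsed immediately after the <td> start tag
print(soup.find("td").next_element.strip())
# +421 48/471 78 14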

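soup.find("td").contents[0].strip() takes the first element of the td tag's contents and uses str.strip() to remove the surrounding whitespace, including all the \n newlines.

From the documentation for next_element:

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards.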
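
For the more general "only the text that is not wrapped in another tag" part of the question, one possible sketch is to ask find_all for strings that are direct children of the td. This is an illustration rather than part of the original answer; it assumes a bs4 version recent enough to accept the string= argument (older releases call it text=), and the variable name direct_text is made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# recursive=False limits the search to direct children of <td>,
# so the text inside <em> is never considered.
# Whitespace-only strings (the gaps between tags) are filtered out.
direct_text = [
    s.strip()
    for s in td.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_text)
```

Unlike contents[0], this does not rely on the phone number being the very first child of the tag.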

Answer 2

Does this work for you?

>>> from bs4 import BeautifulSoup
>>> html = "<td>   +421 48/471 78 14    <br /><em>(bowling)</em></td>"
>>> html = html.replace("\n", "")  # get rid of newlines (a no-op for this one-line sample)
>>> soup = BeautifulSoup(html, "html.parser")
>>> for item in soup.td.children:
...     phone = item  # first item is the phone number
...     break
...
>>> phone
'   +421 48/471 78 14    '
>>> phone.strip()
'+421 48/471 78 14'
>>>
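
If the number itself could be broken across lines, strip() alone would leave the inner newline in place. A minimal sketch of one way to handle that, assuming a hypothetical variant of the markup with a newline inside the number (this input is invented for illustration and is not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical input: same structure, but with a newline inside the number
html = "<td>\n    +421 48/471\n 78 14\n\n    <br /><em>(bowling)</em></td>"
soup = BeautifulSoup(html, "html.parser")

# next() takes the first child without needing a for/break loop
first = next(soup.td.children)

# split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines); joining with a single space collapses each run into one space
phone = " ".join(first.split())
print(phone)
```

This also makes the up-front replace("\n", "") on the raw HTML unnecessary.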

Answer 3

Another approach is to use the decompose() method to get rid of the tag (it removes a tag from the tree, then completely destroys it and its contents).

from bs4 import BeautifulSoup

string = '''
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
'''

soup = BeautifulSoup(string, 'html.parser')
soup.select_one('em').decompose()  # decompose() returns None, so there is nothing to assign

phone = soup.select_one('td').text.strip()
print(phone)

Output:

+421 48/471 78 14
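
If modifying the tree is undesirable, a non-destructive alternative might be the stripped_strings generator, which yields every string under a tag with surrounding whitespace removed and whitespace-only strings skipped. The sketch below is an addition rather than part of the answer above, and it assumes the phone number is the first non-empty string inside the td:

```python
from bs4 import BeautifulSoup

string = '''
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
'''

soup = BeautifulSoup(string, 'html.parser')
td = soup.select_one('td')

# stripped_strings is lazy, so next() stops at the first non-empty string
phone = next(td.stripped_strings)
print(phone)

# The <em> tag is still in the tree afterwards
remaining = list(td.stripped_strings)
print(remaining)
```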
