Need to scrape text with BeautifulSoup

Date: 2018-06-21 17:38:57

Tags: python html

Can anyone help me use BeautifulSoup to scrape only the text below from this code?

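"Disappointed coach Bert van Marwijk said Australia have to find the last part of the puzzle if they are to stay in the World Cup after a 1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit a VAR-assisted penalty to earn the Socceroos first point in Russia after Christian Eriksens opener, giving Australia"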

<a href="website" target="_blank" rel="nofollow" onmouseover="ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> Disappointed coach Bert van Marwijk said Australia have to find the last part of the puzzle if they are to stay in the World Cup after a 1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit a VAR-assisted penalty to earn the Socceroos first point in Russia after Christian Eriksens opener, giving Australia []')" ;="" onmouseout="hideddrivetip()">Australias Van Marwijk says last part of puzzle missing at World Cup</a>

2 answers:

Answer 0 (score: 0)

from bs4 import BeautifulSoup

html = """
<a href="website" 
    target="_blank" 
    rel="nofollow" 
    onmouseover="ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> Disappointed coach Bert van Marwijk said Australia have to find the last part of the puzzle if they are to stay in the World Cup after a 1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit a VAR-assisted penalty to earn the Socceroos first point in Russia after Christian Eriksens opener, giving Australia []')" ;="" 
    onmouseout="hideddrivetip()">
    Australias Van Marwijk says last part of puzzle missing at World Cup
</a>
"""

soup = BeautifulSoup(html, 'lxml')

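for a in soup.find_all('a'):
    # Skip the 43-character prefix "ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> "
    # and the 4-character suffix "[]')"; these offsets only fit this exact markup.
    attr_text = a.attrs['onmouseover'][43:-4]
    print(attr_text + a.text)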

Output:

Disappointed coach Bert van Marwijk said Australia have to find the
last part of the puzzle if they are to stay in the World Cup after a 
1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit 
a VAR-assisted penalty to earn the Socceroos first point in Russia 
after Christian Eriksens opener, giving Australia Australias Van 
Marwijk says last part of puzzle missing at World Cup
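
The fixed offsets in the slice above break as soon as the date or the markup changes length. A minimal sketch of an alternative, assuming every tooltip keeps the shape `ddrivetip('<em>date</em> <br/> summary []')`: split on the `<br/>` marker and trim the known suffixes instead. The helper name `summary_from_anchor` and the shortened sample markup are illustrative, not part of the original question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the anchor in the question (assumed to have the same shape).
html = (
    '<a href="website" '
    "onmouseover=\"ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> "
    "Disappointed coach said something. []')\" "
    'onmouseout="hideddrivetip()">Headline text</a>'
)


def summary_from_anchor(a):
    """Return the summary text stored in an anchor's onmouseover attribute."""
    raw = a.get("onmouseover", "")
    # Keep only what follows the <br/> marker; empty string if the marker is absent.
    _, _, tail = raw.partition("<br/>")
    tail = tail.strip()
    # Drop the closing "')" of the JavaScript call, then the trailing "[]" placeholder.
    for suffix in ("')", "[]"):
        if tail.endswith(suffix):
            tail = tail[: -len(suffix)].rstrip()
    return tail


soup = BeautifulSoup(html, "html.parser")
summaries = [summary_from_anchor(a) for a in soup.find_all("a")]
print(summaries[0])
```

Using `a.get(...)` rather than `a.attrs[...]` means anchors without an `onmouseover` attribute yield an empty string instead of raising `KeyError`.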

Answer 1 (score: 0)

You can use a.attrs['onmouseover']

For example:

from bs4 import BeautifulSoup
import re
s = """<a href="website" target="_blank" rel="nofollow" onmouseover="ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> Disappointed coach Bert van Marwijk said Australia have to find the last part of the puzzle if they are to stay in the World Cup after a 1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit a VAR-assisted penalty to earn the Socceroos first point in Russia after Christian Eriksens opener, giving Australia []')" ;="" onmouseout="hideddrivetip()">Australias Van Marwijk says last part of puzzle missing at World Cup</a>"""
soup = BeautifulSoup(s, "html.parser")
val = soup.a.attrs['onmouseover']
m = re.search(r"\((.*?)\)", val)  # raw string, so "\(" is not treated as a string escape
if m:
    print(m.group())

Output:

('<em>Thu, 21 Jun 2018</em> <br/> Disappointed coach Bert van Marwijk said Australia have to find the last part of the puzzle if they are to stay in the World Cup after a 1-1 draw with Denmark on Thursday. Australia captain Mile Jedinak hit a VAR-assisted penalty to earn the Socceroos first point in Russia after Christian Eriksens opener, giving Australia []')
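
The match above still carries the parentheses, the quotes, the `<em>` date and the trailing `[]`, and the non-greedy pattern would stop early if the summary itself contained a `)`. One possible refinement, sketched under the assumption that the attribute always has the form `ddrivetip('...')`: capture the argument with a greedy group, then parse that fragment with BeautifulSoup to separate the date from the summary. The function name `parse_tooltip` and the sample string are hypothetical.

```python
import re
from bs4 import BeautifulSoup


def parse_tooltip(val):
    """Split a ddrivetip('...') attribute value into (date, summary)."""
    # Greedy group anchored on the quotes, so a ")" inside the text does not end the match.
    m = re.search(r"ddrivetip\('(.*)'\)", val, re.DOTALL)
    if not m:
        return None, None
    fragment = BeautifulSoup(m.group(1), "html.parser")
    em = fragment.find("em")
    date = None
    if em is not None:
        date = em.get_text(strip=True)
        em.decompose()  # remove the date so it does not appear in the summary
    summary = fragment.get_text(" ", strip=True)
    # Drop the trailing "[]" placeholder if present.
    if summary.endswith("[]"):
        summary = summary[:-2].rstrip()
    return date, summary


# Shortened sample value (assumed shape, not the full text from the question).
val = "ddrivetip('<em>Thu, 21 Jun 2018</em> <br/> Disappointed coach said something (again). []')"
date, summary = parse_tooltip(val)
print(date)
print(summary)
```

Here `m.group(1)` returns only the captured argument, whereas `m.group()` returns the whole match including the surrounding parentheses.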