Question

I am trying to extract "Australia" from this markup:

<tr>
<td>City</td>
<th>Sydney</th>
</tr>
<tr>
<td>Country</td>
<th>Australia</th>
</tr>

import re
from re import findall
a = '<tr>\n<td>Country</td>\n<th>Australia</th>\n</tr>'
country = re.findall(r'<tr><td>Country</td><th>(.*?)</th></tr>',a)
print country

result: []

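This is the HTML. I tried using import re and from re import findall to grab "Australia".

I expected the result to be Australia, but I get [] instead.

I would rather not use BeautifulSoup. Thanks.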

Answer 1

You just missed the newlines (\n) in your regex:

pattern = '<tr>\\n<td>Country</td>\\n<th>(.*?)</th>\\n</tr>'

This regex has been tested.
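To make the fix concrete, here is a minimal sketch that runs the question's original pattern and the corrected one against the same string. The \s* variant is an assumption added for illustration, not part of the answer: it tolerates any amount of whitespace between the tags, including none.

```python
import re

# The question's input: each tag sits on its own line.
a = '<tr>\n<td>Country</td>\n<th>Australia</th>\n</tr>'

# Original attempt: no newlines in the pattern, so nothing matches.
original = re.findall(r'<tr><td>Country</td><th>(.*?)</th></tr>', a)

# Fix from this answer: spell out the newline between the tags.
fixed = re.findall(r'<tr>\n<td>Country</td>\n<th>(.*?)</th>\n</tr>', a)

# Assumed variant (not from the answer): \s* accepts any whitespace,
# or none at all, between the tags.
flexible = re.findall(r'<tr>\s*<td>Country</td>\s*<th>(.*?)</th>\s*</tr>', a)

print(original)  # []
print(fixed)     # ['Australia']
print(flexible)  # ['Australia']
```

The whitespace-tolerant form is probably the safer choice if the real page's indentation differs from the sample.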

Answer 2

If for some reason you don't want to use BeautifulSoup, you can use re.findall to look specifically for the th tag, as shown below:

>>> import re
>>> html = '<tr>\n<td>Country</td>\n<th>Australia</th>\n</tr>'
>>> country = re.findall(r'<th>(.*?)</th>', html)[0]
>>> country
'Australia'
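One caveat worth sketching: on the full table from the question, which also has a City row, the first th is Sydney rather than Australia. The refinement below is an assumption for illustration, not part of the answer: it anchors the match on the preceding Country cell.

```python
import re

# Full table from the question, including the City row.
html = (
    '<tr>\n<td>City</td>\n<th>Sydney</th>\n</tr>\n'
    '<tr>\n<td>Country</td>\n<th>Australia</th>\n</tr>'
)

# Taking the first <th> picks up the City row instead.
first_th = re.findall(r'<th>(.*?)</th>', html)[0]

# Assumed refinement: require the preceding <td>Country</td> cell.
match = re.search(r'<td>Country</td>\s*<th>(.*?)</th>', html)
country = match.group(1) if match else None

print(first_th)  # Sydney
print(country)   # Australia
```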

Answer 3

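Not sure why you would choose regex over bs4 in this case. For future readers, this works with bs4 4.7.1. You can use the :contains pseudo-class with an adjacent sibling combinator to get the th next to the td that contains "Country".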

from bs4 import BeautifulSoup as bs

html = '''
<tr>
<td>City</td>
<th>Sydney</th>
</tr>
<tr>
<td>Country</td>
<th>Australia</th>
</tr>
'''

soup = bs(html, 'lxml')  # use 'html.parser' if lxml is not installed
countries = soup.select('td:contains(Country) + th')
if countries:
    print(countries[0].text)
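For readers who want a real parser but cannot install bs4, the same "th right after the matching td" idea can be sketched with the standard library's html.parser. The class name CellAfterLabel and its attributes are hypothetical. Unlike :contains, which matches substrings, this sketch compares the td text exactly.

```python
from html.parser import HTMLParser


class CellAfterLabel(HTMLParser):
    """Collect the text of each <th> that directly follows a <td> with the given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.current = None   # cell tag currently being read ('td' or 'th')
        self.buffer = []      # text pieces of the current cell
        self.armed = False    # True when the previous <td> matched the label
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        # Text outside td/th cells (e.g. newlines between tags) is ignored.
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        text = ''.join(self.buffer).strip()
        if tag == 'td':
            self.armed = (text == self.label)
        else:  # closing a <th>
            if self.armed:
                self.values.append(text)
            self.armed = False
        self.current = None


html = '''
<tr>
<td>City</td>
<th>Sydney</th>
</tr>
<tr>
<td>Country</td>
<th>Australia</th>
</tr>
'''

parser = CellAfterLabel('Country')
parser.feed(html)
parser.close()
print(parser.values)  # ['Australia']
```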

How to get "Australia" from the tag <th>Australia</th>
