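I'm trying to extract some information from a Wikipedia page. I'm using Beautiful Soup to load the text into Python, but stripping all the unnecessary tags with regular expressions is proving difficult.
Here is a sample of the text output from Beautiful Soup: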
[<td colspan="3">
</td>, <td valign="top" width="400">
<ul><li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>
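Ideally, I'd like to get the city (which is given in the title attribute) and the region (which appears just before the end of each line).
Any help would be greatly appreciated!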
import re

rows = soup.find_all('td')
list_rows = []
# Remove HTML tags
for row in rows:
    cells = row.find_all('li')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
    print(clean2)
Answer 0 (score: 2)
In this case you can use the .find_next_sibling() method:
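import re
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Germany'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for li in soup.select('td[width="400"] li'):
    # The first link in each list item is the city
    city = li.select_one('a')
    if city.find_next_sibling('a'):
        # The region is normally the next link in the same list item
        region = city.find_next_sibling('a').text
    else:
        # Otherwise fall back to the plain text that follows the city link
        region = city.find_next_sibling(text=True).strip()
    # Drop the surrounding parentheses and pad the city name to 30 columns
    print('{: <30}{}'.format(city.text, re.findall(r'[^()]+', region)[0]))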
Prints:
Aach                          Baden-Württemberg
Aachen                        North Rhine-Westphalia
Aalen                         Baden-Württemberg
Abenberg                      Bavaria
Abensberg                     Bavaria
Achern                        Baden-Württemberg
Achim                         Lower Saxony
Adelsheim                     Baden-Württemberg
Adenau                        Rhineland-Palatinate
Adorf                         Saxony
Ahaus                         North Rhine-Westphalia
Ahlen                         North Rhine-Westphalia
Ahrensburg                    Schleswig-Holstein
Aichach                       Bavaria
Aichtal                       Baden-Württemberg
Aken (Elbe)                   Saxony-Anhalt
Albstadt                      Baden-Württemberg
Alfeld                        Lower Saxony
Allendorf (Lumda)             Hesse
Allstedt                      Saxony-Anhalt
...and so on.
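If you want to try the same idea without hitting the network, here is a minimal sketch that runs on the sample markup from the question and collects (city, region) pairs into a list instead of printing them. A few things here are my own assumptions and not part of the original answer: the `extract_pairs` helper name, the use of the built-in `html.parser` instead of `lxml`, the `<table>` wrapper around the sample, and the made-up "Exampletown" entry, which is only there to exercise the fallback branch where the region is plain text.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, wrapped in a table so it parses
# cleanly. The "Exampletown" item is hypothetical: it illustrates a region
# given as plain text rather than as a link.
html = """
<table><tr><td valign="top" width="400">
<ul><li><a href="/wiki/Aach,_Baden-W%C3%BCrttemberg" title="Aach, Baden-Württemberg">Aach</a> (<a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>)</li>
<li><a href="/wiki/Aachen" title="Aachen">Aachen</a> (<a href="/wiki/North_Rhine-Westphalia" title="North Rhine-Westphalia">North Rhine-Westphalia</a>)</li>
<li><a href="/wiki/Exampletown">Exampletown</a> (Bavaria)</li>
</ul></td></tr></table>
"""


def extract_pairs(markup):
    """Return a list of (city, region) tuples found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for li in soup.select('td[width="400"] li'):
        city = li.select_one("a")
        if city is None:
            # Skip list items that contain no link at all
            continue
        region_link = city.find_next_sibling("a")
        if region_link is not None:
            region = region_link.get_text()
        else:
            # Fall back to the node right after the city link, e.g. " (Bavaria)"
            region = str(city.next_sibling or "")
        # Remove surrounding whitespace and parentheses
        region = region.strip().strip("()").strip()
        pairs.append((city.get_text(), region))
    return pairs


for city_name, region_name in extract_pairs(html):
    print("{: <30}{}".format(city_name, region_name))
```

Returning a list of tuples makes it easy to pass the result on to something like `csv.writer` or a DataFrame later, which is usually more convenient than parsing printed output.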
Answer 1 (score: 0)