Can't get all the names from a webpage with a messy layout

Asked: 2018-05-29 23:33:27

Tags: python python-3.x web-scraping beautifulsoup python-requests

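I've written a script to parse all the mobile home park names from a webpage. When I run it, I only get a few of them. How can I get all the names from that page? At the moment, the last name on the page is Parkway Mobile Home Park - Alabama.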

webpage link

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

url = "replace with above link"

r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
items = soup.select_one("table tr")
name = '\n'.join([item.get_text(strip=True) for item in items.select("td p strong") if "alabama" in item.text.lower()])
print(name)

This is the output I get:

Roberts Trailer Park - Alabama
Cloverleaf Trailer Park - Alabama
Longview Mobile Home Park - Alabama

2 Answers:

Answer 0 (score: 1)

The page's HTML is very poor, so this is pretty ugly, but it works:

import requests
from bs4 import BeautifulSoup

url = "http://www.chattelmortgage.net/Alabama_mobile_home_parks.html"

r = requests.get(url)
soup = BeautifulSoup(r.text,"html")
table = soup.find('table', attrs={'class':'tablebg, tableBorder'})
print([item.text.strip()  for item in table.find_all("strong") if "alabama" in item.text.lower()])
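
For what it's worth, one likely reason the script in the question returns only a few names is that select_one() gives back just the first matching element, so everything after the first table row is never searched. Below is a minimal sketch of that difference. The markup is a made-up stand-in, not copied from the real page, and the park names in it are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (not copied from it).
html = """
<table>
  <tr><td><p><strong>Park A - Alabama</strong></p></td></tr>
  <tr><td><p><strong>Park B - Alabama</strong></p></td></tr>
  <tr><td><p><strong>Some other heading</strong></p></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns only the first matching element, i.e. the first row,
# so the nested select() never sees the rows that follow it.
first_row = soup.select_one("table tr")
from_first_row = [s.get_text(strip=True) for s in first_row.select("td p strong")]

# find_all() walks the whole document and returns every <strong> tag.
from_all = [
    s.get_text(strip=True)
    for s in soup.find_all("strong")
    if "alabama" in s.get_text().lower()
]

print(from_first_row)
print(from_all)
```

On the real page the parser choice may matter as well, since broken markup can be repaired differently by each parser.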

Answer 1 (score: 1)

Try using html.parser instead of lxml. Also, try using find_all('strong') instead of select_one('table tr'). You will also need to remove the extra spaces and carriage returns.

The following code returns the expected (491) records:

import re
import requests
from bs4 import BeautifulSoup

url = "http://www.chattelmortgage.net/Alabama_mobile_home_parks.html"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('strong')
# Drop CR/LF characters, collapse runs of whitespace, then trim each name.
name = '\n'.join([re.sub(r'\s{2,}', ' ', re.sub(r'[\r\n]', '', item.text)).strip() for item in items if 'alabama' in item.text.lower()])
print(name)
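
To show the whitespace cleanup step on its own, here is a small sketch using only the standard library. The helper name normalize and the sample string are illustrative assumptions. The park name comes from the question, but the spacing and line break in it are invented:

```python
import re


def normalize(text):
    """Drop CR/LF characters, collapse runs of whitespace, and trim the ends."""
    no_breaks = re.sub(r'[\r\n]', '', text)
    return re.sub(r'\s{2,}', ' ', no_breaks).strip()


# Hypothetical raw text: a name split across lines with stray indentation.
raw = "  Parkway Mobile Home\r\n      Park - Alabama  "

print(normalize(raw))

# An alternative that does not need re: split on any whitespace and rejoin.
print(' '.join(raw.split()))
```

Both lines print the same cleaned name for this input. The two approaches differ when a line break sits directly between two words with no surrounding spaces: the regex version removes the break and glues the words together, while the split/join version keeps a single space between them.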