Unable to scrape some messily formatted fields from a webpage

Time: 2019-07-09 15:37:27

Tags: python python-3.x web-scraping

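I've written a script in Python to fetch some items from a webpage. The problem is that the content I want is not wrapped in any distinguishing tag, class, or id. I'm only interested in the address and the phone. They are all stacked together inside a single p tag, so I tried to collect them in the following way.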

site address

I've tried:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

res = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,'lxml')

address = soup.find(class_="memeberDirectory_details").find("p").text.split("Phone")[0].strip()
phone = soup.find(class_="memeberDirectory_details").find("p",text=re.compile("Phone:(.*)"))
print(address,phone)

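This produces the following (the address includes the company name, which I don't want, and the phone comes back as None):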

Assemblers Inc.

2850 West Columbus Ave.


Chicago IL 60652

UNITED STATES
None

Expected output:

2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

(773) 378-3000

2 answers:

Answer 0: (score: 1)

You can try the following code to extract the address and phone:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

address_soup = soup.select_one('.memeberDirectory_details > p')

# remove company name in <b> tag
for b in address_soup.select('b'):
    b.extract()

data = [val.strip() for val in address_soup.get_text(separator='|').split('|') if val.strip()]

address = [*takewhile(lambda k: 'Phone:' not in k, data)]
phone = [val.replace('Phone:', '').strip() for val in data if 'Phone:' in val]

print('Address:')
print('\n'.join(address))
print()

print('Phone:')
print('\n'.join(phone))

Prints:

Address:
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

Phone:
(773) 378-3000

Edit:

To find the text with a regular expression, search the text nodes directly instead of restricting the search to p tags:

import re  # needed for re.compile; `soup` is the object built above

phone = soup.find(class_="memeberDirectory_details").find(text=re.compile("Phone:(.*)"))
print(phone)

Prints:

Phone: (773) 378-3000
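
If only the number itself is wanted, the capture group in the pattern can be applied to the matched string. This is a minimal sketch, assuming the text node looks like the output above; `extract_phone` is a hypothetical helper name, not something from the original script.

```python
import re


def extract_phone(text):
    """Return whatever follows 'Phone:' in text, or None if the label is absent."""
    match = re.search(r"Phone:\s*(.*)", text)
    return match.group(1).strip() if match else None


# The input string is copied from the output shown above.
print(extract_phone("Phone: (773) 378-3000"))
```

Returning None for a missing label keeps the caller from printing a stray match object or an empty string.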

Answer 1: (score: 0)

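Instead of finding the <p> tag, splitting on it, and then searching for each individual field, grab the <p> once and store all of the <br>-separated items in a list rather than doing a find inside it. If the layout of the list does not change from page to page, you can always pop the first element (the company name). You could also split the address at the first occurrence of a digit, but that will go wrong for any company whose name contains a number.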
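
The list-based idea might look roughly like the sketch below. The HTML string is an assumption modeled on the output shown in the question, not the real page source, and the code assumes the company name is always the first entry.

```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the question's output; the real page may differ.
html = """
<div class="memeberDirectory_details">
  <p><b>Assemblers Inc.</b><br>
  2850 West Columbus Ave.<br>
  Chicago IL 60652<br>
  UNITED STATES<br>
  Phone: (773) 378-3000</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.select_one(".memeberDirectory_details > p")

# One list entry per non-empty chunk of text between the <br> tags.
fields = list(p.stripped_strings)

# Drop the company name, assumed to always come first.
fields.pop(0)

address = [f for f in fields if not f.startswith("Phone:")]
phone = [f.replace("Phone:", "").strip() for f in fields if f.startswith("Phone:")]

print("\n".join(address))
print(phone[0])
```

Popping by position avoids the digit-based split, so a company name containing a number does not matter, but it breaks if a page omits the name or adds a line before it.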