Why is the data I want to extract from a web page not in my soup page?

Asked: 2018-10-28 16:25:57

Tags: python web-scraping beautifulsoup

The web page I am attempting to extract data from.

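Picture of the data I am trying to extract.

I want to extract the Test Code, CPT Code(s), Preferred Specimen(s), Minimum Volume, Transport Container, and Transport Temperature.

When I print the soup page, it does not contain the data I need, so I cannot extract it. This is how I print the soup page: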

soup_page = soup(html_page, "html.parser")
result = soup_page
print(result)

However, when I inspect the element of interest on the web page, I can see that the HTML does contain the data. Here is some of that HTML:

<h4>Test Code</h4><p>36127</p><span class="LegacyOrder" style="word-wrap:break-word;visibility:hidden"></span><input id="primaryTestCode" value="36127" type="hidden"><input id="searchStringValue" value="36127" type="hidden"><span class="LisTranslatableVerbiage" style="word-wrap:break-word;visibility:hidden"></span>

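2 Answers:

Answer 0 (score: 0)

To reach the page you want, you first have to select a test region from the drop-down menu and then press the "Go" button. To do this from Python you will need to use selenium: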

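import re, time
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def get_page_data(_source):
  headers = ['Test Code', 'CPT Code(s)', 'Preferred Specimen(s)', 'Minimum Volume', 'Transport Container', 'Transport Temperature']
  # Non-empty text of every tag under #labDetail whose name matches the regex
  d1 = list(filter(None, [i.text for i in _source.find('div', {'id':'labDetail'}).find_all(re.compile('h4|p'))]))
  # Map each recognised header to the text that follows it (keys keep any trailing space)
  return {d1[i]:d1[i+1] for i in range(len(d1)-1) if d1[i].rstrip() in headers}

# Selenium 4 API; Selenium 3 used webdriver.Chrome('/path/to/chromedriver') and find_element_by_xpath
d = webdriver.Chrome(service=Service('/path/to/chromedriver'))
d.get('https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127')
_d = soup(d.page_source, 'html.parser')
# Visible text of each lab option in the drop-down
_options = [i.text for i in _d.find_all('option', {'value':re.compile('[A-Z]+')})]
_lab_regions = {}
for _op in _options:
  d.find_element(By.XPATH, f"//select[@id='labs']/option[text()='{_op}']").click()
  try:
    d.find_element(By.XPATH, "//button[@class='confirm go']").click()
  except WebDriverException:
    # Fall back to the 'confirm update' button when 'confirm go' cannot be clicked
    d.find_element(By.XPATH, "//button[@class='confirm update']").click()
  _lab_regions[_op] = get_page_data(soup(d.page_source, 'html.parser'))
  time.sleep(2)


print(_lab_regions)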

Output:

{'AL - Solstas Birmingham 2732 7th Ave South (866)281-9838 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'AZ - Tempe 1255 W Washington St (800)766-6721 (QSO)': {}, 'CA - Quest Diagnostics Infectious Disease, Inc. 33608 Ortega Hwy (800) 445-4032 (FDX)': {}, 'CA - Sacramento 3714 Northgate Blvd (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CA - San Jose 967 Mabury Rd (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CA - San Juan Capistrano 33608 Ortega Hwy (800) 642-4657 (SJC)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Temperature ': 'Room temperature'}, 'CA - Valencia 27027 Tourney Road (800) 421-7110 (SLI)': {}, 'CA - West Hills 8401 Fallbrook Ave (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CO - Midwest 695 S Broadway (866) 697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'CT - Wallingford 3 Sterling Dr (866)697-8378 (NEL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 
'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'FL - Miramar 10200 Commerce Pkwy (866)697-8378 (TMP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'FL - Tampa 4225 E Fowler Ave (866)697-8378 (TMP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'GA - Tucker 1777 Montreal Cir (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'IL - Wood Dale 1355 Mittel Blvd (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'IN - Indianapolis 2560 North Shadeland Avenue (317)803-1010 (MJV)': {'Test Code': '36127', 'CPT Code(s) ': '84443'}, 'KS - Lenexa 10101 Renner Blvd (866)697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MA - Marlborough 200 Forest Street (866) 697-8378 (NEL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MD - East Region - Baltimore, 1901 Sulphur Spring Rd (866) 697-8378) (PHP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred 
Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MD - Baltimore 1901 Sulphur Spring Rd (866)697-8378 (QBA)': {'Test Code': '36127X', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL (0.7 mL minimum) serumPlasma is no longer acceptable', 'Minimum Volume ': '0.7 mL'}, 'MI - Troy 1947 Technology Drive (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'MO - Maryland Heights 11636 Administration Dr (866)697-8378 (STL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NC - Greensboro 4380 Federal Drive (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NJ - Teterboro 1 Malcolm Ave (866)697-8378 (QTE)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NM - Albuquerque 5601 Office Blvd ABQ (866) 697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'NV - Las Vegas 4230 Burnham Ave (866)697-8378 (MET)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube 
(SST®)', 'Transport Temperature ': 'Room temperature'}, 'NY - Melville 50 Republic Rd Suite 200 - (516) 677-3800 (QTE)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'OH - Cincinnati 6700 Steger Dr (866)697-8378 (WDL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'OH - Dayton 2308 Sandridge Dr. (937) 297 - 8305 (DAP)': {'Test Code': '36127'}, 'OK - Oklahoma City 225 NE 97th Street (800)891-2917 (DLO)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': 'PATIENT PREPARATION:SPECIMEN COLLECTION AFTER FLUORESCEIN DYE ANGIOGRAPHY SHOULDBE DELAYED FOR AT LEAST 3 DAYS. FOR PATIENTS ONHEMODIALYSIS, SPECIMEN COLLECTION SHOULD BE DELAYED FOR 2WEEKS. ACCORDING TO THE ASSAY MANUFACTURER SIEMENS:"SAMPLES CONTAINING FLUORESCEIN CAN PRODUCE FALSELYDEPRESSED VALUES WHEN TESTED WITH THE ADVIA CENTAUR TSH3ULTRA ASSAY."1 ML SERUMINSTRUCTIONS:THIS ASSAY SHOULD ONLY BE ORDERED ON PATIENTS 1 YEAR OF AGEOR OLDER. 
ORDERS ON PATIENTS YOUNGER THAN 1 YEAR WILL HAVEA TSH ONLY PERFORMED.', 'Minimum Volume ': '0.7 ML', 'Transport Container ': 'SERUM SEPARATOR TUBE (SST)', 'Transport Temperature ': 'ROOM TEMPERATURE'}, 'OR - Portland 6600 SW Hampton Street (800)222-7941 (SEA)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'PA - Erie 1526 Peach St (814)461-2400 (QER)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': 'Preferred Specimen Volume: 1.0 mlSpecimen Type SERUMSpecimen State Room temperaturePatient preparation: Specimen collection after fluoresceindye angiography should be delayed for at least 3 days. Forpatients on hemodialysis, specimen collection should bedelayed for 2 weeks. According to the assay manufacturerSiemens: Samples containing fluorescein can produce falselydepressed values when tested with the ADVIA Centaur TSH3Ultra Assay.STABILITYSerum:Room temperature: 7 daysRefrigerated: 7 daysFrozen: 28 days', 'Minimum Volume ': '0.7 ml', 'Transport Container ': 'Serum Separator'}, 'PA - Horsham 900 Business Center Dr (866)697-8378 (PHP)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'PA - Pittsburgh 875 Greentree Rd (866)697-8378 (QPT)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TN - Knoxville, 501 19th St, Trustee Towers – Ste 300 & 301 (866)MY-QUEST (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 
'Transport Temperature ': 'Room temperature'}, 'TN - Nashville 525 Mainstream Dr (866)697-8378 (SKB)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TX - Houston 5850 Rogerdale Road (866)697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'TX - Irving 4770 Regent Blvd (866)697-8378 (DAL)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}, 'VA - Chantilly 14225 Newbrook Dr (703)802-6900 (AMD)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Temperature ': 'Room temperature'}, 'WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)': {'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}}

Specifically, for the "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)" lab:

print(_lab_regions["WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)"])

Output:

{'Test Code': '36127', 'CPT Code(s) ': '84443', 'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL', 'Transport Container ': 'Serum Separator Tube (SST®)', 'Transport Temperature ': 'Room temperature'}
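
The keys in the output above keep the trailing space from the page's headings (for example 'CPT Code(s) '), and some labs return only part of the fields or an empty dictionary. A minimal post-processing sketch, assuming each record has the shape printed above; `normalize_record` and `missing_fields` are hypothetical helper names, not part of the code in this answer:

```python
HEADERS = ['Test Code', 'CPT Code(s)', 'Preferred Specimen(s)',
           'Minimum Volume', 'Transport Container', 'Transport Temperature']

def normalize_record(record):
    """Return a copy of one lab's record with whitespace stripped from keys and values."""
    return {key.strip(): value.strip() for key, value in record.items()}

def missing_fields(record, headers=HEADERS):
    """List the expected headers that are absent from a normalized record."""
    return [h for h in headers if h not in record]

# Two records copied from the output above (the Seattle and Dayton labs).
seattle = {'Test Code': '36127', 'CPT Code(s) ': '84443',
           'Preferred Specimen(s) ': '1 mL serum', 'Minimum Volume ': '0.7 mL',
           'Transport Container ': 'Serum Separator Tube (SST®)',
           'Transport Temperature ': 'Room temperature'}
dayton = {'Test Code': '36127'}

clean = normalize_record(seattle)
print(clean['CPT Code(s)'])                      # 84443
print(missing_fields(clean))                     # []
print(missing_fields(normalize_record(dayton)))  # every header except 'Test Code'
```

Running `missing_fields` over every entry of `_lab_regions` would show which labs need a closer look before the data is used.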

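Answer 1 (score: 0)

For the website to return the data, you also need to include the cookie information that specifies which lab you are requesting; in your case, SEA. This can easily be added as a requests parameter, as follows: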

from bs4 import BeautifulSoup
from operator import itemgetter
import requests

url = 'https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127'

cookies = {
    "TC11SelectedLabCode" : "SEA",
    "TC11SelectedLabName" : "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)",
}

r = requests.get(url, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
data = [el.get_text(strip=True) for el in itemgetter(6, 8, 14, 16, 20, 22)(soup.find_all(['h4', 'p']))]
print(data)

This gives you:

['36127', '84443', '1 mL serum', '0.7 mL', 'Serum Separator Tube (SST®)', 'Room temperature']
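
To request a different lab with the same cookie approach, the two cookie values can be derived from the lab's display name. The sketch below assumes the lab code is always the upper-case token in the final pair of parentheses, as it is in the lab names shown on this page; `lab_cookies` is a hypothetical helper, and whether the site accepts every such pair has not been checked here:

```python
import re

def lab_cookies(lab_name):
    """Build the cookie dict for one lab from its display name.

    Assumes the name ends with the lab code in parentheses, e.g. '... (SEA)'.
    """
    match = re.search(r'\(([A-Z]+)\)\s*$', lab_name)
    if match is None:
        raise ValueError(f'no trailing lab code in {lab_name!r}')
    return {
        'TC11SelectedLabCode': match.group(1),
        'TC11SelectedLabName': lab_name,
    }

seattle_cookies = lab_cookies('WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)')
print(seattle_cookies['TC11SelectedLabCode'])  # SEA
```

The returned dictionary can be passed as `cookies=` to `requests.get` in the same way as the hand-written one.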

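You may want to make the extraction more robust, since the index-based approach assumes that every search returns the same elements in the same order. Alternatively, you can search for the required field headings and take the element that follows each one, for example: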

from bs4 import BeautifulSoup
import requests

req_fields = ["Test Code", "CPT Code(s)", "Preferred Specimen(s)", "Minimum Volume", "Transport Container", "Transport Temperature"]
url = 'https://www.questdiagnostics.com/testcenter/TestDetail.action?tabName=OrderingInfo&ntc=36127&searchString=36127'

cookies = {
    "TC11SelectedLabCode" : "SEA",
    "TC11SelectedLabName" : "WA - Seattle 1737 Airport Way S (866)697-8378 (SEA)",
}

r = requests.get(url, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
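# Lazily yield the stripped text of every <h4> and <p> element, in document order
i_fields = (el.get_text(strip=True) for el in soup.find_all(['h4', 'p']))
# Each matching heading becomes a key; next() consumes the following element as its value
data = {field : next(i_fields) for field in i_fields if field in req_fields}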

print(data)

This gives a dictionary containing:

{'Test Code': '36127', 'CPT Code(s)': '84443', 'Preferred Specimen(s)': '1 mL serum', 'Minimum Volume': '0.7 mL', 'Transport Container': 'Serum Separator Tube (SST®)', 'Transport Temperature': 'Room temperature'}
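
The dictionary comprehension relies on `i_fields` being a single iterator: the `for` clause consumes a heading, and `next(i_fields)` then consumes the element directly after it as the value. Below is a standard-library-only sketch of that pattern on a hypothetical list of already-extracted texts (the `texts` values are illustrative, not fetched from the site). It passes a default to `next` so that a heading at the very end of the input does not raise `StopIteration`:

```python
req_fields = ["Test Code", "CPT Code(s)", "Preferred Specimen(s)"]

# Hypothetical stand-in for the texts of the <h4>/<p> elements, in document order.
texts = ["Ordering Info", "Test Code", "36127", "CPT Code(s)", "84443",
         "Some note", "Preferred Specimen(s)"]

i_fields = iter(texts)  # one shared iterator, like the generator in the answer
# The `for` clause pulls a heading; next() pulls the following item as its value.
data = {field: next(i_fields, None) for field in i_fields if field in req_fields}

print(data)
# {'Test Code': '36127', 'CPT Code(s)': '84443', 'Preferred Specimen(s)': None}
```

Because both the loop and `next` advance the same iterator, a value is never re-examined as a possible heading. This only works with an iterator or generator, not with a plain list.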