Unable to get all the names from a website's landing page

Date: 2019-06-15 08:26:07

Tags: python python-3.x web-scraping

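I've written a script in Python to get all the names of the different colleges from a webpage. The site shows only 50 names on its landing page; the rest become visible only after clicking a button named "show more members". I'd like to get all the names from that page without using any browser simulator, since I can see that the remaining names are available in the page source, inside some script tags.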

Site address: https://www.abhe.org/directory

I've tried:


The script above fetches only the first 50 names.

How can I get all the names from that webpage without using any browser simulator?

3 answers:

Answer 0 (score: 2)

Taking a different route:

import re
import requests
from bs4 import BeautifulSoup

url = r'https://www.abhe.org/directory'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')


js_data = soup.find_all('script')  # Get all script tags
js_data_2 = [i for i in js_data if len(i) > 0]  # Drop empty script tags (tags with no contents)
js_dict = {k: v for k, v in enumerate(js_data_2)}  # Create a dictionary for referencing
data = str(js_dict[10])  # Our target is key 10 (position-dependent; it may shift if the page changes)

# Clean up results
data2 = data.replace('<script>\r\n\t\tw2dc_map_markers_attrs_array.push(new w2dc_map_markers_attrs(\'e5d47824e4fcfb7ab0345a0c7faaa5d2\',','').strip()

# Split on left bracket
test1 = data2.split('[')

# Remove 'eval(' and zero-length strings
test2 = [i for i in test1 if len(i) > 0 and i != 'eval(']

# Keep only the strings that start with a number in double quotation marks
p = re.compile(r'"\d+"')
test3 = [i for i in test2 if p.match(i)]

# Take the item at index 6 of each row, which is the college name,
# and strip the double quotation marks.
college_list = sorted(row.split(',')[6].replace('"', '') for row in test3)

Output:

In [116]: college_list
Out [116]: 
['Georgia Central University',
 'Northwest Baptist Theological Seminary',
 'Steinbach Bible College',
 'Yellowstone Christian College',
...]
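
One caveat with splitting each row on commas: a name that itself contains a comma (the directory includes 'Carver Baptist Bible College, Institute and Theological Seminary') gets cut short at index 6. Below is a minimal sketch of an alternative, assuming each marker row in the script is a valid JSON array with the name at index 6, as the code above implies. SAMPLE_JS is a made-up, shortened stand-in for the real script text, and extract_names is a hypothetical helper, not part of any library.

```python
import json
import re

# Made-up, shortened stand-in for the script text (assumption, not the real page).
SAMPLE_JS = (
    "w2dc_map_markers_attrs_array.push(new w2dc_map_markers_attrs('abc', eval(["
    '["1","40.0","-75.0",false,false,"11","Cairn University"],'
    '["2","36.1","-86.7",false,false,"11",'
    '"Carver Baptist Bible College, Institute and Theological Seminary"]'
    "])));"
)

ROW_START = re.compile(r'\["\d+",')   # a row starts with a quoted number
DECODER = json.JSONDecoder()


def extract_names(js_text, name_index=6):
    """Decode every marker row as JSON and return the field at name_index."""
    names = []
    pos = 0
    while True:
        match = ROW_START.search(js_text, pos)
        if match is None:
            break
        try:
            # raw_decode parses one complete JSON value starting at the given index
            row, end = DECODER.raw_decode(js_text, match.start())
        except json.JSONDecodeError:
            pos = match.end()   # not valid JSON here; keep scanning
            continue
        if isinstance(row, list) and len(row) > name_index:
            names.append(row[name_index])
        pos = end
    return names


names = extract_names(SAMPLE_JS)
print(names)
```

Because each row is decoded as a whole, commas and escaped characters inside a name do not shift the field positions. If the real rows turn out not to be valid JSON, this sketch would skip them rather than guess.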

Answer 1 (score: 2)

You can pull out all the member names with a regular expression. You can safely reduce p to

p = re.compile(r'false,"\d+","(.*?)"')

py:

import requests, re

r = requests.get('https://www.abhe.org/directory/')
p = re.compile(r'\["\d+","[-0-9.]+","[-0-9.]+",false,false,"\d+","(.*?)"')
string = re.sub(r'#038;','', r.text)
string = re.sub(r'&#8217;',"'", string)
names = p.findall(string)
print(len(names))
print(sorted(names))
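
To see how the long and the reduced pattern compare, here is a small offline sketch. SAMPLE is invented text modelled on the row shape the patterns expect, so it only illustrates the idea and says nothing certain about the live page. It also swaps the two re.sub calls for the standard library's html.unescape, which decodes any HTML entity; note that this yields the typographic apostrophe (’) rather than the straight one used above.

```python
import html
import re

# Invented rows modelled on the shape the patterns expect (assumption).
SAMPLE = (
    '["7","33.7","-84.4",false,false,"11",'
    '"Baptist Bible College &#038; Graduate School of Theology"],'
    '["8","39.1","-84.5",false,false,"11",'
    '"God&#8217;s Bible School and College"]'
)

LONG = re.compile(r'\["\d+","[-0-9.]+","[-0-9.]+",false,false,"\d+","(.*?)"')
SHORT = re.compile(r'false,"\d+","(.*?)"')

long_names = [html.unescape(n) for n in LONG.findall(SAMPLE)]
short_names = [html.unescape(n) for n in SHORT.findall(SAMPLE)]

print(short_names)
print(long_names == short_names)
```

On this sample both patterns capture the same names. The shorter one is less specific, so it is worth checking the count against the longer pattern on the real response before relying on it.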

Answer 2 (score: 1)

Using requests and BeautifulSoup

import requests
from bs4 import BeautifulSoup

params = { "action": "w2dc_controller_request", "controller": "directory_controller", 
    "directories": "1", "paged": 1, }

link = 'https://www.abhe.org/wp-admin/admin-ajax.php'
college_name = []
count=2

while True:
    jsonData = requests.post(link, headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"}, data=params).json()
    soup = BeautifulSoup(jsonData['html'],"lxml")
    for item in soup.select("h2 > a[title]"):
        college_name.append(item.text)

    # stop once the last page of records has been reached
    if jsonData['hide_show_more_listings_button'] == 1:
        break

    params['paged'] = count
    count+=1

print(college_name)

Output:

['Alaska Bible College', 'Alaska Christian College', 'Alberta Bible College', 'All Saints Bible College', 'Allegheny Wesleyan College', 'Ambrose University', 'America Evangelical University', 'American Baptist College', 'Appalachian Bible College', 'Arlington Baptist University', 'B. H. Carroll Theological Institute', 'Baptist Bible College & Graduate School of Theology', 'Baptist University of the Americas', 'Barclay College', 'Berkeley Christian College and Seminary', 'Bethany Global University', 'Bethel College', 'Bethesda University', 'Bethlehem College and Seminary', 'Beulah Heights University', 'Biblical Life Institute', 'Boise Bible College', 'Bridges Christian College', 'Briercrest College and Seminary', 'Brookes Bible College', 'Cairn University', 'Calvary Chapel Bible College', 'Calvary University', 'Canadian Southern Baptist Seminary and College', 'Carolina Christian College', 'Carolina College of Biblical Studies', 'Carver Baptist Bible College, Institute and Theological Seminary', 'Central Christian College of the Bible', 'Central Christian University of South Carolina', 'Christ Mission College', 'Clarks Summit University', 'Clear Creek Baptist Bible College', 'College of Biblical Studies-Houston', 'Columbia Bible College', 'Columbia International University', 'Crossroads Bible College', 'Dallas Christian College', 'Davis College', 'Ecclesia College', 'Emmanuel Bible College', 'Emmaus Bible College', 'Eston College', 'Eternity Bible College', 'Ezra University', 'Faith Baptist Bible College and Theological Seminary', 'Faith Bible College', 'Faith Bible Seminary', 'Family of Faith Christian University', 'Georgia Central University', 'God’s Bible School and College', 'Grace Christian University', 'Grace College of Divinity', 'Grace Mission University', 'Guido Bible College', 'Hayfield University', 'Heartland Christian College', 'Heritage Christian University', 'Heritage College & Seminary', 'Heritage Seminary', 'Highlands College', 'Hobe Sound Bible College',
 'Hope International University', 'Horizon College & Seminary', 'Horizon University', 'Hudson Taylor University', 'Huntsville Bible College', 'In His Image Bible Institute International', 'Indian Bible College', 'Institute of Lutheran Theology', 'International Reformed University & Seminary', 'International University and Theological Seminary', 'Johnson University', 'Kansas Christian College', 'Kentucky Mountain Bible College', 'Kingswood University', 'Kuyper College', 'Lancaster Bible College | Capital Seminary & Graduate School', 'Latin American Bible Institute', 'Life Pacific College', 'Lincoln Christian University', 'Luther Rice College and Seminary', 'Manhattan Christian College', 'Master’s College & Seminary', 'Methodist Theological Seminary in America', 'Mid-South Christian College', 'Midwest University', 'Montana Bible College', 'Moody Bible Institute', 'Native American Bible College', 'Nazarene Bible College', 'New Hope Christian College', 'Northpoint Bible College', 'Northpoint Bible College Grand Rapids Campus', 'Northwest Baptist Theological Seminary', 'Oak Hills Christian College', 'Olivet University', 'Ozark Christian College', 'Pacific Bible College', 'Pacific Life Bible College', 'Pacific Rim Christian University', 'Penn View Bible Institute', 'Pillar College', 'Prairie College', 'Presbyterian Theological Seminary in America', 'Providence University College and Theological Seminary', 'Regional Christian University', 'Rio Grande Bible Institute', 'Robert E. Webber Institute for Worship Studies',
 'Rocky Mountain College: A Centre for Biblical Education', 'Rosedale Bible College', 'Saint Louis Christian College', 'Saint Photios Orthodox Theological Seminary', 'Selma University', 'Simmons College of Kentucky', 'South Florida Bible College & Theological Seminary', 'Southeastern Baptist College', 'Southeastern University', 'Southern Bible Institute & College', 'Southern Reformed College & Seminary', 'Stark College and Seminary', 'Steinbach Bible College', 'SUM Bible College and Theological Seminary', 'Summit Christian College', 'Summit Pacific College', 'Texas Baptist Institute and Seminary', 'The Institute for Global Outreach Developments Int’l', 'The King’s University', 'The Salvation Army College for Officer Training', 'Theological University of the Caribbean', 'Tri-State Bible College', 'Trinity Bible College & Graduate School', 'Trinity College of Florida', 'Tyndale University College & Seminary', 'Union Bible College', 'Universidad Pentecostal Mizpa', 'Valor Christian College', 'Vanguard College', 'Veritas College International', 'Virginia Christian University', 'Washington University of Virginia', 'Wave Leadership College', 'Welch College', 'Western Biblical Theological Seminary', 'William Jessup University', 'Williamson Christian College', 'World Mission University', 'Yellowstone Christian College']
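
The paging loop in this answer runs until the server reports that the "show more" button should be hidden. A hedged sketch of the same idea with a safety cap is shown below, so that a changed or missing flag cannot loop forever. The response keys 'html' and 'hide_show_more_listings_button' come from the answer's code; collect_pages, max_pages, and fake_fetch are hypothetical names, and fake_fetch stands in for the real POST request so the sketch runs offline.

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... and gather each page's HTML.

    Stops when the payload marks the last page, or after max_pages requests.
    """
    pages = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        pages.append(payload["html"])
        # Same stop condition as the answer's loop
        if payload.get("hide_show_more_listings_button") == 1:
            break
    return pages


def fake_fetch(page):
    """Hypothetical stand-in for the POST request: pretends there are 3 pages."""
    return {
        "html": f"<h2><a title='College {page}'>College {page}</a></h2>",
        "hide_show_more_listings_button": 1 if page == 3 else 0,
    }


pages = collect_pages(fake_fetch)
print(len(pages))
```

In real use, fetch_page would wrap the requests.post call with params['paged'] set to the page number, and each returned HTML fragment would be parsed with BeautifulSoup as in the answer.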