I am trying to extract content from <li> tags on this site: http://snowload.atcouncil.org/index.php/component/vcpsnowload/item

I want to extract the content for different cities by entering an address.
Query Date : August 04, 2017
Address : gilbert
Latitude : 33.3528264
Longitude : -111.789027
Elevation : 0 Feet
Elevation Limitation: ASCE 7* Ground Snow Load
Elevation ≤ 2,000 feet: Ground Snow Load is 0 psf
请找到我尝试提取内容的方法。
import requests
from bs4 import BeautifulSoup

page = requests.get("http://snowload.atcouncil.org/index.php/component/vcpsnowload/item")
soup = BeautifulSoup(page.content, 'html.parser')
# the query results are rendered inside a div with class "span5"
div = soup.find("div", attrs={'class': 'span5'})
print(div.text)
The problem I am facing is that it does not extract everything, only the Query Date. I also tried different parsers such as 'html.parser', 'html5lib' and 'lxml', which all produce the same result. Solutions using Selenium with Python are welcome too.
Answer 0 (score: 1)
You need to use the HTTP POST method and send the location in the form data.
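A minimal sketch of the payload; the field names come from the full script below:

data = {'optionCoordinate': '2', 'coordinate_address': 'gilbert'}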
It seems there are some characters my terminal cannot print, so I added .encode(sys.stdout.encoding, errors='replace') to the print call.
Update: from there you can get the li elements:

import requests
from bs4 import BeautifulSoup
import sys
# optionCoordinate selects the search mode; coordinate_address is the address to look up
data = {'optionCoordinate': '2', 'coordinate_address': 'gilbert'}
page = requests.post("http://snowload.atcouncil.org/index.php/component/vcpsnowload/item", data=data)
soup = BeautifulSoup(page.content,'html.parser')
div = soup.find("div",attrs={'class':'span5'})
print (div.text.encode(sys.stdout.encoding, errors='replace'))
Updated again, to write the results to CSV; the same .encode(sys.stdout.encoding, errors='replace') trick applies wherever the output encoding cannot represent a character.
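A minimal sketch of the CSV step, reusing the POST request above; the filename snow_load.csv and the two-column split are illustrative assumptions:

import csv
import requests
from bs4 import BeautifulSoup

data = {'optionCoordinate': '2', 'coordinate_address': 'gilbert'}
page = requests.post("http://snowload.atcouncil.org/index.php/component/vcpsnowload/item", data=data)
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find('div', attrs={'class': 'span5'})

# each li renders as "Label : value"; splitting once on ':' gives two columns (assumption)
with open('snow_load.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for li in div.find_all('li'):
        writer.writerow([part.strip() for part in li.text.split(':', 1)])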
Answer 1 (score: 0)
This code will get the text inside every <li></li> you are targeting on that page.
from bs4 import BeautifulSoup as BS
from requests import get
site = "http://snowload.atcouncil.org/index.php/component/vcpsnowload/item"
req = get(site)
soup = BS(req.text, 'html.parser')
# the result rows live in a ul with class "map-info" (attrs must be a dict, not a set)
ul = soup.find('ul', attrs={'class': 'map-info'})
list_items = ul.find_all('li')
for li in list_items:
    print(li.text)
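Note that a plain GET only returns the Query Date row, as the question describes; combining this loop with the POST request from Answer 0 returns the populated list. A minimal sketch under that assumption:

from bs4 import BeautifulSoup as BS
from requests import post

site = "http://snowload.atcouncil.org/index.php/component/vcpsnowload/item"
# form fields taken from Answer 0; 'gilbert' is just an example address
data = {'optionCoordinate': '2', 'coordinate_address': 'gilbert'}
req = post(site, data=data)

soup = BS(req.text, 'html.parser')
ul = soup.find('ul', attrs={'class': 'map-info'})
for li in ul.find_all('li'):
    print(li.text.strip())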
Answer 2 (score: 0)
An automated solution for extracting the content from the li tags:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

chrome_path = r"/usr/bin/chromedriver"
driver = webdriver.Chrome(chrome_path)
driver.get("http://snowload.atcouncil.org/")

# switch the form to search by address
driver.find_element_by_xpath('//*[@id="adminForm"]/fieldset/div/div[2]/div[2]/label').click()
driver.find_element_by_xpath('//*[@id="coordinate_address"]').click()

cities = ['phoenix']
for city in cities:
    print(city)
    driver.find_element_by_xpath('//*[@id="coordinate_address"]').send_keys(city)
    driver.find_element_by_xpath('//*[@id="adminForm"]/fieldset/div/div[2]/button').click()
    # re-request the results page with requests, posting the current city
    url = driver.current_url
    data = {'optionCoordinate': '2', 'coordinate_address': city}
    page = requests.post(url, data=data)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', attrs={'class': 'span5'})
    for li in div.find_all('li'):
        print(li.text)

driver.close()
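Since Selenium has already submitted the form, the second request via requests is not strictly necessary; the rendered page can be parsed directly. A sketch of that variant, assuming the same span5 markup and the driver/loop from the script above:

from bs4 import BeautifulSoup

# inside the loop, in place of the requests.post block:
# parse the page Selenium is currently displaying
soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find('div', attrs={'class': 'span5'})
for li in div.find_all('li'):
    print(li.text)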