OK, I've been at this for 6 hours now and can't figure it out. I want to use BeautifulSoup to filter data from a web page, but I can't get .contents or .get_text() to work, and I don't know where I'm going wrong or how to apply a second filter after the first pass. I can get to the fieldset tag, but I can't narrow down any further to pull out the data. Sorry if this is a simple question and I'm doing something dumb; I only started using Python yesterday and started (attempting, at least) web scraping this morning.
The whole code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from openpyxl import Workbook
import bs4 as bs
import math
book = Workbook()
sheet = book.active
i=0
#Change this value to your starting tracking number
StartingTrackingNumber=231029883
#Change this value to increase or decrease the number of tracking numbers you want to search overall
TrackingNumberCount = 4
#Number of Tracking Numbers Searched at One Time
QtySearch = 4
#TrackingNumbers=["Test","Test 2"]
for i in range(0,TrackingNumberCount):
    g=i+StartingTrackingNumber
    sheet.cell(row=i+1,column=1).value = 'RN' + str(g) + 'CA,'
TrackingNumbers = []
for col in sheet['A']:
    TrackingNumbers.append(col.value)
MaxRow = sheet.max_row
MaxIterations = math.ceil(MaxRow / QtySearch)
#print(MaxIterations)
RowCount = 0
LastTrackingThisPass = QtySearch
for RowCount in range(0,MaxIterations): #range(1,MaxRow):
    FirstTrackingThisPass = (RowCount)*QtySearch
    x = TrackingNumbers[FirstTrackingThisPass:LastTrackingThisPass]
    LastTrackingThisPass+=QtySearch
    driver = webdriver.Safari()
    driver.set_page_load_timeout(20)
    driver.get("https://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=e1s1")
    driver.find_element_by_xpath('//*[contains(@id, "trackNumbers")]').send_keys(x)
    driver.find_element_by_xpath('//*[contains(@id, "submit_button")]').send_keys(chr(13))
    driver.set_page_load_timeout(3000)
    WebDriverWait(driver,30).until(EC.presence_of_element_located((By.ID, "noResults_modal")))
    SourceCodeTest = driver.page_source
    #print(SourceCodeTest)
    Soup = bs.BeautifulSoup(SourceCodeTest, "lxml") #"html.parser"
    z = 3
    #for z in range(1,5):
    #    t = str(z)
    #    NameCheck = "trackingNumber" + t
    ##FindTrackingNumbers = Soup.find_all("div", {"id": "trackingNumber3"})
    #    FindTrackingNumbers = Soup.find_all("div", {"id": NameCheck})
    #    print(FindTrackingNumbers)
    Info = Soup.find_all("fieldset", {"class": "trackhistoryitem"}, "strong")
    print(Info.get_text())
Desired output:
RN231029885CA N / A
RN231029884CA N / A
RN231029883CA 2017/04/04
Sample of the HTML I'm trying to parse:
<fieldset class="trackhistoryitem">
<p><strong>Tracking No. </strong><br><input type="hidden" name="ID_RN231029885CA" value="false">RN231029885CA
</p>
<p><strong>Date / Time </strong><br>
<!--h:outputText value="N/A" rendered="true"/>
<h:outputText value="N/A - N/A" rendered="false"/>
<h:outputText value="N/A" rendered="false"/-->N/A
</p>
<p><strong>Description </strong><br><span id="tapListResultForm:tapResultsItems:1:trk_rl_div_1">
Answer 0 (score: 1)
Using .get_text(), I got back this long, ugly string:
'\nTracking No. RN231029885CA\n \nDate / Time \nN/A\n \nDescription '
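To see where the useful pieces land, here is a standalone sketch that splits that exact string on newlines; the indices assume the whitespace is exactly as shown above:

```python
# The ugly string that .get_text() returned, exactly as shown above.
raw = '\nTracking No. RN231029885CA\n \nDate / Time \nN/A\n \nDescription '

parts = raw.split("\n")
# parts[1] is "Tracking No. RN231029885CA"; its last 13 characters are the
# tracking number. parts[4] is the Date / Time value.
print(parts[1][-13:])  # -> RN231029885CA
print(parts[4])        # -> N/A
```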
So, using some of Python's string functions:
objects = []
for each in soup.find_all("fieldset"):
    each = each.get_text().split("\n")  #split the ugly string up
    each = [each[1][-13:], each[4]]  #grab the parts you want, rmv extra words
    objects.append(each)
Note: this assumes all tracking numbers are 13 characters long; if they aren't, you'll need a regex or some other creative method to extract them.
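For the variable-position case, a minimal regex sketch with Python's re module. It assumes the numbers follow the pattern seen in the question (two uppercase letters, nine digits, two uppercase letters, e.g. "RN231029885CA"); adjust the pattern if your tracking numbers differ:

```python
import re

# Assumed pattern: 2 uppercase letters + 9 digits + 2 uppercase letters.
# Change this if your tracking numbers use a different format.
TRACKING_RE = re.compile(r"\b[A-Z]{2}\d{9}[A-Z]{2}\b")

raw = '\nTracking No. RN231029885CA\n \nDate / Time \nN/A\n \nDescription '
match = TRACKING_RE.search(raw)
print(match.group(0) if match else "no tracking number found")  # -> RN231029885CA
```

This avoids relying on the number always being the last 13 characters of a fixed line.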