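I'm trying to use BeautifulSoup to scrape tracking information from a shipper's website. However, the HTML isn't formatted in a way that suits what I'm trying to do. The source text contains unnecessary whitespace, which clutters my output. Ideally I'd like to grab only the date here, but I'd settle for "Shipped" and the date as long as they are on the same line.
I tried using .replace(" ", "") and .strip(), without success.
Python script: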
from bs4 import BeautifulSoup
import requests

TrackList = ["658744424"]
for TrackNum in TrackList:
    # Fetch the tracking page for this tracking number
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/' + TrackNum + "/").text
    soup = BeautifulSoup(source, 'lxml')
    # .text keeps all of the whitespace found in the markup
    ShipDate = soup.find('p', class_="Track-meter-itemLabel text--center").text
    print(ShipDate)
HTML source:
<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
Shipped
</strong>
5/23/2019
</p>
This is what gets returned, with extra whitespace and blank lines:
Shipped
5/23/2019
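Answer 0 (score: 0)
Try:
from bs4 import BeautifulSoup
trac = """[your HTML code above]"""
soup = BeautifulSoup(trac, "lxml")
# split() with no arguments splits on any run of whitespace, newlines included,
# so joining with a single space leaves exactly one space between the pieces
' '.join(soup.text.split())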
Output:
'Shipped 5/23/2019'
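Answer 1 (score: 0)
The stripped_strings generator you're looking for is already built into BeautifulSoup, but it isn't widely known. It yields each text fragment inside the tag with leading and trailing whitespace removed, and skips fragments that are whitespace only.
### Your code
for ShipDate in soup.find('p', class_="Track-meter-itemLabel text--center").stripped_strings:
    print(ShipDate)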
Output:
Shipped
5/23/2019
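If the goal is a single line, or only the date, the same generator can be joined or indexed. The following is a minimal sketch, assuming the markup matches the snippet in the question. The `html` string and the built-in `html.parser` backend are stand-ins so the example runs without a network request, and taking the last fragment as the date is an assumption about this page's layout.

```python
from bs4 import BeautifulSoup

# Stand-in for the page source; assumed to match the snippet in the question.
html = """
<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
Shipped
</strong>
5/23/2019
</p>
"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("p", class_="Track-meter-itemLabel text--center")

# stripped_strings yields each text fragment with surrounding whitespace removed.
parts = list(label.stripped_strings)

one_line = " ".join(parts)   # status and date on one line
ship_date = parts[-1]        # assumes the date is the last fragment in the tag

# get_text() with a separator and strip=True gives the same one-line result.
same_line = label.get_text(" ", strip=True)

print(one_line)
print(ship_date)
```

Indexing `parts[-1]` is fragile if the site adds more text inside the tag, so checking `len(parts)` first may be worthwhile.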
Answer 2 (score: 0)
Use a regular expression:
from bs4 import BeautifulSoup
import requests
import re

TrackList = ["658744424"]
for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/' + TrackNum + "/").text
    soup = BeautifulSoup(source, 'lxml')
    # \s+ matches any run of whitespace, including newlines, so one substitution
    # is enough; no extra split('\n') / join step is needed afterwards
    print(re.sub(r'\s+', ' ', soup.select_one('.Track-meter-itemLabel').text.strip()))
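A regular expression can also pull out the date on its own, which is what the question ideally wanted. This is only a sketch: the `raw` string is an assumed stand-in for the text returned in the question, and the M/D/YYYY pattern is a guess at the site's date format based on the single example shown.

```python
import re

# Stand-in for the .text value from the question (whitespace layout is assumed).
raw = "\n\nShipped\n\n5/23/2019\n"

# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
one_line = re.sub(r"\s+", " ", raw).strip()

# Hypothetical pattern for M/D/YYYY dates; adjust if the site uses another format.
match = re.search(r"\b\d{1,2}/\d{1,2}/\d{4}\b", one_line)
ship_date = match.group(0) if match else None

print(one_line)
print(ship_date)
```

Returning `None` when nothing matches keeps the loop from raising on a tracking page that has no shipped date yet.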