How do I print web-scraped text on a single line?

Asked: 2019-06-02 19:14:13

Tags: python beautifulsoup

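I'm trying to use BeautifulSoup to scrape tracking information from a shipper's website. However, the HTML is formatted in a way that doesn't suit what I'm trying to do: the text in the source contains unnecessary whitespace, which clutters my output. Ideally I'd get only the date here, but I'd settle for "Shipped" plus the date, as long as they end up on the same line.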

I tried .replace(" ","").strip(), without success.

Python script:

from bs4 import BeautifulSoup
import requests

TrackList = ["658744424"]


for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
    soup = BeautifulSoup(source, 'lxml')
    ShipDate = soup.find('p', class_="Track-meter-itemLabel text--center").text
    print(ShipDate)

HTML source:

<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
                          Shipped
                        </strong>
                        5/23/2019
                      </p>

This is what gets returned, with extra whitespace and blank lines:

                      Shipped

                    5/23/2019
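
A likely reason the .replace(" ","").strip() attempt looked like it did nothing: it removes spaces and trims the ends, but the newlines between "Shipped" and the date are left in place. The sketch below reproduces that on a plain string. The exact whitespace in `raw` is an assumption modeled on the output above, not the site's real response.

```python
# Text shaped like the returned output (the exact whitespace is an assumption).
raw = "\n\n                      Shipped\n\n                    5/23/2019\n  "

# The attempt from the question: spaces are removed and the ends are trimmed,
# but the newlines in the middle survive, so it still prints on several lines.
attempt = raw.replace(" ", "").strip()
print(repr(attempt))

# str.split() with no arguments splits on any run of whitespace,
# newlines included, so rejoining with one space gives a single line.
one_line = " ".join(raw.split())
print(one_line)
```

The split-and-join idiom only touches whitespace, so multi-word text keeps its internal spaces.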

3 answers:

Answer 0 (score: 0)

Try:

trac = """[your HTML code above]"""  # placeholder: paste the HTML source from the question
soup = BeautifulSoup(trac, "lxml")
soup.text.replace(' ','').replace('\n',' ').strip()

Output:

'Shipped  5/23/2019'
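
For reference, here is a self-contained sketch of this approach with the HTML from the question pasted in. It swaps in the built-in "html.parser" instead of "lxml" so that nothing beyond bs4 is needed; that substitution is an assumption of this sketch, not part of the answer. It also shows why the result has two spaces in the middle, and a variant that leaves exactly one.

```python
from bs4 import BeautifulSoup

# HTML snippet copied from the question.
trac = """<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
                          Shipped
                        </strong>
                        5/23/2019
                      </p>"""

# "html.parser" ships with Python; the answer itself uses "lxml".
soup = BeautifulSoup(trac, "html.parser")

# The answer's chain: every newline becomes one space, so two newlines
# in a row between the label and the date leave two spaces behind.
chained = soup.text.replace(" ", "").replace("\n", " ").strip()

# Splitting on whitespace runs and rejoining leaves a single space instead.
normalized = " ".join(soup.text.split())

print(repr(chained))
print(repr(normalized))
```

One caveat with the chained version: because it deletes every space, a multi-word label such as "Out for delivery" would come out as "Outfordelivery". The split-and-join variant does not have that problem.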

Answer 1 (score: 0)

The stripped_strings generator you're looking for is already built into BeautifulSoup, though it isn't widely known.

### Your code

for ShipDate in soup.find('p', class_="Track-meter-itemLabel text--center").stripped_strings:
    print(ShipDate)

Output:

Shipped
5/23/2019
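
Since the question asks for everything on one line (or just the date), the fragments from stripped_strings can be joined or indexed. A minimal sketch, run against the HTML from the question rather than the live site, and using the built-in "html.parser" as an assumption so it works without lxml:

```python
from bs4 import BeautifulSoup

# HTML snippet copied from the question.
html = """<p class="Track-meter-itemLabel text--center">
<strong class="text--bold">
                          Shipped
                        </strong>
                        5/23/2019
                      </p>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("p", class_="Track-meter-itemLabel")

# Join the stripped fragments with a single space to get one line.
one_line = " ".join(label.stripped_strings)

# get_text() with a separator and strip=True produces the same string.
same_line = label.get_text(" ", strip=True)

# If only the date is wanted, it is the last stripped fragment in this markup.
ship_date = list(label.stripped_strings)[-1]

print(one_line)
print(ship_date)
```

Taking the last fragment assumes the date always follows the bold label, as it does in the snippet above; other shipment states on the site may be laid out differently.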

Answer 2 (score: 0)

Use a regular expression:

from bs4 import BeautifulSoup
import requests
import re

TrackList = ["658744424"]

for TrackNum in TrackList:
    source = requests.get('https://track.xpoweb.com/en-us/ltl-shipment/'+TrackNum+"/").text
    soup = BeautifulSoup(source, 'lxml')
    # Collapse every run of whitespace (spaces and newlines) into a single space
    print(re.sub(r'\s+', ' ', soup.select_one('.Track-meter-itemLabel').text).strip())
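
If only the date is needed, a regular expression can also pull it straight out of the element's text. The sketch below works on a plain string so it can be tried offline. The helper name `extract_date` and the M/D/YYYY pattern are assumptions based on the single sample date in the question.

```python
import re

# Matches dates shaped like 5/23/2019 (1-2 digit month and day, 4-digit year).
# The format is an assumption taken from the one sample in the question.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_date(text):
    """Return the first M/D/YYYY date found in text, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


# Text shaped like the .text of the <p> element (whitespace is illustrative).
sample = "\n\n        Shipped\n      \n      5/23/2019\n    "
print(extract_date(sample))
```

In the loop above, the same helper could be applied to soup.select_one('.Track-meter-itemLabel').text; returning None when nothing matches makes it easy to skip shipments that have no date yet.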