Unable to remove the huge gaps of whitespace between results

Asked: 2017-10-19 13:32:51

Tags: python python-3.x web-scraping data-cleaning

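I've written a script in Python to scrape some text from a few HTML elements. The script parses them fine. The problem is that the parsed data comes back with large runs of whitespace inside it. I tried the .strip() method, but it had no effect on the result. How can I fix this?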

The HTML elements:

html="""
<div class="organisation-details">

    <div class="personnel shaded">
                        <h3>KEY PERSONNEL</h3>
                        <p>
                                Director: Andrew Bickerton<br>
                                Director: Andrew Connor<br>
                                Office Manager: Tom Marchant<br>
                        </p>
                    </div>

    <div class="company-type shaded">
                        <h3>COMPANY TYPE</h3>
                        <p>
                                                        Importer
                        </p>
                    </div>

    <div class="company-details shaded">
                        <h3>COMPANY DETAILS</h3>
                        <p>
                                Year Established: 1984 <br>
                                                        VAT No: GB 413 3611 93<br>
                                                        No of Employees: 1-20<br>
                        </p>
                    </div>


</div>
"""

The script:

from lxml.html import fromstring

tree = fromstring(html)
for title in tree.cssselect(".organisation-details"):
    key = title.cssselect("h3:contains('KEY PERSONNEL')+p")[0].text_content().strip()
    details = title.cssselect("h3:contains('COMPANY DETAILS')+p")[0].text_content().strip()
    ctype = title.cssselect("h3:contains('COMPANY TYPE')+p")[0].text_content().strip()
    print(key,details,ctype)

My output:

Director: Andrew Bickerton
                                Director: Andrew Connor
                                Office Manager: Tom Marchant Year Established: 1984 
                                                        VAT No: GB 413 3611 93
                                                        No of Employees: 1-20 Importer

The result I'm after (or something close to it):

Director: Andrew Bickerton
Director: Andrew Connor
Office Manager: Tom Marchant 
Year Established: 1984 
VAT No: GB 413 3611 93
No of Employees: 1-20
Importer

2 answers:

Answer 0 (score: 2)

The problem is that key and details contain multiple lines, with whitespace in the middle of the string. .strip() only trims the two ends of a string, so you need to split each one on newlines and strip every piece. Something like:

for piece in key.split('\n'):
    print(piece.strip())

and repeat for details and ctype.
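The split-and-strip idea can be wrapped in a small helper so it does not have to be repeated for each variable. This is only a sketch: clean_lines is a hypothetical name, and the key string below is a hand-written stand-in for what text_content() returns for the KEY PERSONNEL paragraph, not output captured from lxml.

```python
def clean_lines(text):
    """Split text on line breaks, strip each piece, and drop empty pieces."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Stand-in (an assumption) for the text of the KEY PERSONNEL <p> element.
key = """
                                Director: Andrew Bickerton
                                Director: Andrew Connor
                                Office Manager: Tom Marchant
                        """

for line in clean_lines(key):
    print(line)
```

In the original script the same helper could be applied to key, details and ctype in turn, printing one cleaned line at a time.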

Answer 1 (score: 0)

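When a browser displays that HTML, it pays no attention to the extra whitespace at the beginning and end of the strings. Python (like any other programming language) takes the whitespace in a string literally. Coincidentally, I was stumped by a similar situation just yesterday.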
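To illustrate the difference, here is a minimal sketch comparing .strip() with collapsing every run of whitespace, which is roughly how the text would look when rendered. The raw string is a made-up stand-in for the COMPANY DETAILS text, not real lxml output.

```python
# Stand-in (an assumption) for the text of the COMPANY DETAILS <p> element.
raw = "  Year Established: 1984 \n      VAT No: GB 413 3611 93\n      No of Employees: 1-20\n  "

# strip() removes whitespace only at the two ends; the inner newlines
# and indentation are left untouched.
stripped = raw.strip()

# split() with no arguments splits on any run of whitespace, so joining
# the pieces with a single space collapses every gap.
collapsed = " ".join(raw.split())

print(repr(stripped))
print(repr(collapsed))
```

Collapsing puts everything on one line, so it suits cases where the line breaks do not matter. To keep one item per line, split on newlines and strip each piece instead.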