How do I format text so that it looks the way it is displayed on the website?

Time: 2019-07-01 20:30:57

Tags: python selenium

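I am scraping a website with Python and Selenium. This is the markup on the site:

[screenshot]

I want the text to look the way it is displayed on the website, that is, with its line breaks intact and in an organized, readable format.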

I tried using

driver.find_element_by_class_name('record-content.record-information.record-content_j').text

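but the result contains \n\n characters. I tried print(text), and that looks better. Is there a way to store the text in a DataFrame or something similar, so that it is saved in an organized format?

The website looks like this:

[screenshot]

When I try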

rawData=driver.find_element_by_class_name('record-content.record-information.record-content_j').text
sanitizedData = rawData.replace('\n','')
print(BeautifulSoup(sanitizedData, 'html.parser').prettify())

the output looks like this:

[screenshot]

The 'br' tags that mark the line breaks have simply disappeared.

1 answer:

Answer 0: (score: 0)

Since we are extracting the content via .text, which does not include the <br> tags, we can use BeautifulSoup to do the actual prettifying of the output. Likewise, if you want to keep the HTML, you can read the element's HTML instead of its text and remove any newline characters where needed. Hope this helps :)
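As a rough illustration of the idea above, the sketch below turns an HTML fragment into one clean line per <br> and then into rows that could be stored in a table. It makes several assumptions: the fragment would come from something like element.get_attribute('innerHTML') in Selenium, the helper names (BrAwareTextExtractor, html_to_lines, text_to_lines, lines_to_rows) are hypothetical, the sample record is invented, and the standard library's HTMLParser is used in place of BeautifulSoup only to keep the example dependency-free.

```python
import re
from html.parser import HTMLParser


class BrAwareTextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, turning <br> tags into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both end up here; record an explicit line break.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse source whitespace (including newlines) the way a browser does,
        # so only <br> tags produce line breaks.
        self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        return "".join(self.parts)


def text_to_lines(raw):
    """Split text on newlines, strip each line, and drop empty ones ("\\n\\n")."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def html_to_lines(fragment):
    """Convert an HTML fragment to a list of visible text lines."""
    parser = BrAwareTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return text_to_lines(parser.text())


def lines_to_rows(lines):
    """Build one record per line, suitable for a table-like structure."""
    return [{"line_no": i, "text": line} for i, line in enumerate(lines, start=1)]


# Invented sample standing in for the element's HTML.
sample_html = "Title: Example\n   record<br>Author: A. Writer<br/><br>Year: 2019"
rows = lines_to_rows(html_to_lines(sample_html))
for row in rows:
    print(row["line_no"], row["text"])
```

If the plain .text value is good enough, text_to_lines(raw_text) alone removes the blank lines left by the \n\n runs. In either case the resulting list of dictionaries can be passed to pandas.DataFrame(rows) to get one row per line.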