HTML parsing - get the text between all tags

Date: 2018-05-14 05:25:12

Tags: python html beautifulsoup

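I want to get the text between all the tags inside a specific tr. I have looked at similar questions, but they are specific to one tag type.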

If I do this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text)

That works for one specific tag, but how do I do it for an entire tr?

<tr>
   <td style="border:0px solid black;padding: 0px 5.4pt;border-color: currentColor windowtext windowtext;border-style: none solid solid;border-width: medium 0pt 0pt;background: white;" width="39">
      <p align="center" style="min-height: 8pt; padding: 0px; text-align: center;"> </p>
   </td>
   <td colspan="7" style="border:0px solid black;vertical-align: top;text-align: left;padding: 0px 5.4pt;border-color: currentColor windowtext windowtext currentColor;border-style: none solid solid none;border-width: medium 0pt 0pt medium;background: white;" width="683">
      <ol style="list-style-type: decimal;">
         <li>Process the return per standard procedures. Refer to the <a class="jive-link-wiki-small" data-containerid="2456" data-containertype="14" data-objectid="12425" data-objecttype="102" href="https://iconnect.sprint.com/docs/DOC-12425">Sprint Satisfaction Guarantee Procedure</a> for steps.</li>
         <li>RMS will reset the eligibility when doing a <strong>Sprint Monthly Installments Return</strong>. If the original transaction was performed in RMS, the system will display a message and advise that a history transaction can be performed or you can proceed with a No History Return</li>
         <li>
            To reset Monthly Installments upgrade eligibility and process the return:
            <ol>
               <li>Return the device.</li>
               <li>Re-access the account to see if the line is still <strong>upgrade-eligible for Monthly Installments</strong>.</li>
            </ol>
            <ul>
               <ul>
                  <li><strong>If so,</strong> proceed with the sale as normal.</li>
                  <li>
                     If the customer's line is showing as <strong>not upgrade-eligible</strong> for Monthly Installments:
                     <ol>
                        <li>Add a note to the customer's account stating the return transaction number and the need for eligibility reset.</li>
                        <li>Reset the customer's eligibility by using the MSA tablet or through iCare <em><strong>or</strong></em></li>
                        <li>Contact <strong>NSS</strong> to request an eligibility reset <strong>only</strong> if the reset was <strong>not successful</strong>.<strong> </strong></li>
                     </ol>
                  </li>
               </ul>
               <ul>
                  <li><span style="font-family: Arial;">Once eligibility is reset, pull up the customer's account again in RMS and process the sale.</span></li>
               </ul>
            </ul>
         </li>
      </ol>
   </td>
</tr>

Expected output: the text between all the tags.

1 answer:

Answer 0 (score: 1)

get_text() collects all child strings and returns them joined with the given separator.

.text is a property backed by the get_text method (undocumented):

print(soup.select('tr')[0].text)
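As a minimal sketch of how the separator behaves, the snippet below calls get_text() on a row with separator and strip set. The HTML string is a hypothetical, shortened stand-in for the row in the question, and html.parser is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the <tr> in the question.
html = """
<table><tr>
  <td><p> </p></td>
  <td><ol>
    <li>Return the device.</li>
    <li>Contact <strong>NSS</strong> to request a reset.</li>
  </ol></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# separator is placed between text fragments; strip=True trims each fragment
# and drops the ones that are empty after trimming.
row_text = row.get_text(separator=" ", strip=True)
print(row_text)
```

Without strip=True, the whitespace between tags is kept, so the plain .text output usually contains many blank lines and runs of spaces.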

With indentation to show the nesting:

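import bs4

# Parse the saved page; 'h.html' is assumed to hold the HTML from the question.
with open('h.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

def get_text(i):
    """Join an item's own strings plus the text of its direct strong/span children."""
    r = []
    for t in i.contents:
        # Plain text node directly inside the item.
        if type(t) is bs4.element.NavigableString:
            r.append(t.strip())
        # Only strong and span children are included; other tags are skipped.
        elif t.name in ['strong', 'span']:
            r.append(t.text.strip())
    return ' '.join(r)

s = soup.select('li')
for i in s:
    # Nesting depth = number of enclosing ol/ul lists, minus one for the top level.
    level = len(i.find_parents('ol') + i.find_parents('ul')) - 1
    print(' ' * level * 5, get_text(i))
    print('-' * 50)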
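The helper above only picks up direct strong and span children, so text inside other inline tags such as a or em is left out. A possible variant, sketched here under assumptions, keeps every string whose nearest li ancestor is the current item. The own_text name and the sample HTML are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a link, nested inline tags, and a nested list.
html = """
<ol>
  <li>Refer to the <a href="#">procedure</a> for steps.</li>
  <li>To reset eligibility:
    <ul>
      <li>Use the tablet <em><strong>or</strong></em></li>
      <li>Contact <strong>NSS</strong>.</li>
    </ul>
  </li>
</ol>
"""

def own_text(li):
    """Text of one <li>, excluding text that belongs to nested <li> items."""
    # Keep only strings whose closest enclosing <li> is this one.
    parts = [s for s in li.find_all(string=True) if s.find_parent("li") is li]
    # Concatenate first, then collapse whitespace, so punctuation that
    # follows an inline tag stays attached to the preceding word.
    return " ".join("".join(parts).split())

soup = BeautifulSoup(html, "html.parser")
lines = []
for li in soup.find_all("li"):
    # Depth = number of enclosing ol/ul lists, minus one for the top level.
    depth = len(li.find_parents(["ol", "ul"])) - 1
    lines.append(" " * (depth * 4) + own_text(li))
print("\n".join(lines))
```

Passing a list of names to find_parents counts ol and ul ancestors in one call, which should give the same depth as adding the two separate results.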