BeautifulSoup 4 find() cannot locate a table whose summary attribute contains a newline and whitespace

Asked: 2019-03-26 10:10:17

Tags: python beautifulsoup

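I am trying to parse an AWR report to get information about long-running SQL. The report has more than 40 tables, all with the same class but different summaries. BS4 in Python is able to find several of these tables, but the one table that contains all of the SQL text has a summary with a newline and whitespace in it, as shown below: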

HTML markup from the AWR file:

<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id

I tried to locate this table with BS4 find(), but it fails every time. Any help would be appreciated.

from bs4 import BeautifulSoup as BS4    
awrFile='/XXXXXXXXXXXXXXXXXXX/test/XXXXXXXXXXDB69-1.html'
f_awr = open(awrFile, 'r')
soup  = BS4(f_awr, 'html.parser')
sqlTextInfoTable = soup.find('table', {'summary':'This table displays the text of the SQL statements which have been referred to in the report'})

print(sqlTextInfoTable) prints None.

4 Answers:

Answer 0 (score: 0)

Could you just use pandas and .read_html(), since the data sits inside <table> tags?

html = '''<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id'''


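from io import StringIO

import pandas as pd

# Wrap the literal HTML in StringIO; recent pandas versions deprecate passing a raw string.
table = pd.read_html(StringIO(html))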
sqlTextInfoTable = table[0]

Then do the same with your file:

import pandas as pd

awrFile='/XXXXXXXXXXXXXXXXXXX/test/XXXXXXXXXXDB69-1.html'
f_awr = open(awrFile, 'r')
table = pd.read_html(f_awr)
sqlTextInfoTable = table[0]

Output:

print (sqlTextInfoTable)
        0
0  SQL Id

Answer 1 (score: 0)

You can find_all() the tables and iterate over them like this:

from bs4 import BeautifulSoup as BS4  # needed for the BS4(...) call below
import pandas as pd

awrFile='/XXXXXXXXXXXXXXXXXXX/test/XXXXXXXXXXDB69-1.html'
f_awr = open(awrFile, 'r')
soup  = BS4(f_awr, 'html.parser')

for table in soup.find_all('table'):
    df = pd.read_html(str(table))
    print(df) 
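
Since the report's tables differ only in their summary, one variation on this loop is to index every table by its whitespace-normalized summary and then look up the one you want. This is only a sketch: the markup is a made-up stand-in for the AWR file, and the name tables_by_summary is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the AWR report (not from the real file).
html = '''<table class="tdiff" summary="First
      table"><tr><td>a</td></tr></table>
<table class="tdiff" summary="Second table"><tr><td>b</td></tr></table>
<table class="tdiff"><tr><td>c</td></tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# summary=True keeps only tables that have a summary attribute.
# ' '.join(s.split()) collapses newlines and runs of spaces into single spaces.
tables_by_summary = {
    ' '.join(table['summary'].split()): table
    for table in soup.find_all('table', summary=True)
}

print(sorted(tables_by_summary))
```

Each value is the matching Tag, so it can be passed on to pandas or parsed further with BS4.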

Answer 2 (score: 0)

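You may be able to use a CSS attribute=value selector to match on a substring. Here I use ^ (the starts-with operator). You could also use the * (contains) operator.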

matches = soup.select("table[summary^='This table displays the text of the SQL statements which have been']")

Answer 3 (score: 0)

Use re to search for specific text in the summary attribute.

from bs4 import BeautifulSoup
import re

data='''<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id'''
soup=BeautifulSoup(data,'html.parser')
sqlTextInfoTable =soup.find('table', summary=re.compile('This table displays the text of the SQL statements'))
print(sqlTextInfoTable)

OR

from bs4 import BeautifulSoup
import re
data='''<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id'''
soup=BeautifulSoup(data,'html.parser')
sqlTextInfoTable =soup.find('table', summary=re.compile('referred to in the report'))
print(sqlTextInfoTable)

Output:

<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id</th></tr></tbody></table>
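
The exact-string find() in the question fails because the parsed attribute value still contains the newline and indentation, so it is not equal to the single-line string. Both regexes above sidestep that by matching only a fragment on one side of the line break. If you would rather compare the whole summary, one option is to pass a function as the attribute filter and normalize whitespace inside it. This is a sketch only; WANTED and summary_matches are hypothetical names.

```python
from bs4 import BeautifulSoup

data = '''<table border="0" class="tdiff" summary="This table displays the text of the SQL statements which have been
      referred to in the report">
<tbody><tr><th class="awrbg" scope="col">SQL Id</th></tr></tbody></table>'''

soup = BeautifulSoup(data, 'html.parser')

# The full summary, written on a single line with single spaces.
WANTED = ('This table displays the text of the SQL statements '
          'which have been referred to in the report')

def summary_matches(value):
    # value is None for tables that have no summary attribute.
    # ' '.join(value.split()) collapses the newline and indentation to one space.
    return value is not None and ' '.join(value.split()) == WANTED

sqlTextInfoTable = soup.find('table', summary=summary_matches)
print(sqlTextInfoTable is not None)
```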