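I am new to Python and web scraping, hence this question.
I only want to get the table that contains specific content.
This is what the HTML looks like. It is not the first table in the page, so I am selecting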
</TABLE></TD></TR>
<TR>
<TD COLSPAN=7 class='x2'>
</TD>
</TR>
<TR>
<TD style="vertical-align:bottom" class='x3'>
EingangsdatumDMYY</TD>
<TD style="vertical-align:bottom" class='x4'>
Techniker</TD>
<TD style="vertical-align:bottom" class='x5'>
Techn.</TD>
<TD style="vertical-align:bottom" class='x6'>
Kunde</TD>
<TD style="vertical-align:bottom" class='x7'>
OffAuftrag</TD>
<TD style="vertical-align:bottom" class='x8'>
Planungsdatum</TD>
<TD style="vertical-align:bottom" class='x8'>
Herstellerreferenz</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x18_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product B**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
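I know the class names used in this markup are odd, but the HTML is generated, so I cannot change them.
This is the code I currently use to fetch the HTML with BS4: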
import urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'Website.html'
# query the website and return the html to the variable page
page = urllib2.urlopen(quote_page)
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
tables = soup.findChildren('table')
my_table = tables[1]
rows = my_table.findChildren(['th', 'tr'])
print my_table
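Now the problem:
I do get the first row, but I want to search the whole page for every table element that contains the text "Product A" and save each parent tag in an array.
For example, once the code is finished, the output would be: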
<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
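So the code has to:
1) search the HTML for the text "Product A";
2) grab the parent tag and save it in a variable;
3) repeat this for the whole HTML.
I am grateful for any tip.
Thanks and regards, 亚尼克·L.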
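Answer 0 (score: 0)
You can use a regular expression in bs4 to find elements that contain specific text.
If you want to find every td that contains a specific string, this is what you need: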
import re
from bs4 import BeautifulSoup
page = '''
<TR>
<TD COLSPAN=7 class='x2'>
</TD>
</TR>
<TR>
<TD style="vertical-align:bottom" class='x3'>
EingangsdatumDMYY</TD>
<TD style="vertical-align:bottom" class='x4'>
Techniker</TD>
<TD style="vertical-align:bottom" class='x5'>
Techn.</TD>
<TD style="vertical-align:bottom" class='x6'>
Kunde</TD>
<TD style="vertical-align:bottom" class='x7'>
OffAuftrag</TD>
<TD style="vertical-align:bottom" class='x8'>
Planungsdatum</TD>
<TD style="vertical-align:bottom" class='x8'>
Herstellerreferenz</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x18_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product B**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
</TD>
<TD class='x15_0'>
</TD>
</TR>
'''
soup = BeautifulSoup(page, 'html.parser')
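# find_all is the current name for findChildren; string= replaces the older text= argument (bs4 4.4.0+)
cells = soup.find_all('td', string=re.compile(r'Product A'))
print(cells)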
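The question also asks for the parent of each match. The following is only a minimal sketch of that step, assuming the enclosing `tr` is the parent of interest and using bs4's `find_parent`. The shortened markup and the names `sample`, `cells`, `cell_classes` and `rows` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the HTML in the question.
sample = '''
<TABLE>
<TR><TD class='x12_0'>Company Name</TD><TD class='x17_0'>1000000 ,STATUS, Product A</TD></TR>
<TR><TD class='x12_0'>Company Name</TD><TD class='x18_0'>1000000 ,STATUS, Product B</TD></TR>
<TR><TD class='x12_0'>Company Name</TD><TD class='x19_0'>1000000 ,STATUS, Product A</TD></TR>
</TABLE>
'''

soup = BeautifulSoup(sample, 'html.parser')

# Every <td> whose text matches the pattern.
cells = soup.find_all('td', string=re.compile(r'Product A'))

# bs4 returns the class attribute as a list, so take its first entry.
cell_classes = [cell['class'][0] for cell in cells]

# The enclosing <tr> of each matching cell; use find_parent('table') for the table instead.
rows = [cell.find_parent('tr') for cell in cells]

print(cell_classes)
print(len(rows))
```

Keeping `cells` and `rows` as plain lists makes it easy to loop over them afterwards, for example to read the other cells of each matching row.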
Answer 1 (score: 0)
With bs4 4.7.1+, you can use :contains to get the tables that contain specific text.
tables = soup.select('table:contains("Product A")')
print(tables)
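If you need to be more specific about where the text occurs but still want the whole table, use :has to select tables that have a td containing the text: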
tables = soup.select('table:has(td:contains("Product A"))')
print(tables)
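On newer installations the selector may need a different spelling: Soup Sieve 2.1 renamed `:contains` to `:-soup-contains` and deprecated the old name. The sketch below assumes such a version is installed. The two-table markup and the names `sample`, `table_ids` and `cell_classes` are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables; only the second one mentions "Product A".
sample = '''
<table id="first"><tr><td>Header only</td></tr></table>
<table id="second"><tr><td class="x17_0">1000000, Product A</td></tr>
<tr><td class="x18_0">1000000, Product B</td></tr></table>
'''

soup = BeautifulSoup(sample, 'html.parser')

# Whole tables that have a <td> containing the text.
tables = soup.select('table:has(td:-soup-contains("Product A"))')
table_ids = [table['id'] for table in tables]

# Only the matching cells, which is closer to the output asked for in the question.
cells = soup.select('td:-soup-contains("Product A")')
cell_classes = [cell['class'][0] for cell in cells]

print(table_ids)
print(cell_classes)
```

Selecting `td` directly returns the cells themselves, while the `table:has(...)` form returns each enclosing table once, however many of its cells match.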