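My question is about a table on a web page (unfortunately I can't share the URL, because it is an internal company site).
The table looks like this: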
Status  Class_code  Major             Started from
Active  4562256     Global Finance    2013
Active  4588222     Global Finance    2014
Active  4552214     Trade Management  2014
Active  8631448     Law               2012
Its HTML is:
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2013>2013</DIV></TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2014></DIV>2014</TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn>
<DIV title=2012>2012</DIV></TD>
...
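Using BeautifulSoup, I want to select only the majors that started in 2014, i.e. "Global Finance" and "International Trade".
I am using the code below, but it lists all the numbers.
find_number = soup.find_all('td', class_='NumColumn')
for fn in find_number:
    results = fn.find_all('div')
    print(results)
How can I select only the rows with "2014"? ("Class_code" is always in column 2; "Started from" is always in column 4.)
Thanks.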
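Answer 0 (score: 1)
For a more powerful, but more complex, solution you can try regular expressions.
https://docs.python.org/2/howto/regex.html
Basically, they let you specify a pattern that the data you want will match.
For example,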
import re

# Match the text inside any <div>...</div>, regardless of tag case
p = re.compile(r'<div[^>]*>(.*?)</div>', re.IGNORECASE | re.DOTALL)
htmldocumentasstring = "However you would do that"
# search() returns only the first match (or None); findall() returns them all
print(p.search(htmldocumentasstring))
So it's not the prettiest or simplest solution, but it is one way.
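As a rough sketch of the regex idea applied to one row of the question's markup: the pattern below captures each div's title attribute and text. The variable names and the trimmed `row_html` string are made up for illustration, and the pattern assumes every div has a title attribute as in the sample; it would need adjusting for other markup.

```python
import re

# Hypothetical stand-in for one row of the page source (copied from the sample).
row_html = '''<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>'''

# Capture the title attribute (quoted or unquoted) and the text of every <div>.
DIV_PATTERN = re.compile(
    r'<div\s+title="?([^">]*)"?\s*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# findall() returns one (title, text) tuple per div, in document order.
divs = DIV_PATTERN.findall(row_html)
print(divs)
```

Regexes like this break easily when the markup changes, so an HTML parser is usually the safer choice for anything beyond a quick check.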
Answer 1 (score: 0)
You can loop over the <tr> elements
and check each row's details in turn...
from bs4 import BeautifulSoup
html = """<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2013>2013</DIV></TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2014></DIV>2014</TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">International Trade</DIV></TD>
<TD class=TextColumn>4552214</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn>
<DIV title=2012>2012</DIV></TD>
..."""
soup = BeautifulSoup(html, "html.parser")
for tr in soup.find_all('tr'):
    # A list of all the divs in this tr.
    divs = tr.find_all("div")
    # The subject is the first div...
    subject = divs[0].text
    # ...and the year is the title of the second div.
    year = divs[1]["title"]
    if year == "2014":
        print(subject)
Updated for your new HTML, which seems inconsistent. Current output:
4588222
International Trade
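However, the TR for International Trade does not follow the same pattern as the other TRs: the code is in the TextColumn cell, while the text is in the NumColumn cell...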
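Since the question says the class code is always in column 2 and the start year always in column 4, one way around the inconsistent divs is to index the cells by position and read each cell's text directly. This is only a sketch: `SAMPLE_HTML` is a trimmed copy of the question's rows with closing tags and a `<table>` wrapper added (an assumption about the real page), and `rows_started_in` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's rows. The </TR> and <table> tags are added
# here as an assumption, so that each <tr> holds exactly its own four cells.
SAMPLE_HTML = """
<table>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn><DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn><DIV title=2013>2013</DIV></TD>
</TR>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn><DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn><DIV title=2014></DIV>2014</TD>
</TR>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn><DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn><DIV title=2014>2014</DIV></TD>
</TR>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn><DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn><DIV title=2012>2012</DIV></TD>
</TR>
</table>
"""


def rows_started_in(html, year):
    """Return (class_code, major) pairs for rows whose 4th cell equals `year`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for tr in soup.find_all("tr", class_="Data"):
        cells = tr.find_all("td")
        if len(cells) < 4:
            continue  # skip rows that do not have all four columns
        # get_text() reads the cell text whether it sits inside the <div> or not
        if cells[3].get_text(strip=True) == year:
            matches.append((cells[1].get_text(strip=True),
                            cells[2].get_text(strip=True)))
    return matches


for class_code, major in rows_started_in(SAMPLE_HTML, "2014"):
    print(class_code, major)
```

Comparing the cell text instead of the div's title attribute means the row with the empty `<DIV title=2014></DIV>` is handled the same way as the others.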
Answer 2 (score: 0)
Another option is to find the '2014' divs you want and use findPrevious (spelled find_previous in current bs4). :)
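from bs4 import BeautifulSoup

# html: the page source as a string
soup = BeautifulSoup(html, "html.parser")
year_divs = soup.find_all('div', attrs={'title': '2014'})
for year_div in year_divs:
    # find_previous is the current spelling of findPrevious
    previous_div = year_div.find_previous('div')
    print(previous_div.text)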
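The previous div holds the class code. If the major is what you are after, the same backwards search can target the nearest preceding TextColumn cell instead. A minimal sketch, assuming the layout from the question; `html_rows` is a shortened, lowercased stand-in for the page and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page, based on three rows of the question's table.
html_rows = """<table>
<tr class=Data><td class=TextColumn>Active</td>
<td class=NumColumn><div title="No:04">4588222</div></td>
<td class=TextColumn>Global Finance</td>
<td class=NumColumn><div title=2014></div>2014</td></tr>
<tr class=Data><td class=TextColumn>Active</td>
<td class=NumColumn><div title="No:05">4552214</div></td>
<td class=TextColumn>International Trade</td>
<td class=NumColumn><div title=2014>2014</div></td></tr>
<tr class=Data><td class=TextColumn>Active</td>
<td class=NumColumn><div title="No:06">8631448</div></td>
<td class=TextColumn>Law</td>
<td class=NumColumn><div title=2012>2012</div></td></tr>
</table>"""

soup = BeautifulSoup(html_rows, "html.parser")
majors = []
for year_div in soup.find_all("div", attrs={"title": "2014"}):
    # Walk backwards from the year div to the closest TextColumn cell,
    # which in this layout is the Major column of the same row.
    major_cell = year_div.find_previous("td", class_="TextColumn")
    majors.append(major_cell.get_text(strip=True))

print(majors)
```

This relies on the Major cell sitting directly before the year cell, so it would pick up the wrong cell if the column order changed.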