Selectively choosing text from a (web page) table

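Time: 2014-08-08 05:13:40

Tags: python beautifulsoup

My question concerns a table on a web page (unfortunately I can't provide the URL, because it is an internal company site).

The table looks like this: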

Status  Class_code  Major           Started from
Active  4562256     Global Finance      2013
Active  4588222     Global Finance      2014
Active  4552214     Trade Management    2014
Active  8631448     Law                 2012

Its HTML is:

<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2013>2013</DIV></TD>
...
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2014></DIV>2014</TD>
...    
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>
...    
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn>
<DIV title=2012>2012</DIV></TD>
...

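What I want to select with BeautifulSoup is only the majors that started in 2014, i.e. "Global Finance" and "International Trade".

I'm using the code below, but it lists all the numbers.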

find_number = soup.find_all('td', class_='NumColumn')

for fn in find_number :
    results = fn.find_all('div')
    print(results)

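How can I select only the rows with "2014"? ("Class_code" is always in column 2; "Started from" is always in column 4.)

Thanks.

3 answers:

Answer 0: (score: 1)

For a more powerful, more sophisticated solution, you could try regular expressions.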

https://docs.python.org/2/howto/regex.html

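Basically, they let you specify a pattern that the data you want will match.

For example,

import re
# Captures the text inside any <div>...</div> pair, whatever the tag's case
p = re.compile(r'<div[^>]*>(.*?)</div>', re.IGNORECASE | re.DOTALL)
htmldocumentasstring = "However you would do that"
# search() returns only the first match (or None); findall() returns all of them
print(p.search(htmldocumentasstring))

So it's not the prettiest or simplest solution, but it is one way.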
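To make the regular-expression idea concrete, here is a minimal sketch run against two rows copied from the question's HTML. The names `div_pattern` and `div_texts` are illustrative only, and the pattern assumes the `<DIV>` elements are never nested; regular expressions tend to be fragile on real-world HTML, so treat this as a rough illustration rather than a recommended approach.

```python
import re

# Two rows copied from the question's HTML.
html = """<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2013>2013</DIV></TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>"""

# Capture the text inside every <div>...</div>, ignoring tag case.
div_pattern = re.compile(r"<div[^>]*>(.*?)</div>", re.IGNORECASE | re.DOTALL)

# match() only tries at the very start of the string, so it finds nothing here.
first_at_start = div_pattern.match(html)
print(first_at_start)

# findall() scans the whole string and returns every captured group.
div_texts = div_pattern.findall(html)
print(div_texts)
```

Note that this only recovers the `<DIV>` contents (class codes and years); the major names live in plain `<TD>` cells, so a parser-based approach is likely a better fit for the actual question.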

Answer 1: (score: 0)

You can loop over the <tr> elements and check each row's details in turn...

from bs4 import BeautifulSoup

html =  """<TR class=Data align=left>
            <TD class=TextColumn>Active</TD>
            <TD class=NumColumn>
            <DIV title="No:03">4562256</DIV></TD>
            <TD class=TextColumn>Global Finance</TD>
            <TD class=NumColumn>
            <DIV title=2013>2013</DIV></TD>
            ...
            <TR class=Data align=left>
            <TD class=TextColumn>Active</TD>
            <TD class=NumColumn>
            <DIV title="No:04">4588222</DIV></TD>
            <TD class=TextColumn>Global Finance</TD>
            <TD class=NumColumn>
            <DIV title=2014></DIV>2014</TD>
            ...    
            <TR class=Data align=left>
            <TD class=TextColumn>Active</TD>
            <TD class=NumColumn>
            <DIV title="No:05">International Trade</DIV></TD>
            <TD class=TextColumn>4552214</TD>
            <TD class=NumColumn>
            <DIV title=2014>2014</DIV></TD>
            ...    
            <TR class=Data align=left>
            <TD class=TextColumn>Active</TD>
            <TD class=NumColumn>
            <DIV title="No:06">8631448</DIV></TD>
            <TD class=TextColumn>Law</TD>
            <TD class=NumColumn>
            <DIV title=2012>2012</DIV></TD>
            ..."""

soup = BeautifulSoup(html, "html.parser")

for tr in soup.find_all('tr'):

    ## a list of all the divs in your tr.
    divs = tr.find_all("div")

    ## the subject is the first...
    subject = divs[0].text

    ## ...and the year the second "div" in divs.
    year = divs[1]["title"]

    if year == "2014":
        print(subject)

Updated for your new HTML, which seems inconsistent. Current output:

4588222
International Trade

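However, the International Trade TR does not follow the same pattern as the other TRs: the code is in the TextColumn cell, while the text is in the NumColumn cell...

Answer 2: (score: 0)

Another option is to find the '2014' entries you want and use findPrevious. :)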
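Since the question states that the class code is always in column 2 and the start year always in column 4, one way to sidestep the inconsistent `<DIV>` placement is to read the cells by position instead. The following is only a sketch under that assumption: `rows_started_in` is a hypothetical helper name, the HTML is copied from the question with the "..." separators left out, and Python 3 with the built-in `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# Rows copied from the question, including the row whose year sits outside its <DIV>.
html = """<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:03">4562256</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2013>2013</DIV></TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2014></DIV>2014</TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn>
<DIV title=2012>2012</DIV></TD>"""

soup = BeautifulSoup(html, "html.parser")


def rows_started_in(soup, year):
    """Return (class_code, major) pairs for rows whose 4th cell equals `year`."""
    matches = []
    for tr in soup.find_all("tr"):
        # The rows have no closing </TR>, so a parser may nest later rows inside
        # earlier ones; the first four cells in document order still belong to
        # the current row.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")[:4]]
        if len(cells) == 4 and cells[3] == year:
            matches.append((cells[1], cells[2]))
    return matches


results = rows_started_in(soup, "2014")
for class_code, major in results:
    print(class_code, major)
```

Reading the whole cell text means it does not matter whether the year sits inside the `<DIV>` or directly in the `<TD>`, as it does in the second row.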

soup = BeautifulSoup(html, "html.parser")

aaa = soup.find_all('div', attrs = {'title':'2014'})

for bbb in aaa :
    ccc = bbb.findPrevious('div')
    print(ccc.text)
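The snippet above lands on the previous `<DIV>`, which in the question's HTML holds the class code rather than the major. A possible variation, sketched here under the assumption that the major always sits in the nearest preceding `TextColumn` cell, is to search backwards for that `<TD>` instead. It uses `find_previous`, the bs4 spelling of `findPrevious`; the sample rows are copied from the question and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Three rows copied from the question's HTML ("..." separators left out).
html = """<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:04">4588222</DIV></TD>
<TD class=TextColumn>Global Finance</TD>
<TD class=NumColumn>
<DIV title=2014></DIV>2014</TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:05">4552214</DIV></TD>
<TD class=TextColumn>International Trade</TD>
<TD class=NumColumn>
<DIV title=2014>2014</DIV></TD>
<TR class=Data align=left>
<TD class=TextColumn>Active</TD>
<TD class=NumColumn>
<DIV title="No:06">8631448</DIV></TD>
<TD class=TextColumn>Law</TD>
<TD class=NumColumn>
<DIV title=2012>2012</DIV></TD>"""

soup = BeautifulSoup(html, "html.parser")

majors = []
for year_div in soup.find_all("div", attrs={"title": "2014"}):
    # Walk backwards in document order to the closest TextColumn cell,
    # which is assumed to hold the major for the same row.
    major_cell = year_div.find_previous("td", class_="TextColumn")
    majors.append(major_cell.get_text(strip=True))

print(majors)
```

This relies on the `title` attribute carrying the year even when the `<DIV>` itself is empty, which holds for the rows shown in the question but may not hold for the whole table.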