Question

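So, I'm a programming newbie with limited knowledge of Python and HTML! What I want to do is run a web-crawling Python program that extracts some specific names from a number of HTML pages.

Suppose a page at some URL contains this HTML: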

<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman"           SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000"  ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New   Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24,  2005</FONT></TD></TR>

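which renders like this:

   /s/ ROBERT F. MANGANO
      President, Chief Executive Officer and Director

(Principal Executive Officer)

     March 24, 2005

I want to extract the person's name and title. So, in Python, I wrote this: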

def htmlParser(self):
    pageTree = html.fromstring(self.pageContent)
    print "page parsed!"
    tdTexts =  pageTree.xpath("//td/descendant::*/text()")
    cleanTexts = [eachText.strip() for eachText in tdTexts if eachText.strip()]
    for i in range(1,len(cleanTexts)):
        if ('/s/' in cleanTexts[i] and (i+1) < len(cleanTexts)):
            title = []
            title = [cleanTexts [i+1] for eachKeyword in titleKeywords if eachKeyword in cleanTexts [i+1].lower()]
            if (title):
                print title
                self.boards.append([self.pageURL,cleanTexts[i].replace('/s/',''),cleanTexts [i+1]])
                print self.boards
            elif (i+2) < len(cleanTexts):
                title = [cleanTexts [i+2] for eachKeyword in titleKeywords if eachKeyword in cleanTexts [i+2].lower()]
                if (title):
                    self.boards.append([self.pageURL,cleanTexts[i].replace('/s/',''),cleanTexts [i+2]])

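The only pattern I found is that /s/ is repeated throughout the table, so I am sticking to that. The code above works perfectly for me and gives me this:

; ROBERT F. MANGANO; President and Chief Executive Officer

Now I am facing another form: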

</TR>
<TR VALIGN="TOP">
<TD WIDTH="40%" ALIGN="CENTER" VALIGN="CENTER"><FONT SIZE=2>/s/&nbsp;&nbsp;</FONT><FONT     SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</FONT><HR NOSHADE>    <FONT SIZE=2> Jonathan C. Coon</FONT></TD>
<TD WIDTH="3%" VALIGN="CENTER"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="58%" VALIGN="CENTER"><FONT SIZE=2>Chief Executive Officer and Director (principal    executive officer)</FONT></TD>
 </TR>

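which renders like this:

/s/ JONATHAN C. COON Jonathan C. Coon Chief Executive Officer and Director (principal executive officer)

It is mostly the same, but there is this "&nbsp; and FONT" stuff between /s/ and the name (in the previous form, /s/ was immediately followed by the name). I don't know much HTML, so this is the only difference I could spot between the two snippets. If there are more differences, please let me know.

I thought my code would work the same way in this case, since I use "//td/descendant::*/text()" to get rid of all the HTML tags and look only at the words. But when I run the code on the latter HTML, it gives me: ; ; Chief Executive Officer

As you can see, the name is not captured in this case. I can't figure out how to change the code to cover both cases, and since I don't know much about HTML, I can't search effectively for a solution.

Can anyone help me modify the code so that it captures the name in both forms?

Many thanks.

P.S.: Sorry if I didn't explain this well. As I said, I'm not a professional! If my question is missing some explanation, please let me know.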

Answer 1

Use BeautifulSoup to parse the HTML:

from bs4 import BeautifulSoup

html = """
<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman"           SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000"  ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New   Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24,  2005</FONT></TD></TR>
"""

# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(html, "html.parser")

print("\n".join([x.text.strip() for x in soup.find_all("td")]))

Output:

/s/ ROBERT F. MANGANO

President, Chief Executive Officer and Director (Principal Executive Officer)

March 24,  2005
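
The snippet above only prints the cell text for the first form. As a rough sketch of how the same idea could cover both forms from the question: read the signature cell as a whole instead of one text node at a time, collapse the non-breaking spaces, and stop at the <HR> so the repeated name below the rule is ignored. The helper names (clean, signature_name, extract_signers) and the shortened sample rows are made up for illustration, and the sketch assumes the title sits in the next non-empty cell of the same row and that tables are not nested.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened versions of the two rows from the question (illustrative only).
FORM_A = """<TR>
<TD VALIGN="top"> <P><FONT SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR NOSHADE></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top"> <P><FONT SIZE="2">President, Chief Executive Officer and Director</FONT></P></TD>
</TR>"""

FORM_B = """<TR VALIGN="TOP">
<TD><FONT SIZE=2>/s/&nbsp;&nbsp;</FONT><FONT SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2>&nbsp;&nbsp;</FONT><HR NOSHADE> <FONT SIZE=2> Jonathan C. Coon</FONT></TD>
<TD><FONT SIZE=2>&nbsp;</FONT></TD>
<TD><FONT SIZE=2>Chief Executive Officer and Director (principal    executive officer)</FONT></TD>
</TR>"""


def clean(text):
    # str.split() without arguments also splits on non-breaking spaces (\xa0),
    # so this collapses every whitespace run into a single space.
    return " ".join(text.split())


def signature_name(cell):
    """Return the name that follows '/s/' in a cell, or None if there is none."""
    parts = []
    for node in cell.descendants:
        if isinstance(node, Tag) and node.name == "hr":
            break  # assumption: text after the rule only repeats the name
        if isinstance(node, NavigableString):
            parts.append(str(node))
    text = clean(" ".join(parts))
    if not text.startswith("/s/"):
        return None
    return text[len("/s/"):].strip() or None


def extract_signers(html_text):
    """Return a list of (name, title) pairs, one per row that has a '/s/' cell."""
    soup = BeautifulSoup(html_text, "html.parser")
    signers = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        for i, cell in enumerate(cells):
            name = signature_name(cell)
            if name is None:
                continue
            # Assumption: the title is the next non-empty cell in the same row.
            later = (clean(c.get_text(" ")) for c in cells[i + 1:])
            title = next((t for t in later if t), None)
            signers.append((name, title))
            break
    return signers


if __name__ == "__main__":
    for sample in (FORM_A, FORM_B):
        print(extract_signers(sample))
```

The asker's XPath returns "/s/" and the name as separate text nodes in the second form, because they sit in different FONT elements, so the name never ends up in the string that contains "/s/". Joining the cell's text before looking for "/s/" sidesteps that. The titleKeywords filter from the question could still be applied to the returned title if needed.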
