Question

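I use Python and regular expressions to pull information out of an HTML document and, contrary to what most people say, it works fine, even though things could go wrong. Anyway, I figured Beautiful Soup would be faster and easier, but I don't really know how to make it do what I did with regular expressions, which was easy enough but messy.

I'm using the HTML of this page: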

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

Edit:

Here is the HTML for the main place:

<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>

An example of the first similar place:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">

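When my program fetches the page and I use Beautiful Soup to make it more readable, the HTML comes out slightly different from Firefox's "view source"... I don't know why.

These are my regular expressions: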

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main)

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main)

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main)

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

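The first two are for the main place and its address. The rest are for the other places' information. After running these, I decided I only want the first 5 results for cNames, cAddresses, and cURLs, because I don't need 91 or however many there are.

I don't know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is kind of complicated because all the information I want is in tables, and the table tags are a mess too...

How do I get that information, and limit it to just the first 5 results?

Thanks.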

Answer 1

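People say you can't parse HTML with regular expressions for a reason, but here is a simple reason that applies to your regular expressions in particular: you have \n and &nbsp; in your regexps, and those can and will change at random on the pages you are trying to parse. When that happens, your regexps will not match and your code will stop working.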
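To make that concrete, here is a minimal sketch (assuming Python 3 and only the standard `re` module) that runs the question's address pattern against the main-place markup twice: once laid out as posted, and once with a hypothetical re-indentation that a server could introduce at any time. The `reindented` variant is an assumption made up for illustration, not something taken from the real page.

```python
import re

# The question's pattern for the main place address, copied verbatim.
pattern = r'width="100%">(.*)</td>\n<td class="Large Bold"'

# Markup laid out exactly as the pattern expects: cells separated by one "\n".
original = (
    '<td class="Large Bold" width="100%">'
    '80 Riverside Drive, New York, New York, United States</td>\n'
    '<td class="Large Bold" nowrap="nowrap" width="55">'
)

# Hypothetical variant: the same cells, but the second one is now indented.
reindented = original.replace('\n', '\n    ')

matches_original = re.findall(pattern, original)
matches_reindented = re.findall(pattern, reindented)

print(matches_original)    # one address found
print(matches_reindented)  # nothing found, although the data is unchanged
```

A browser renders both versions identically, so a change like this can go unnoticed until the scraper silently starts returning empty lists.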

However, the task you are trying to do is really simple:

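# Python 2 with the BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

# soup('a') is shorthand for soup.findAll('a'): every anchor tag in the document
for anchor in soup('a'):
    print anchor.contents, anchor.get('href')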

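This yields all the anchor tags, wherever they appear in the deeply nested structure of this page. Here are lines I excerpted from the output of that short script: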

[u'Stack Exchange'] http://stackexchange.com
[u'msw'] /users/282912/msw
[u'faq'] /faq
[u'Stack Overflow'] /
[u'Questions'] /questions
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
[u'python'] /questions/tagged/python
[u'beautifulsoup'] /questions/tagged/beautifulsoup
[u'Marcus Johnson'] /users/1587751/marcus-johnson

It is hard to imagine less code that could do that much work for you.
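Applied to the data the question actually asks for, the same idea might look like the sketch below. It assumes Python 3 and the `bs4` package (Beautiful Soup 4) rather than the `BeautifulSoup` 3 import used above. The main row and the first similar row are copied from the question; rows 2 to 7, their example.com URLs, the empty third cell, and the helper names `clean` and `extract_places` are hypothetical, added only so there are more than five rows to cut down. It also assumes that every `target="_blank"` link on the page belongs to a similar place, which may not hold for the real page.

```python
from bs4 import BeautifulSoup

MAIN_ROW = """<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>
"""

SIMILAR_ROW = """<tr>
<td class="" nowrap="nowrap"><a href="{url}" target="_blank">{name}</a></td>
<td width="100%">&nbsp;{address}</td>
<td nowrap="nowrap" width="55"></td>
</tr>
"""

# First row is from the question; the remaining six are made-up placeholders.
rows = [(
    "http://www.locationary.com/place/en/US/New_York/New_York/"
    "54_Riverside_Dr_Owners_Corp-p1009633680.jsp",
    "54 Riverside Dr Owners Corp",
    "54 Riverside Dr, New York, New York, United States",
)]
for i in range(2, 8):
    rows.append((
        "http://example.com/place-%d.jsp" % i,
        "Example Place %d" % i,
        "%d Example St, New York, New York, United States" % i,
    ))

html = "<table>\n" + MAIN_ROW + "".join(
    SIMILAR_ROW.format(url=u, name=n, address=a) for u, n, a in rows
) + "</table>"


def clean(text):
    # &nbsp; is parsed into U+00A0; turn it into a plain space, then trim.
    return text.replace("\xa0", " ").strip()


def extract_places(html, limit=5):
    soup = BeautifulSoup(html, "html.parser")

    # Only the main place's cells carry the "Large" class.
    main_cells = soup.find_all("td", class_="Large")
    main_name = clean(main_cells[0].get_text())
    main_address = clean(main_cells[1].get_text())

    similar = []
    # limit=... makes find_all stop after that many matches.
    for anchor in soup.find_all("a", target="_blank", limit=limit):
        # The address sits in the cell right after the one holding the link.
        address_cell = anchor.find_parent("td").find_next_sibling("td")
        similar.append(
            (clean(anchor.get_text()), clean(address_cell.get_text()), anchor["href"])
        )
    return main_name, main_address, similar


main_name, main_address, similar = extract_places(html)
print(main_name)
print(main_address)
for name, address, url in similar:
    print(name, address, url)
```

Nothing here depends on where the newlines or `&nbsp;` entities fall, and the first-five requirement is handled by the `limit` argument instead of slicing three separate lists afterwards.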
