Question

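lxml returns two items, while BeautifulSoup returns only one. Is that because the <br/> shouldn't be there and BeautifulSoup is more tolerant of bad HTML?

Is there a better way to extract the location with lxml? The <br/> is not always present.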

from lxml import html
from bs4 import BeautifulSoup as bs

s = '''<td class="location">
    <p>
    TRACY,<br/>&nbsp;CA&nbsp;95304&nbsp;
    </p></td>
'''

tree = html.fromstring(s)
r = tree.xpath('//td[@class="location"]/p/text()')
print r

soup = bs(s, 'lxml')
r = soup.find_all('td', class_='location')[0].get_text()
print r

Answer 1

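Is there a better way to extract the location with lxml? The <br/> is not always present.

If by "better" you mean returning a result closer to that of its BS counterpart, then an XPath expression that more closely resembles your BS code would be: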

>>> print tree.xpath('string(//td[@class="location"])')


    TRACY, CA 95304

Additionally, if you want the redundant whitespace removed, use normalize-space() instead of string():

>>> print tree.xpath('normalize-space(//td[@class="location"])')
TRACY, CA 95304

Answer 2

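element.get_text() joins the individual string runs; from the documentation:

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string

Emphasis mine.

If you need the individual strings, use the Tag.strings generator: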

>>> list(soup.find_all('td', class_='location')[0].strings)
[u'\n', u'\n    TRACY,', u'\xa0CA\xa095304\xa0\n    ']

If you want lxml to join the text, join the text:

r = ''.join(tree.xpath('//td[@class="location"]/p/text()'))
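
Why text() yields two items in the first place can be illustrated without lxml. The following is only a sketch, assuming Python 3 and the standard library's xml.etree.ElementTree as a stand-in (lxml.etree follows the same text/tail element model); the variable names are illustrative, and &nbsp; is written as a literal U+00A0 because a plain XML parser does not know that entity.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question, with &nbsp; replaced by U+00A0.
snippet = (
    '<td class="location">\n'
    '    <p>\n'
    '    TRACY,<br/>\u00a0CA\u00a095304\u00a0\n'
    '    </p></td>'
)

td = ET.fromstring(snippet)
p = td.find('p')
br = p.find('br')

before_br = p.text   # text node before <br/>, stored on <p> itself
after_br = br.tail   # text node after <br/>, stored as the tail of <br/>

# itertext() walks every text node under <p> in document order,
# so joining it gives one string whether or not <br/> is present.
joined = ''.join(p.itertext())

print(repr(before_br))
print(repr(after_br))
print(repr(joined))
```

The <br/> element splits the paragraph's text into two nodes, which is what the /p/text() expression returns; joining them gives a single string either way.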

The string() XPath function can do the same for the <td> tag:

r = tree.xpath('string(//td[@class="location"])')

Demo:

>>> ''.join(tree.xpath('//td[@class="location"]/p/text()'))
u'\n    TRACY,\xa0CA\xa095304\xa0\n    '
>>> tree.xpath('string(//td[@class="location"])')
u'\n    \n    TRACY,\xa0CA\xa095304\xa0\n    '

I'd use str.strip() on either result:

>>> tree.xpath('string(//td[@class="location"])').strip()
u'TRACY,\xa0CA\xa095304'
>>> print tree.xpath('string(//td[@class="location"])').strip()
TRACY, CA 95304

Or use the normalize-space() XPath function:

>>> tree.xpath('normalize-space(string(//td[@class="location"]))')
u'TRACY,\xa0CA\xa095304\xa0'

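Note that str.strip() removes the trailing non-breaking \xa0 space, while normalize-space() leaves it in place: XPath only treats space, tab, carriage return and line feed as whitespace.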
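
The difference can be reproduced in plain Python. This is a rough sketch assuming Python 3; xpath_normalize_space is a hypothetical helper that only emulates the XPath 1.0 whitespace rule, it is not part of lxml.

```python
import re

def xpath_normalize_space(text):
    """Emulate XPath 1.0 normalize-space(): only space, tab, CR and LF
    count as whitespace, so U+00A0 is left untouched."""
    return re.sub(r'[ \t\r\n]+', ' ', text).strip(' \t\r\n')

# The string() result from the demo above.
raw = '\n    \n    TRACY,\u00a0CA\u00a095304\u00a0\n    '

stripped = raw.strip()                   # Unicode-aware: drops the trailing U+00A0
normalized = xpath_normalize_space(raw)  # keeps the trailing U+00A0
cleaned = ' '.join(raw.split())          # also turns inner U+00A0 into plain spaces

print(repr(stripped))
print(repr(normalized))
print(repr(cleaned))
```

If the non-breaking spaces inside the address should become ordinary spaces too, the split/join variant on the last line is one option.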

lxml separates elements, while BeautifulSoup doesn't
