Question

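I want to collect Japanese articles from Google search results. I am trying to extract Japanese sentences, and I run the following code to get the text of the tag that contains the most Japanese words.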

texts = mostTag.xpath('<<path>>/text()').extract()
text = ''.join(texts)

But when I run this code, the extracted sentence has whitespace at its head. For example, if the HTML is as follows and the path is '//p',

<p class dir='sample'>
    <span>
        <a role='button' tabindex='0' style='white-space: normal;'>A
        B</a>
        <span> </span>
    </span>
</p>

my sentence comes out as follows.

A
B

I tried to remove this whitespace with the 'text.strip()' method, but it is still there.

How can I get "AB" from this HTML? Or how can I remove the whitespace? I would appreciate it if someone could tell me how to get "AB".

Answer 1

This can be done with a regular expression, where s is the extracted text:

>>> import re
>>> re.sub(r'\n\s+', '', s)
'AB'
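
To see why strip() alone is not enough, here is a small sketch that does not need Scrapy. The `texts` list is a hypothetical stand-in for what `.extract()` might return for the sample HTML; the exact whitespace in each node is an assumption. strip() only trims the two ends of the string, so the newline and indentation between A and B survive it.

```python
import re

# Hypothetical text nodes, roughly what .extract() might return for the
# sample HTML. The exact whitespace in each node is an assumption.
texts = ['\n    ', '\n        ', 'A\n        B', '\n        ', ' ', '\n    ', '\n']

text = ''.join(texts)

# strip() removes only leading and trailing whitespace, so the newline
# and indentation between A and B are still there.
stripped = text.strip()

# The regex from this answer: drop each newline together with the
# whitespace that follows it.
by_regex = re.sub(r'\n\s+', '', stripped)

# Alternative without a regex: split on any run of whitespace and rejoin.
# This removes ALL whitespace, including spaces between words.
by_split = ''.join(text.split())

print(repr(stripped))
print(repr(by_regex))
print(repr(by_split))
```

The split/join variant is probably fine for Japanese text, which normally has no spaces between words, but it would also glue together any English words in the sentence, so the regex is the safer choice for mixed text.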
