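Scrapy / XPath: Replace inline tags in a paragraph

Date: 2018-06-28 16:38:32

Tags: xpath scrapy

I'm trying to use Scrapy to extract and clean up some text from a p element that contains embedded icons and other tags. In particular, I'd like to replace each image tag with text extracted from the image's src attribute: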

from scrapy.selector import Selector
text = '''
<p id="1"><b><br></b>For service <i>to </i>these stations, take the <img src="images/1.png"> to 72 St or Times Sq-42 St and transfer
    <br>to an uptown <img src="images/1.png"> or <img src="images/2.png"> <i>local</i>.
    <br>
    <br>For service <i>from </i>these stations, take the <img src="images/1.png"> or <img src="images/2.png"> to 72 St or 96 St and transfer
    <br>to a South Ferry-bound <img src="images/1.png">.
    <br><b>______________________________<br></b>
</p>
'''
sel = Selector(text=text)
# do stuff

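The result I'm looking for is the string:

  For service to these stations, take the (1) to 72 St or Times Sq-42 St and transfer to an uptown (1) or (2) local. For service from these stations, take the (1) or (2) to 72 St or 96 St and transfer to a South Ferry-bound (1).

I can extract the text from src using: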

node.css('img').xpath('@src').re_first(r'images/(.+).png')

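But I'm stuck on how to iterate over the child nodes and determine whether each one is a text node, and on how to filter out the other inline tags. Here's where I am so far: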

description = sel.css('p#1')

def clean_html(description):
    for n in description.xpath('node()'):
        if (n.xpath('self::img')):
            yield n.xpath('@src').re_first(r'images/(.+).png')
        if (n.xpath('self::text()')):
            yield n.css('::text')

text = ''.join(clean_html(description))

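2 answers:

Answer 0 (score: 1)

In this case, I don't think selectors are particularly useful.

Try processing it in two stages.

  1. Use re.sub to replace each whole img tag with the string you want.
  2. Use BeautifulSoup to strip the remaining HTML from the resulting string.

Like this: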

from scrapy.selector import Selector
import re
from bs4 import BeautifulSoup

# manually construct a selector for demonstration purposes
DATA = '''
<p id="1"><b><br></b>For service <i>to </i>these stations, take the <img src="images/1.png"> to 72 St or Times Sq-42 St and transfer
    <br>to an uptown <img src="images/1.png"> or <img src="images/2.png"> <i>local</i>.
    <br>
    <br>For service <i>from </i>these stations, take the <img src="images/1.png"> or <img src="images/2.png"> to 72 St or 96 St and transfer
    <br>to a South Ferry-bound <img src="images/1.png">.
    <br><b>______________________________<br></b>
</p>
'''
sel = Selector(text=DATA)

# get the raw source string to work with
text = sel.extract()

# replace image tag with text from extracted file name
# raw string with an escaped dot, so only a literal ".png" matches
image_regex = re.compile(r'<img src="images/(.+?)\.png">')
replaced = image_regex.sub(r'(\1)', text)

# remove html and return clean text
soup = BeautifulSoup(replaced, 'lxml')
print(soup.get_text())

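Result:

  For service to these stations, take the (1) to 72 St or Times Sq-42 St and transfer
      to an uptown (1) or (2) local.

      For service from these stations, take the (1) or (2) to 72 St or 96 St and transfer
      to a South Ferry-bound (1).
      ______________________________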
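
The text printed by this approach still carries the paragraph's line breaks, indentation, and the row of underscores, so it is not yet the single string the question asks for. Below is a minimal sketch of one way to tidy it up, assuming the underscore run is purely decorative. The normalize helper and the shortened raw sample are hypothetical names introduced here for illustration and are not part of the answer.

```python
import re


def normalize(text):
    """Collapse whitespace runs and drop decorative underscore rules."""
    # Assumption: a run of three or more underscores is only a visual rule.
    text = re.sub(r'_{3,}', ' ', text)
    # str.split() with no arguments splits on any whitespace run,
    # including newlines and indentation, and discards empty pieces.
    return ' '.join(text.split())


# Hand-written stand-in for the kind of string soup.get_text() returns.
raw = (
    "\nFor service to these stations, take the (1) to 72 St and transfer\n"
    "    to an uptown (1) or (2) local.\n"
    "    \n"
    "    ______________________________\n"
)

cleaned = normalize(raw)
print(cleaned)
```

In the answer's script, the same helper could be applied as normalize(soup.get_text()).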

Answer 1 (score: 1)

Here is how I would do it without any additional external libraries:

  1. Get the text nodes and image paths:

    results = selector.xpath('.//text()|.//img/@src').extract()

  2. Strip extra spaces, newlines, and underscores:

    results = map(lambda x: x.strip('\n_ '), results)

  3. Remove empty strings:

    results = filter(None, results)

  4. Join the results into a single paragraph and fix the spacing before periods:

    raw_paragraph = " ".join(results).replace(' .', '.')

  5. Replace images/{Number}.png with ({Number}) (this step needs import re):

    paragraph = re.sub(r'images/(?P<number>\d+)\.png', r'(\g<number>)', raw_paragraph)

Result: For service to these stations, take the (1) to 72 St or Times Sq-42 St and transfer to an uptown (1) or (2) local. For service from these stations, take the (1) or (2) to 72 St or 96 St and transfer to a South Ferry-bound (1).
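
The question's original idea of walking the nodes in document order and swapping each img for text can also be sketched in a single pass. To keep the example self-contained, this version swaps in the standard library's html.parser.HTMLParser in place of Scrapy selectors. InlineTagCleaner, clean_html, and the sample string are hypothetical names used only for illustration. The stripping and joining rules mirror steps 2 to 4 above.

```python
import re
from html.parser import HTMLParser


class InlineTagCleaner(HTMLParser):
    """Collect text nodes in order, replacing each <img> with "(name)"."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only <img> contributes text; other inline tags are simply skipped.
        if tag == 'img':
            src = dict(attrs).get('src') or ''
            match = re.search(r'images/(.+)\.png$', src)
            if match:
                self.parts.append('(%s)' % match.group(1))

    def handle_data(self, data):
        # Same characters as step 2: newlines, underscores, spaces.
        piece = data.strip('\n_ ')
        if piece:
            self.parts.append(piece)


def clean_html(html):
    parser = InlineTagCleaner()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Joining with spaces leaves " ." before periods, as in step 4.
    return ' '.join(parser.parts).replace(' .', '.')


sample = '<p>Take the <img src="images/1.png"> or <img src="images/2.png"> <i>local</i>.</p>'
print(clean_html(sample))
```

Because the replacement happens while the tags are visited, no second regex pass over the joined paragraph is needed. The trade-off is that the parser knows nothing about which p element to target, so the caller has to pass in only the fragment of interest.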