Question

Here is a mock-up of the document I'm working with:

<div>
<h4>Area</h4>
  <span class="aclass"> </span>
  <span class="bclass">
        <strong>Address:</strong>
  10 Downing Street

  London

  SW1
  </span>
</div>

I'm getting the address like so:

response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()").extract()

which returns

[u'\r\n \t', u'\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n ']

I'm trying to clean it up using normalize-space. I've tried putting it in every position I can think of, but it either reports a syntax error or returns an empty string.

Update: I'm trying to get this working without changing the selector. For example, I have similar cases without the <strong> tag. The selector in the example I've prepared here is overly complex, but in the live version I have to take a fairly convoluted path to reach the address.

Regarding the possible duplicate: following the suggestion there, I've added

/normalize-space(.)

to give:

response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()/normalize-space(.)").extract()

This produces an error.

Answer 1

You can locate the strong element, take the text sibling that follows it, and normalize that:

In [1]: response.xpath(u"normalize-space(.//strong[. = 'Address:']/following-sibling::text())").extract()
Out[1]: [u'10 Downing Street London SW1']
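A likely reason the asker's attempts returned an empty string: in XPath 1.0, when normalize-space() is given a node-set, only the first node in document order is converted to a string. The span's first text node is pure whitespace, whereas the text node that follows the strong element holds the address. The sketch below imitates that rule in plain Python; `xpath_normalize_space` is a hypothetical helper for illustration, not a Scrapy or lxml function.

```python
import re

def xpath_normalize_space(nodes):
    """Imitate XPath 1.0 normalize-space() applied to a node-set (illustration only)."""
    # A node-set is converted to a string by taking its FIRST node only.
    first = nodes[0] if nodes else ""
    # XPath whitespace is limited to space, tab, CR and LF.
    return re.sub(r"[ \t\r\n]+", " ", first).strip(" ")

# Text nodes of the span, as returned by extract() in the question.
span_text_nodes = ["\r\n \t", "\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n "]

# normalize-space(.../span/text()) sees the whitespace-only node first.
empty_result = xpath_normalize_space(span_text_nodes)

# normalize-space(.//strong/following-sibling::text()) starts at the address node.
address = xpath_normalize_space(span_text_nodes[1:])
print(repr(empty_result), repr(address))
```

This is why anchoring on the strong element works, while wrapping the original `/text()` selection in normalize-space does not.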

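Alternatively, you can look into Item Loaders and their input and output processors. I often use Join(), TakeFirst() and MapCompose(unicode.strip) to clean extra newlines and spaces out of the extracted data.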
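To make the effect of those processors concrete without requiring Scrapy, here is a rough sketch using plain-Python stand-ins. `map_strip` and `take_first` are hypothetical helpers that only approximate MapCompose(str.strip) and TakeFirst() (Python 3 uses `str.strip` where the answer writes `unicode.strip`); they are not the library implementations.

```python
def map_strip(values):
    # Stand-in for MapCompose(str.strip): strip each extracted value.
    return [v.strip() for v in values]

def take_first(values):
    # Stand-in for TakeFirst(): first value that is neither None nor empty.
    for v in values:
        if v is not None and v != "":
            return v
    return None

# Values extracted by the selector in the question.
raw = ["\r\n \t", "\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n "]

first = take_first(map_strip(raw))
# strip() only trims the ends, so the inner line breaks are still present.
address = " ".join(first.split())
print(repr(first))
print(repr(address))
```

Note that stripping alone leaves the line breaks between the address parts, so a whitespace-collapsing step is still needed if a single-line address is wanted.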

Answer 2

"normalize-space(//strong[contains(text(), 'Address:')]/following-sibling::node())"

Answer 3

Since you are using Scrapy, you can keep the XPath simple and do the cleanup with a Python one-liner:

" ".join(s.split()) # where `s` is your string
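As a quick illustration, the one-liner can be applied to the list that extract() returned in the question, dropping values that end up empty. The variable names are made up for this sketch. One difference from XPath's normalize-space is worth keeping in mind: `str.split()` with no arguments splits on any Unicode whitespace, so a non-breaking space is collapsed too.

```python
def normalize_space(value):
    # Collapse every run of whitespace to one space and trim both ends.
    return " ".join(value.split())

# Values extracted by the selector in the question.
raw = ["\r\n \t", "\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n "]

# Clean each value, then discard the ones that were whitespace only.
cleaned = [normalize_space(v) for v in raw]
cleaned = [v for v in cleaned if v]
address = " ".join(cleaned)

# Unlike XPath normalize-space, this also treats U+00A0 as whitespace.
nbsp_example = normalize_space("SW1\u00a0 2AA")
print(repr(address), repr(nbsp_example))
```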

With that, you can leave normalize-space out of the XPath expression and instead build a reusable cleaning function with Scrapy input processors, like so:

import scrapy
from scrapy.loader.processors import MapCompose
from w3lib.html import remove_tags

def normalize_space(value):
    return " ".join(value.split())

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags, normalize_space),
    )

Or you can use the Python expression directly in a Scrapy Item Loader, like so:

import scrapy
from scrapy.loader import ItemLoader
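from scrapy.loader.processors import MapCompose

class ProductLoader(ItemLoader):
    # MapCompose calls the function on each extracted value in turn;
    # Compose would pass the whole list, which has no split() method.
    name_in = MapCompose(lambda s: " ".join(s.split()))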

Credit for the one-liner goes to Tom's answer on a related question.

Using normalize-space with Scrapy

3 answers: