In Scrapy, how can I extract two groups from a regular expression into two different fields?

Time: 2017-11-04 20:50:22

Tags: python scrapy

I am writing a spider, trulia, to scrape the pages of properties for sale on Trulia.com, such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found at https://github.com/khpeek/trulia-scraper.

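I am using Item Loaders and calling the add_xpath method with the re keyword argument to specify the regular expression to extract. In the example in the documentation, the regular expression has only one group, and there is only one field to extract to.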

However, I would actually like to define two groups and extract them into two separate Scrapy fields. Here is an excerpt from the parse_property_page method:

def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)

    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')

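Notice how the lot_size field extracts two groups: one for the number and one for the units, which can be either 'acres' or 'sqft'. If I run this parse method using the command
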
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page

then I get the following scraped item:

# Scraped Items  ------------------------------------------------------------
[{'address': '1860 Lombard St',
  'area': 2524.0,
  'city_state': 'San Francisco, CA 94123',
  'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
  'description': ['Outstanding investment opportunity to own this light-fixer '
                  'mixed use Marina 2-unit property w/established income and '
                  'not on liquefaction. The first floor of this building '
                  'houses a commercial business currently leased to Jigalin '
                  'Fitness until 2018. The second floor presents a 2bed/1bath '
                  'apartment fully outfitted in a contemporary design w/full '
                  'kitchen, 10ft high ceilings & laundry area. The apartment '
                  'will be delivered vacant. The structure has undergone '
                  'renovation & features concrete perimeter foundation, '
                  'reinforced walls, ADA compliant commercial restroom, '
                  'electrical updates & rolling door. This property makes an '
                  "ideal investment with instant cash flow. Don't let this "
                  'pass you by. As-Is sale.'],
  'events': ['Sold', 'Sold', 'Sold'],
  'listing_information': ['2 Bedrooms', 'Multi-Family'],
  'listing_information_date_updated': '11/03/2017',
  'lot_size': ['1620', 'sqft'],
  'neighborhood': 'Marina',
  'overview': ['Multi-Family',
               '2 Beds',
               'Built in 1908',
               '1 days on Trulia',
               '1620 sqft lot size',
               '2,524 sqft',
               '$711/sqft'],
  'prices': ['$850,000', '$1,350,000', '$1,200,000'],
  'public_records': ['1 Bathroom',
                     'Multi-Family',
                     '1,296 Square Feet',
                     'Lot Size: 1,620 sqft'],
  'public_records_date_updated': '07/01/2017',
  'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]

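where the lot_size field is a list containing the number and the units. However, I would ideally like to extract the units (acres or sqft) into a separate field, lot_size_units. I could do this by first loading the item and then doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into different fields?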

(I have read through the get_value method at https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but it has not shown me a way to do this, if there is one.)
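For reference, the manual post-processing mentioned in the question might look roughly like the sketch below. The helper name split_lot_size is hypothetical, and a plain dict stands in for the loaded item; with a real Scrapy Item, a lot_size_units field would have to be declared on TruliaItem first.

```python
def split_lot_size(item):
    """Move the unit out of a two-element 'lot_size' list into 'lot_size_units'.

    Hypothetical helper: 'item' is any dict-like object, such as the result
    of loader.load_item(). Items without a two-element list are left as-is.
    """
    value = item.get('lot_size')
    if isinstance(value, (list, tuple)) and len(value) == 2:
        number, units = value
        item['lot_size'] = number          # e.g. '1620'
        item['lot_size_units'] = units     # e.g. 'sqft'
    return item


# A plain dict stands in for the scraped item shown in the question.
item = {'lot_size': ['1620', 'sqft'], 'neighborhood': 'Marina'}
print(split_lot_size(item))
```

This works, but it keeps the splitting logic outside the loader, which is what the question is trying to avoid.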

1 answer:

Answer 0: (score: 1)

You can try this, using a non-capturing group (?:...) to ignore one group in each call:

overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
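
To see why the non-capturing groups help, here is a small sketch using only the standard re module on the overview strings from the question. The extract helper is hypothetical and only approximates how captured groups end up flattened into a single list; it is not Scrapy's actual code.

```python
import re

# The 'overview' values from the scraped item shown in the question.
overview = ['Multi-Family', '2 Beds', 'Built in 1908', '1 days on Trulia',
            '1620 sqft lot size', '2,524 sqft', '$711/sqft']

BOTH = r'([\d,]+) (acres|sqft) lot size$'          # two capturing groups
SIZE_ONLY = r'([\d,]+) (?:acres|sqft) lot size$'   # units not captured
UNITS_ONLY = r'(?:[\d,]+) (acres|sqft) lot size$'  # number not captured


def extract(pattern, texts):
    """Collect every captured group from every text, flattened into one list."""
    results = []
    for text in texts:
        for match in re.findall(pattern, text):
            # findall yields tuples for 2+ groups and plain strings for 1 group.
            results.extend(match if isinstance(match, tuple) else [match])
    return results


print(extract(BOTH, overview))        # ['1620', 'sqft']
print(extract(SIZE_ONLY, overview))   # ['1620']
print(extract(UNITS_ONLY, overview))  # ['sqft']
```

Each pattern still matches the same text, but only the capturing group contributes to the result, so each add_xpath call fills its field with a single value.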