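I'm writing a spider, trulia, to scrape 'for sale' property pages on Trulia.com such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found at https://github.com/khpeek/trulia-scraper.
I'm using Item Loaders and calling the add_xpath method with the re keyword argument to specify the regular expressions to extract. In the examples in the documentation, the regular expression contains just one group, and there is one field to extract it to.
However, I'd actually like to define two groups and extract them into two separate Scrapy fields. Here is an excerpt from the parse_property_page method: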
def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)
    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')
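Note how the lot_size field has two groups extracted: one for the number and one for the units, which can be either 'acres' or 'sqft'. If I run this parse method using the command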
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page
then I get the following scraped items:
# Scraped Items ------------------------------------------------------------
[{'address': '1860 Lombard St',
  'area': 2524.0,
  'city_state': 'San Francisco, CA 94123',
  'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
  'description': ['Outstanding investment opportunity to own this light-fixer '
                  'mixed use Marina 2-unit property w/established income and '
                  'not on liquefaction. The first floor of this building '
                  'houses a commercial business currently leased to Jigalin '
                  'Fitness until 2018. The second floor presents a 2bed/1bath '
                  'apartment fully outfitted in a contemporary design w/full '
                  'kitchen, 10ft high ceilings & laundry area. The apartment '
                  'will be delivered vacant. The structure has undergone '
                  'renovation & features concrete perimeter foundation, '
                  'reinforced walls, ADA compliant commercial restroom, '
                  'electrical updates & rolling door. This property makes an '
                  "ideal investment with instant cash flow. Don't let this "
                  'pass you by. As-Is sale.'],
  'events': ['Sold', 'Sold', 'Sold'],
  'listing_information': ['2 Bedrooms', 'Multi-Family'],
  'listing_information_date_updated': '11/03/2017',
  'lot_size': ['1620', 'sqft'],
  'neighborhood': 'Marina',
  'overview': ['Multi-Family',
               '2 Beds',
               'Built in 1908',
               '1 days on Trulia',
               '1620 sqft lot size',
               '2,524 sqft',
               '$711/sqft'],
  'prices': ['$850,000', '$1,350,000', '$1,200,000'],
  'public_records': ['1 Bathroom',
                     'Multi-Family',
                     '1,296 Square Feet',
                     'Lot Size: 1,620 sqft'],
  'public_records_date_updated': '07/01/2017',
  'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]
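where the lot_size field is a list containing the number and the units. Ideally, however, I'd like to extract the units (acres or sqft) into a separate field, lot_size_units. I could do this by first loading the item and then doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into different fields?
(I've perused the get_value method at https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but it hasn't shown me a way to do this, if there is one.)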
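For reference, the manual post-processing mentioned above might look roughly like the sketch below. It assumes the loaded item behaves like a dict and that lot_size has the two-element [number, unit] shape shown in the output; split_lot_size is a hypothetical helper name, and with a real Scrapy Item the lot_size_units field would also have to be declared on the item class.

```python
def split_lot_size(item):
    """Hypothetical helper: move the unit out of 'lot_size' into 'lot_size_units'."""
    value = item.get('lot_size')
    # Only act on the two-element [number, unit] shape shown in the output above.
    if isinstance(value, (list, tuple)) and len(value) == 2:
        number, units = value
        item['lot_size'] = number
        item['lot_size_units'] = units
    return item


# A plain dict stands in for the loaded item here (an assumption for illustration).
item = split_lot_size({'lot_size': ['1620', 'sqft']})
print(item)
```

This works, but it keeps the splitting logic outside the loader, which is what I'd like to avoid.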
Answer 0 (score: 1)
You can try this, making one group non-capturing in each call:
overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
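To see why this works, the two patterns can be checked outside Scrapy with the standard re module. This is only a sketch: the sample strings are copied from the overview field in the question, captured_groups is a hypothetical helper, and re.findall is used as a stand-in on the assumption that it approximates how the loader collects captured groups.

```python
import re

# Sample strings copied from the 'overview' field in the question's output.
overview = [
    'Multi-Family', '2 Beds', 'Built in 1908', '1 days on Trulia',
    '1620 sqft lot size', '2,524 sqft', '$711/sqft',
]

# (?:...) is a non-capturing group: it must match, but it is not returned.
NUMBER_RE = r'([\d,]+) (?:acres|sqft) lot size$'
UNITS_RE = r'(?:[\d,]+) (acres|sqft) lot size$'


def captured_groups(pattern, strings):
    """Hypothetical helper: collect the captured group from every matching string."""
    results = []
    for s in strings:
        results.extend(re.findall(pattern, s))
    return results


lot_size = captured_groups(NUMBER_RE, overview)
lot_size_units = captured_groups(UNITS_RE, overview)
print(lot_size, lot_size_units)
```

Each pattern still requires the full "<number> <unit> lot size" text to match, so '2,524 sqft' and '$711/sqft' are skipped, but only the one capturing group ends up in each field.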