Scrapy on TripAdvisor, crawling reviews: extracting more hotel and user information

Time: 2015-06-30 22:36:03

Tags: python scrapy scrapy-spider

I need to extract more information from TripAdvisor.

My code:

item = TripadvisorItem()

item['url'] = response.url.encode('ascii', errors='ignore')

# State: try the breadcrumb first, then fall back to the address block.
# The empty check has to happen before indexing, as is done for the city below.
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()
if item['state'] == []:
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['state'] = item['state'][0].encode('ascii', errors='ignore')

# City: breadcrumb first, then two possible positions in the address block.
item['city'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')

item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')

# One selector per review block on the page.
reviews = hxs.xpath('.//div[contains(@id, "review")]')

  1. Every hotel on TripAdvisor has an ID number, for example 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS

     How can I extract this ID from the TA item?

  2. I need more information for each hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these?

  3. I need to extract the traveller type for each review. How? My review code:

    for review in reviews:
        it = Review()

        # Copy the hotel-level fields onto every review.
        it['state'] = item['state']
        it['city'] = item['city']
        it['hotelName'] = item['hotelName']

        # Date: prefer the title attribute, fall back to the visible text.
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
        if it['date'] == []:
            it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
        if it['date'] != []:
            it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()

        it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
        if it['userName'] != []:
            it['userName'] = it['userName'][0].encode('ascii', errors='ignore')

        it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')

        # Title: try the "quote" div first, then the "noQuotes" span.
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
        if it['reviewTitle'] == []:
            it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if it['reviewTitle'] != []:
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')

        it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
        if it['reviewContent'] != []:
            it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()

        # The rating image's alt text starts with the score, so keep the first token.
        it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
        if it['generalRating'] != []:
            it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]

  4. Is there a good manual on how to find these things? I get lost in all the spans and divs.

Thanks!

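2 answers:

Answer 0: (score: 2)

I would try to do this in pure XPath. Unfortunately, it looks like most of the information you want is contained in <script> tags:

Hotel ID - returns "80075"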

substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")

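Alternatively, the hotel ID is in the URL, as the other answerer mentioned. If you are sure the format is always the same (for example, the ID is always preceded by a "d"), you can use that.

Rating (the one at the top) - returns "3.5"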

//span[contains(@class, "rating_rr")]/img/@content

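There are several rating instances on this page. The main rating at the top is the one I grabbed here. I have not tested this in Scrapy, so it may be inserted by JavaScript rather than loaded as part of the initial HTML. If that is the case, you will need to get it from somewhere else or use something like Selenium / PhantomJS.

Zip code - returns "10019"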

(//span[@property="v:postal-code"]/text())[1]

Again, same deal as above. It is in the HTML, but you should check that it is present when the page first loads.

Country - returns ""United States""

substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")

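This one comes back with the quotes attached. You can always (and should) use a pipeline to clean up the scraped data so it looks the way you want.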
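The cleanup step could be sketched as a small item pipeline. This is only an illustration: the class name `StripQuotesPipeline` is made up, and it assumes the item behaves like a dict whose string fields may carry stray quotes. Scrapy calls `process_item(item, spider)` on every pipeline listed in the `ITEM_PIPELINES` setting.

```python
class StripQuotesPipeline(object):
    """Hypothetical pipeline: trims whitespace and wrapping quotes from string fields."""

    def process_item(self, item, spider):
        # Copy the items first so fields can be reassigned while looping.
        for field, value in list(item.items()):
            if isinstance(value, str):
                # Remove surrounding whitespace, then leading/trailing quote characters.
                item[field] = value.strip().strip('"\'')
        return item


if __name__ == "__main__":
    # A plain dict stands in for a Scrapy item here.
    cleaned = StripQuotesPipeline().process_item({"country": '"United States"'}, spider=None)
    print(cleaned["country"])
```

Keeping this in a pipeline rather than in the spider means the XPath expressions stay simple and every field gets the same treatment.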

Coordinates - return "40.76174" and "-73.985275" respectively

Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
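If the nested substring-before/substring-after calls get hard to read, the same `key: value,` lookup can be done with a regular expression once you have the script text. The sketch below is an assumption-heavy illustration: `SAMPLE_SCRIPT` is an invented stand-in for the real inline script (only the key names and values come from the XPaths above), and `script_value` is a hypothetical helper. Scrapy selectors also expose a `.re()` method if you prefer to apply the pattern directly on a selector.

```python
import re

# Invented stand-in for the inline <script> body; the real layout may differ.
SAMPLE_SCRIPT = """
var mapConfig = {
  geoId: 60763,
  locId: 80075,
  lat: 40.76174,
  lng: -73.985275,
  zoom: 15
};
"""


def script_value(script_text, key):
    """Return the raw text after 'key:' up to the next comma, space or brace, or None."""
    pattern = r"\b" + re.escape(key) + r"\s*:\s*([^,\s}]+)"
    match = re.search(pattern, script_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for key in ("locId", "lat", "lng"):
        print(key, script_value(SAMPLE_SCRIPT, key))
```

The helper returns strings, so convert to `int` or `float` where you need numbers.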

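I am not entirely sure where the short description lives on this page, so I did not include it. You may have to navigate elsewhere to get it. I am also not 100% sure what "traveller type" means, so I will leave that one to you.

As for a manual, it really comes down to practice. You will pick up tricks and hacks for working in XPath, and Scrapy lets you use some additional features such as regular expressions and pipelines. I do not recommend "absolute path" XPaths (i.e. ./div/div[3]/div[2]/ul/li[3]/...), because any deviation in the DOM along that path will completely break your scrape. If you have a lot of data to scrape and plan to keep the scraper running for a while, your project will quickly become unmanageable the moment any site moves a single <div>.

I recommend more "query-like" XPaths, such as //div[contains(@class, "foo")]//a[contains(@href, "detailID")]. Paths like this ensure that no matter how many elements are placed between the ones you know about, and even if several target elements differ slightly from one another, you can still grab them consistently.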
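To see why the query style holds up better, here is a small sketch using only Python's standard library. The two HTML snippets are invented, and because `xml.etree.ElementTree` supports only a subset of XPath (no `contains()`), the class test is an exact match and the `detailID` check is done in Python; in Scrapy you would use the full expression above.

```python
import xml.etree.ElementTree as ET

# Invented markup: the same link before and after a layout change adds wrappers.
BEFORE = '<html><body><div class="foo"><a href="/x?detailID=1">one</a></div></body></html>'
AFTER = ('<html><body><div><div class="foo"><p>'
         '<a href="/x?detailID=1">one</a></p></div></div></body></html>')

ABSOLUTE = "./body/div/a"            # depends on the exact nesting
QUERY = ".//div[@class='foo']//a"    # depends only on the landmarks


def detail_links(html, path):
    """Return hrefs containing 'detailID' for the elements matched by path."""
    root = ET.fromstring(html)
    return [a.get("href") for a in root.findall(path)
            if "detailID" in a.get("href", "")]


if __name__ == "__main__":
    print(detail_links(BEFORE, ABSOLUTE), detail_links(AFTER, ABSOLUTE))
    print(detail_links(BEFORE, QUERY), detail_links(AFTER, QUERY))
```

The absolute path stops matching as soon as the extra wrappers appear, while the query path finds the link in both versions.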

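There is a lot of trial and error with XPaths. A lot. Here are some tools that help me:

  • XPath Helper (Chrome extension)
  • scrapy shell <URL>
  • scrapy view <URL> (for rendering Scrapy's response in a browser)
  • PhantomJS (if you are interested in getting data that is inserted via JavaScript)

Hope some of this helps.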

Answer 1: (score: 0)

Would it be acceptable to get it from the URL with a regular expression?
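A minimal sketch of that idea, assuming the ID always appears in the URL as `-d<digits>-` (as it does in the example URL from the question); the function name `hotel_id_from_url` is made up for illustration:

```python
import re

URL = ("http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-"
       "Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS")


def hotel_id_from_url(url):
    """Return the digits after '-d' in the URL, or None if the pattern is absent."""
    match = re.search(r"-d(\d+)-", url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(hotel_id_from_url(URL))
```

In the spider this could be applied to `response.url`. Returning `None` instead of raising lets the caller decide what to do when the URL format changes.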
