Removing duplicates from a 2D array in Perl

Time: 2019-01-18 05:22:58

Tags: perl

I have a 2D array in Perl whose data comes from a database as rows of HTML, like this:

<tr><td>Rafa</td><td>Nadal</td><td>Data1</td></tr>,
<tr><td>Goran</td><td>Ivan</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>

I want to remove the duplicate rows from the array. In the case above, the last row (the repeated Leander/Paes row) should be removed. The code I tried did not work.

2 Answers:

Answer 0 (score: 3)

First: you should really try not to use outdated Perl syntax and side effects.

Second: the answer depends on the data structure you generate from your input. Here are two example implementations:

#!/usr/bin/perl
use strict;
use warnings;

# 2D Array: list of array references
my @data = (
    ['Rafa', 'Nadal', 'Data1'],
    ['Goran', 'Ivan', 'Data2'],
    ['Leander', 'Paes', 'Data2'],
    ['Leander', 'Paes', 'Data2'],
);
my %seen;

foreach my $unique (
    grep {
        not $seen{
            join('', @{ $_ })
        }++
    } @data
) {
    print join(',', @{ $unique }), "\n";
}
print "\n";

# List of "objects", keys are table column names
@data = (
    { first => 'Rafa',    last => 'Nadal', data => 'Data1' },
    { first => 'Goran',   last => 'Ivan',  data => 'Data2' },
    { first => 'Leander', last => 'Paes',  data => 'Data2' },
    { first => 'Leander', last => 'Paes',  data => 'Data2' },
);
%seen = ();

my @key_order = qw(first last data);

foreach my $unique (
    grep {
        not $seen{ 
            join('', @{ $_ }{ @key_order } )
        }++
    } @data
) {
    print join(',', @{ $unique }{ @key_order }), "\n";
}

Output:

$ perl dummy.pl
Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2

Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2
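Since the question's rows arrive as raw HTML strings, it is worth noting that the same %seen idiom deduplicates plain strings directly, with no need to split them into fields first. A minimal sketch (the row strings are taken from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rows exactly as they come back from the database, as plain strings
my @rows = (
    '<tr><td>Rafa</td><td>Nadal</td><td>Data1</td></tr>',
    '<tr><td>Goran</td><td>Ivan</td><td>Data2</td></tr>',
    '<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>',
    '<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>',
);

# A row is a duplicate if the exact same string was seen before
my %seen;
my @unique = grep { not $seen{$_}++ } @rows;

print "$_\n" for @unique;    # the repeated Leander/Paes row is gone
```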

Answer 1 (score: 3)

The subroutine shown below is well suited for this job, for an array whose elements are array references. That is indeed a basic way to organize 2D data, where your rows are arrayrefs.

There are modules that can be used for this, but the good old approach below works just fine as well:

use warnings;
use strict;    
use Data::Dump qw(dd);

sub uniq_arys {
    my %seen; 
    grep { not $seen{join $;, @$_}++ } @_; 
} 

my @data = ( 
    [ qw(one two three) ],  
    [ qw(ten eleven twelve) ],  
    [ qw(10 11 12) ],  
    [ qw(ten eleven twelve) ],  
); 

my @data_uniq = uniq_arys(@data); 

dd \@data_uniq;

Displaying the data with Data::Dump shows that it prints as expected (the last row is gone).

The sub works by joining each array into a single string, then using a hash to check those strings for duplicates. $; is Perl's subscript separator; an empty string '' could be used instead.

This approach creates a good amount of auxiliary data, in principle doubling the data, so if performance becomes an issue it may be better to compare element by element (at the cost of extra complexity). That should only matter for rather large data sets.
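One small caveat on the choice of separator (my illustration, not from the answer): with join '' two distinct rows can collapse into the same key, which is exactly what $; (or any character that cannot occur in the data) prevents:

```perl
use strict;
use warnings;

# With an empty-string join these two distinct rows both produce the
# key "abc" and would wrongly count as duplicates of each other:
my @row1 = ('ab', 'c');
my @row2 = ('a', 'bc');
print join('', @row1) eq join('', @row2) ? "collide\n" : "distinct\n";   # collide

# Joining with $; (by default "\034") preserves the field boundaries:
print join($;, @row1) eq join($;, @row2) ? "collide\n" : "distinct\n";   # distinct
```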


An example with a module: uniq_by from List::UtilsBy

use List::UtilsBy qw(uniq_by);

my @no_dupes = uniq_by { join '', @$_ } @data;

This does essentially the same thing as the sub above.
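For completeness, a self-contained version of the uniq_by variant (assuming the CPAN module List::UtilsBy is installed; the sample data is the same as in the sub-based example above):

```perl
use strict;
use warnings;
use List::UtilsBy qw(uniq_by);

my @data = (
    [ qw(one two three) ],
    [ qw(ten eleven twelve) ],
    [ qw(10 11 12) ],
    [ qw(ten eleven twelve) ],
);

# Each row is keyed by its joined elements; rows with equal keys are
# duplicates, and uniq_by keeps only the first occurrence of each
my @no_dupes = uniq_by { join $;, @$_ } @data;

print scalar(@no_dupes), "\n";    # 3
```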