I have a 2D array in Perl whose data comes from a database as HTML-formatted rows, like below:

<tr><td>Rafa</td><td>Nadal</td><td>Data1</td></tr>,
<tr><td>Goran</td><td>Ivan</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>
I want to remove the duplicate rows from the array.
In the above case, the duplicate
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr> row should be removed.
I tried writing code for this, but it did not work.
Answer 0 (score: 3)
First of all: you really should try to avoid outdated Perl syntax and side effects.
Second: the answer depends on the data structure you build from that input. Here are two example implementations:
#!/usr/bin/perl
use strict;
use warnings;

# 2D Array: list of array references
my @data = (
    ['Rafa', 'Nadal', 'Data1'],
    ['Goran', 'Ivan', 'Data2'],
    ['Leander', 'Paes', 'Data2'],
    ['Leander', 'Paes', 'Data2'],
);

my %seen;
foreach my $unique (
    grep {
        not $seen{
            join('', @{ $_ })
        }++
    } @data
) {
    print join(',', @{ $unique }), "\n";
}

print "\n";

# List of "objects", keys are table column names
@data = (
    { first => 'Rafa', last => 'Nadal', data => 'Data1' },
    { first => 'Goran', last => 'Ivan', data => 'Data2' },
    { first => 'Leander', last => 'Paes', data => 'Data2' },
    { first => 'Leander', last => 'Paes', data => 'Data2' },
);

%seen = ();
my @key_order = qw(first last data);
foreach my $unique (
    grep {
        not $seen{
            join('', @{ $_ }{ @key_order } )
        }++
    } @data
) {
    print join(',', @{ $unique }{ @key_order }), "\n";
}
Output:

$ perl dummy.pl
Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2

Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2
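Note that the rows in the question are HTML strings rather than arrayrefs; the same %seen/grep idiom applies to them directly. A minimal sketch, assuming each row is stored as one string per array element:

my @rows = (
    '<tr><td>Rafa</td><td>Nadal</td><td>Data1</td></tr>',
    '<tr><td>Goran</td><td>Ivan</td><td>Data2</td></tr>',
    '<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>',
    '<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>',
);

my %seen;
# keeps the first occurrence of each row string, drops later repeats
my @unique = grep { not $seen{$_}++ } @rows;

print "$_\n" for @unique;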
Answer 1 (score: 3)
The subroutine shown below is well suited for this job: it works on an array whose elements are array references. That is in fact the basic way to organize 2D data, where your rows are arrayrefs.
There are modules that can be used for this purpose, but this good old approach works just fine:
use warnings;
use strict;
use Data::Dump qw(dd);

sub uniq_arys {
    my %seen;
    grep { not $seen{join $;, @$_}++ } @_;
}

my @data = (
    [ qw(one two three) ],
    [ qw(ten eleven twelve) ],
    [ qw(10 11 12) ],
    [ qw(ten eleven twelve) ],
);

my @data_uniq = uniq_arys(@data);

dd \@data_uniq;
Displaying the data with Data::Dump shows that it prints as expected (the last row is gone).
The sub works by joining each array into a single string, then using a hash to check whether that string has been seen before, i.e. whether the row is a duplicate. $; is Perl's subscript separator; the empty string '' could be used instead.
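The choice of separator is not purely cosmetic: joined with an empty string, two distinct rows can produce the same key. A small illustration, using made-up rows rather than data from the question:

# made-up rows that differ, yet join to the same string 'abc'
my @x = ('ab', 'c');
my @y = ('a', 'bc');

print join('', @x) eq join('', @y) ? "collide\n" : "distinct\n";   # collide

# $; defaults to "\034", which is unlikely to occur in real data
print join($;, @x) eq join($;, @y) ? "collide\n" : "distinct\n";   # distinct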
This approach creates a good deal of auxiliary data (in principle doubling it), so if performance becomes an issue it may be better to compare the rows element by element instead, at the cost of extra complexity. That is likely to matter only for rather large data sets.
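As a sketch of that element-by-element idea (an illustration, not code from the answer), one can keep a row only if it differs from every row kept so far:

# true if two arrayrefs have the same length and pairwise equal elements
sub rows_equal {
    my ($x, $y) = @_;
    return 0 if @$x != @$y;
    for my $i (0 .. $#$x) {
        return 0 if $x->[$i] ne $y->[$i];
    }
    return 1;
}

# O(n^2) comparisons, but builds no auxiliary joined strings
sub uniq_by_compare {
    my @kept;
    for my $row (@_) {
        push @kept, $row unless grep { rows_equal($row, $_) } @kept;
    }
    return @kept;
}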
A module example: uniq_by from List::UtilsBy:
use List::UtilsBy qw(uniq_by);
my @no_dupes = uniq_by { join '', @$_ } @data;
This works essentially the same way as the sub above.
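For reference, here is a complete runnable version of the module approach, assuming the same @data as in the sub example above:

use strict;
use warnings;
use List::UtilsBy qw(uniq_by);
use Data::Dump qw(dd);

my @data = (
    [ qw(one two three) ],
    [ qw(ten eleven twelve) ],
    [ qw(10 11 12) ],
    [ qw(ten eleven twelve) ],
);

# uniq_by keeps the first value for each distinct key the block returns
my @no_dupes = uniq_by { join '', @$_ } @data;

dd \@no_dupes;   # the repeated [qw(ten eleven twelve)] appears only once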