Question

I have this code:

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://www.theguardian.com/media/social-media']

    def parse(self, response):
        items = []
        #Define keywords present in metadata to scrape the webpage
        keywords = ['social media','social business','social networking','social marketing','online marketing','social selling',
            'social customer experience management','social cxm','social cem','social crm','google analytics','seo','sem',
            'digital marketing','social media manager','community manager']
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            #Extract webpage keywords
            metakeywords = link.xpath('//meta[@name="keywords"]').extract()

            #Compare keywords and extract if one of the defined keywords is present in the metadata
            for metaKW in metakeywords:
                if metaKW in keywords:
                    item['SourceTitle'] = link.xpath('/html/head/title').extract()
                    item['TargetTitle'] = link.xpath('text()').extract()
                    item['link'] = link.xpath('@href').extract()
                    outbound = str(link.xpath('@href').extract())
                    if 'http' in outbound:
                        items.append(item)
        return items

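The goal is to compare the variable 'keywords' (a list) against the variable 'metakeywords', which holds the webpage keywords extracted with link.xpath('//meta[@name="keywords"]').extract(). If the comparison finds a match, it should extract the item and append it in the last if statement. However, it returns nothing. I know it should return something, because I checked the web page URL (http://www.socialmediaexaminer.com/). Can anyone help? Cheers!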

Dani

Answer 1

See my comments between the lines of code.

    def parse(self, response):
        items = []
        #Define keywords present in metadata to scrape the webpage
        keywords = ['social media','social business','social networking','social marketing','online marketing','social selling','social customer experience management','social cxm','social cem','social crm','google analytics','seo','sem','digital marketing','social media manager','community manager']
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            #Extract webpage keywords 
            metakeywords = link.xpath('//meta[@name="keywords"]').extract()

What kind of data does .extract() return? Can you give an example? I'm not familiar with the library you're using.
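For what it's worth, in Scrapy .extract() on a selector list returns a list of strings, and for an element selection such as //meta[@name="keywords"] each string is the whole serialized tag rather than an individual keyword. If that is the case here, selecting the content attribute (for example with '//meta[@name="keywords"]/@content') and splitting it on commas would be needed before any comparison. A minimal sketch in plain Python, where the sample strings and the helper name split_keywords are made up for illustration:

```python
# Hypothetical example of what extract() may return for the <meta> element:
# one string holding the entire tag, not a list of separate keywords.
raw_tags = ['<meta name="keywords" content="Social Media, SEO, Marketing">']

# Hypothetical result of extracting only the content attribute instead.
content_values = ['Social Media, SEO, Marketing']


def split_keywords(content_values):
    """Split comma-separated keyword strings into clean, lowercase keywords."""
    result = []
    for value in content_values:
        for part in value.split(','):
            cleaned = part.strip().lower()
            if cleaned:  # skip empty pieces such as a trailing comma
                result.append(cleaned)
    return result


page_keywords = split_keywords(content_values)
```

With a list like page_keywords, a membership test against the lowercase keywords list has a chance of matching, whereas a whole tag string never equals 'seo'.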

            #compare keywords and extract if one of the defined keywords is present in the metadata
            for keywords in metakeywords:

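This is the first major problem. The variable defined before "in" in your for loop should not share a name with any variable you have already defined. You should use a new name, like "metaKW." The value of this variable is set dynamically as your loop checks each item in "metakeywords."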

                if keywords in metakeywords:

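Because "keywords" takes on the value of each item in metakeywords one by one, this statement must always evaluate to true, so it is trivial and unnecessary. But suppose you actually meant to refer to the keywords list that you defined higher up in your code... In that case, as you iterate through the metakeywords list (I'm assuming it's a list or some other kind of iterable), neither "keywords" nor "metakeywords" changes its value. So you would be asking the same question over and over without changing its terms, and getting the same result. Clear up some of these issues, and let us know if you still aren't getting the results you expect.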
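To make the shadowing problem concrete, here is a small sketch with made-up data (the list contents and the function name shadowed_matches are only for illustration). Reusing the name keywords as the loop variable rebinds it, so the test inside the loop is true for every item:

```python
def shadowed_matches(keywords, metakeywords):
    """Reproduce the loop from the question, with the reused variable name."""
    matches = []
    for keywords in metakeywords:      # rebinds 'keywords' to each item in turn
        if keywords in metakeywords:   # an item of a list is always in that list
            matches.append(keywords)
    return matches


# Hypothetical inputs: only 'seo' appears in both lists.
wanted = ['social media', 'seo']
found_on_page = ['news', 'seo']

result = shadowed_matches(wanted, found_on_page)
```

Here result contains every item of found_on_page, including 'news', because the original keywords list is never consulted.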

                    item['SourceTitle'] = link.xpath('/html/head/title').extract()
                    item['TargetTitle'] = link.xpath('text()').extract()
                    item['link'] = link.xpath('@href').extract()
                    outbound = str(link.xpath('@href').extract())
                    if 'http' in outbound:
                        items.append(item)
        return items

Edit, to be more constructive:

The loop you want to use looks like this...

for metaKW in metakeywords:
    if metaKW in keywords:
        # the rest of your code.
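
The same loop can be tried outside the spider with plain lists. This is only a sketch: the helper name find_first_match, the sample data, and the case and whitespace normalization are my own assumptions, not something the original code does.

```python
def find_first_match(metakeywords, keywords):
    """Return the first page keyword that is in the wanted list, or None."""
    wanted = {kw.lower() for kw in keywords}  # set gives fast membership tests
    for meta_kw in metakeywords:
        candidate = meta_kw.strip()
        if candidate.lower() in wanted:
            return candidate  # stop looping at the first match
    return None


# Hypothetical data standing in for the spider's two lists.
keywords = ['social media', 'seo']
metakeywords = ['News', ' SEO', 'social media']

first = find_first_match(metakeywords, keywords)
```

Returning from the loop on the first hit avoids appending the same item once per matching keyword.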
