Question

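I am using Scrapy. I want to scrape the comments on the following page: https://www.thingiverse.com/thing:2/comments

I will be scraping more sites, so I want flexible code.

I don't know how to scrape the comments without losing information about which "container" a comment sits in and how deep the comment is nested.

Suppose I have 3 fields: id_container, content, and depth. That information is enough to reconstruct the relationships between comments. How can I write the code so that every comment carries this information?

The question is quite general, so any tips are welcome.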

Answer 1

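To avoid losing the hierarchy information, you can first get all depth-1 comments and then recurse into each one for the deeper levels, for example: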

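from collections import OrderedDict
from pprint import pprint

def get_children_hierarchy(selector, depth=1):
    """Return a nested OrderedDict mapping each comment id to its replies."""
    hierarchy = OrderedDict()
    # Match elements with class depth-N, then step up to their parent,
    # whose id attribute is used as the key.
    children = selector.css(f'.depth-{depth}').xpath('..')
    for child in children:
        key = child.xpath('./@id').get()
        # Recurse within this comment only, looking for the next depth level.
        hierarchy[key] = get_children_hierarchy(child, depth + 1)
    # Comments without replies are represented as None.
    return hierarchy or None

# `response` is the Scrapy response for the comments page,
# e.g. inside a spider callback or in `scrapy shell`.
pprint(get_children_hierarchy(response))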

Output:

OrderedDict([('comment-2217537', None),
             ('comment-1518847', None),
             ('comment-1507448', None),
             ('comment-1233476', None),
             ('comment-1109024',
              OrderedDict([('comment-1554022', None),
                           ('comment-1215964', None)])),
             ('comment-874441', None),
             ('comment-712565',
              OrderedDict([('comment-731427',
                            OrderedDict([('comment-809279',
                                          OrderedDict([('comment-819752',
                                                        OrderedDict([('comment-1696778',
                                                                      None)]))]))]))])),
             ('comment-472013', None),
             ('comment-472012', OrderedDict([('comment-858213', None)])),
             ('comment-403673', None)])

Then, using each comment id, you can fetch whatever information you need for that specific comment.
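
A nested result of this shape can also be flattened into the fields asked for in the question. The following is only a minimal sketch: `flatten_hierarchy` and the `"id"` key are names introduced here for illustration, and the sample data is a small excerpt in the same shape as the output above. The content field is left out because it depends on each site's markup; it would be looked up per comment id with whatever selector fits the page.

```python
from collections import OrderedDict

def flatten_hierarchy(hierarchy, id_container=None, depth=1):
    """Yield one flat record per comment from the nested hierarchy.

    id_container is the id of the parent comment (None for top-level
    comments); depth starts at 1, matching the depth-N classes.
    """
    # A leaf is stored as None, so treat it as an empty mapping.
    for comment_id, children in (hierarchy or {}).items():
        yield {"id": comment_id, "id_container": id_container, "depth": depth}
        # Replies belong to the current comment and sit one level deeper.
        yield from flatten_hierarchy(children, comment_id, depth + 1)

# Hypothetical excerpt with the same structure as the output above.
sample = OrderedDict([
    ("comment-1109024", OrderedDict([("comment-1554022", None),
                                     ("comment-1215964", None)])),
    ("comment-874441", None),
])

rows = list(flatten_hierarchy(sample))
for row in rows:
    print(row)
```

Because each record stores its parent id and depth, the tree can be rebuilt later from the flat rows, which makes them convenient to yield as Scrapy items.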

Scraping complex comments in Scrapy
