Question

Here is my code snippet. I am trying to scrape a website with Scrapy and then store the data in Elasticsearch for indexing.

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': news.xpath('//meta[@name="pagetype"]/@content').extract(),
            'description': news.xpath('//div[@class="module__content"]/*/node()/text()').extract(),
        }

My problem is with the value that ends up in the 'description' field.

    [u'\n              \n              ', u'"For\n              many of us what we eat on Christmas day isn\'t what we would usually consume and\n              that\u2019s perfectly ok," Dr said.', u'"However\n              it is not uncommon for festive season celebrations to begin in November and\n              continue well in to the New Year.', u'"So\n              if health is on the agenda, being mindful about what we put into our bodies\n              with a balanced approach, throughout the whole festive season, is important."', u"Dr\n              , a lecturer at School\n              Sciences, said balancing fresh, healthy food with being physically active was a\n              good start.", u'"Whatever\n              the celebration, try to limit processed foods, often high in fat, sugar and\n              salt," she said.', u'"Taking\n              time during holidays to prepare food and make the most of fresh ingredients is\n              often a much healthier option than relying on convenience foods and take away.', u'"Being\n              mindful about going back for seconds is important too.\xa0 We don\u2019t need to eat until we feel\n              uncomfortable and eating the foods we enjoy doesn\'t necessarily mean we need to\n              eat copious amounts."', u"Dr\n             own healthy tips and substitutes for the Christmas season\n              include:", u'But\n              just because Dr  is a dietitian, doesn\u2019t mean she doesn\u2019t enjoy a\n              Christmas treat or two.', u'"I\n              would have to say my sister in law\'s homemade rocky road is my favourite\n              festive treat. She makes it every Christmas day and it gets better each year," she\n              said.', u'"I\n              also enjoy a summer cocktail every so often during the festive season and a\n              mojito would be one of my favourites on Christmas day. 
We make it with extra\n              mint from the garden which is a nice, fresh addition.', u'"Rather\n              than focusing on food avoidance, moderation is the best approach.', u'"There\n              are definitely some more healthy choices and some less healthy options when it\n              comes to the typical Christmas day menu, but it\'s more important to be mindful\n              of a healthy, balanced diet throughout the festive period, rather than avoiding\n              specific foods on one day of the year."', u'\n                ', u'\n              \n                ', u'\n                ', u'\n              \n                ', u'\n              ', u'\n                ', u'\n                        ', u'\n                        ', u'\n                        ', u'\n                    ', u'\n            ', u'Related News', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'Search for related news']

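There are a lot of spaces, newline characters and 'u' letters...

How can I process this further so that it contains only plain text, without the extra spaces, newline characters (\n) and 'u' letters?

I have read that BeautifulSoup works well with Scrapy, but I could not find any example of how to integrate Scrapy with BeautifulSoup. I am also open to any other approach. Any help is much appreciated.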

Thanks

Answer 1

You can strip the extra whitespace and newlines from the strings in the list using, for example, the approach shown in this answer:

[' '.join(item.split()) for item in list_of_strings]

where list_of_strings is the list of strings you posted.
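Note that the one-liner above turns whitespace-only entries into empty strings rather than removing them. A minimal sketch of one way to drop those as well is shown here; the helper name clean_strings and the sample data are made up for illustration, and joining everything into a single string is only an assumption about what you want to index.

```python
def clean_strings(list_of_strings):
    """Collapse whitespace runs in each string and drop strings left empty."""
    collapsed = (' '.join(item.split()) for item in list_of_strings)
    return [text for text in collapsed if text]


# A short sample in the shape of the extracted 'description' values.
sample = [
    u'\n              \n              ',
    u'"Being\n              mindful about going back for seconds is important too.\xa0 We',
    u'Related News',
    u'\n          ',
]

cleaned = clean_strings(sample)

# Optional: one plain-text string instead of a list of fragments.
description = ' '.join(cleaned)
```

In the spider this could wrap the existing call, e.g. clean_strings(news.xpath(...).extract()), keeping the XPath expression from the question unchanged.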

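As for the 'u' letters, you should not really worry about them. They only mean that the strings are Unicode strings; the prefix is part of how Python displays the list, not part of the text itself. See, for example, this question on the matter.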
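A small sketch of why the prefix is harmless, written for Python 3 (where every str is Unicode and the prefix is no longer displayed; under Python 2 the same literals print with a leading u when shown inside a list). The sample strings are taken from the output in the question; using json.dumps to stand in for what is sent to Elasticsearch is an assumption made for illustration.

```python
import json

items = [u'Related News', u'that\u2019s perfectly ok']

# The u prefix only marks a Unicode literal; it is not stored in the string.
same_text = (items[0] == 'Related News')
length = len(items[0])  # counts the characters of the text only

# Serialising the value to JSON shows the text without any prefix.
payload = json.dumps({'description': items[0]})
```

The quotes, escapes and prefix appear only when the list itself is printed, because a list is displayed using the repr of each element.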
