Question

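I'm currently testing my first spider with Scrapy, and everything seems to work fine except when I try to extract the data.

I set up a pipeline that saves the items to a database I can read with sqlite3. Each element has multiple tags, but only the first one is kept when the data is exported.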

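I have 3 columns of data:

title (one value per row)
author (one value per row)
tag (multiple values per row).

The problem is that the tag column only shows the first tag that was captured.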

import sqlite3


class QuotetutorialPipeline(object):

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect("myquotes.db")
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS quotes_tb""")
        self.curr.execute("""create table quotes_tb(
                        title text,
                        author text,
                        tag text
                        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        print("Pipeline :" + item['title'][0])
        return item

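    def store_db(self,item):
        # title and author take only the first element; item['tag'] is passed as-is
        self.curr.execute(""" insert into quotes_tb values (?,?,? )""", (
            item['title'][0],
            item['author'][0],
            item['tag']
        ))
        self.conn.commit()  # persist the inserted row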

I expected item['tag'] to hold multiple elements, but only the first one is saved.

Answer 1

I assume item['tag'] is a list. A single text column holds one value, so you have to decide how you want to store it.

# option 1: as json
import json
tags = json.dumps(item['tag'])  #  '["tag1", "tag2", ..]'

# option 2: as joined string
tags = '|'.join(item['tag']) # 'tag1|tag2'

# option 3: one row for each tag
for tag in item['tag']:
    self.curr.execute(""" insert into quotes_tb values (?,?,? )""", (
            item['title'][0],
            item['author'][0],
            tag
        ))
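To compare the three options side by side, here is a minimal, self-contained sketch that writes one item to an in-memory SQLite database and reads the tags back. The function name store_tags_demo and the sample item are hypothetical, made up for illustration; the table layout is copied from the question. It assumes item['tag'] is a list of strings and that no tag contains the '|' character.

```python
import json
import sqlite3


def store_tags_demo(item):
    """Store one item's tags three ways and read each form back.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    conn = sqlite3.connect(":memory:")  # throwaway database for the demo
    curr = conn.cursor()
    curr.execute("create table quotes_tb(title text, author text, tag text)")

    title = item["title"][0]
    author = item["author"][0]
    insert = "insert into quotes_tb values (?,?,?)"

    # Option 1: serialize the whole list into one JSON string.
    curr.execute(insert, (title, author, json.dumps(item["tag"])))

    # Option 2: join the list into one delimited string
    # (assumes '|' never appears inside a tag).
    curr.execute(insert, (title, author, "|".join(item["tag"])))

    # Option 3: one row per tag, repeating title and author.
    curr.executemany(insert, [(title, author, tag) for tag in item["tag"]])
    conn.commit()

    # Read everything back in insertion order.
    stored = [row[0] for row in
              curr.execute("select tag from quotes_tb order by rowid")]
    conn.close()

    return {
        "json": json.loads(stored[0]),    # decode back into a list
        "joined": stored[1].split("|"),   # split back into a list
        "rows": stored[2:],               # already one tag per row
    }


sample_item = {
    "title": ["Some quote"],
    "author": ["Some author"],
    "tag": ["life", "love", "books"],
}
result = store_tags_demo(sample_item)
```

All three forms round-trip to the same list. JSON and the joined string keep one row per quote, which is convenient for export, while one row per tag makes it easier to query by tag in SQL.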

How can I store more than one piece of information (tags) per column when extracting data?
