Question

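I am trying to perform an attribute extraction task on e-commerce data (possibly with an LSTM). My data consists of product descriptions and keywords, for example: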

description = "cat food tuna fish 200 gram"

keywords = {"type of pet": "cat", "taste" : "tuna fish", "weight" : "200 gram"}

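I want to use the keys of the keyword dictionary above as labels for later training. My problem is that, after extracting all the keywords, many of the keys are semantically similar and share words, which leaves me with about 2000 labels. For example: "color of clothes", "color of chair", "main color", "weight", "net weight", "type of material", "type of wood", and so on.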

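I want to reduce the number of labels by grouping keys that share a word: if a dictionary key contains "color", group it under "color" and assign all the values of those keys to the new replacement key, "color".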

At the moment, my dictionary looks like this:

{"type of material": ["wood", "metal", "rayon"....], "type of fabric": ["cotton", "lycra"....]}

I would like it to look like this:

{"type": ["wood", "metal", "rayon", "cotton", "lycra"]}

What is the best way to do this, and is this a reasonable way to tag entities for an LSTM?

Answer 1

You can specify the keywords you want to look for and then use collections.defaultdict while iterating:

from collections import defaultdict

d = {"type of material": ["wood", "metal", "rayon"],
     "type of fabric": ["cotton", "lycra"],
     "color of chair": ["brown", "black"],
     "color of dress": ["red", "yellow"]}

# Words to group by; earlier entries take priority
keywords = ['type', 'color']

dd = defaultdict(list)

for k, v in d.items():
    for word in keywords:
        if word in k:  # substring test against the original key
            dd[word].extend(v)
            break  # stop at the first matching keyword

Note that the first matching keyword takes priority, and each list of values in d ends up under at most one key in the result:

defaultdict(list,
            {'color': ['brown', 'black', 'red', 'yellow'],
             'type': ['wood', 'metal', 'rayon', 'cotton', 'lycra']})
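
One caveat with the loop above: `word in k` is a substring test, so a keyword such as 'type' would also match a key like "prototype number". Below is a minimal sketch of a whole-word variant, assuming the keys are whitespace-separated words. The helper name group_by_keyword and the sample keys are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict

def group_by_keyword(d, keywords):
    """Group the values of d under the first keyword that appears
    as a whole word in each key (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, values in d.items():
        words = set(key.lower().split())  # whole words of this key
        for word in keywords:
            if word in words:
                grouped[word].extend(values)
                break  # first matching keyword wins
    return dict(grouped)

# Illustrative data, not taken from the question
sample = {
    "type of material": ["wood", "metal"],
    "color of chair": ["brown", "black"],
    "prototype number": ["p-1"],  # contains "type" only as a substring
    "main color": ["white"],
}

result = group_by_keyword(sample, ["type", "color"])
print(result)
```

Keys that match no keyword, such as "prototype number" here, are simply left out of the result; whether they should instead keep their original label depends on the labelling scheme.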

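If you need to extract every possible keyword (which in this case would include "of", "material", and so on), you can split each key into its words with str.split.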
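
A minimal sketch of that idea, assuming the keys are plain whitespace-separated phrases. The sample dictionary and the names by_word, counts and repeated are illustrative only. It groups the values under every word of every key, then uses collections.Counter to list the words that occur in more than one key, which are the candidates for merged labels.

```python
from collections import Counter, defaultdict

# Illustrative data shaped like the dictionary in the question
d = {"type of material": ["wood", "metal", "rayon"],
     "type of fabric": ["cotton", "lycra"],
     "color of chair": ["brown", "black"],
     "main color": ["white"]}

# Group values under every word that appears in a key
by_word = defaultdict(list)
for key, values in d.items():
    # set() so a word repeated inside one key adds its values only once
    for word in set(key.split()):
        by_word[word].extend(values)

# Count in how many keys each word appears
counts = Counter(word for key in d for word in set(key.split()))

# Words shared by at least two keys
repeated = sorted(word for word, n in counts.items() if n > 1)
print(repeated)
```

Filler words such as "of" will show up among the repeated words, so in practice a stop-word filter or a manual review of the candidate list would probably be needed before using them as labels.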

Find repeated words in dictionary keys and group the keys by them
