Question

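I have a problem that is driving me crazy. I have a list with millions of entries, and I need to extract the product categories from it. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]". A type check confirms that each entry really is a string: print(type(item)) gives <class 'str'>. I searched online for a regex solution (preferably a fast one, since there are millions of entries) to extract all the categories.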

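I found several related questions. From "Match single quotes from python re" I tried re.findall(r"'(\w+)'", item), but got only empty [] brackets. I kept looking for alternatives, such as "Python Regex to find a string in double quotes within a string", where someone tried matches = re.findall(r'\"(.+?)\"', item) followed by print(matches), but that failed in my case too.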

After that, I tried something crude to get at least a workaround and solve the problem properly later: list_cat_split = item.split(','), which gave me:

["[['Electronics'", " 'Computers & Accessories'", " 'Cables & Accessories'", " 'Memory Card Adapters']]"]

Then I tried to strip out the leftover characters with string methods before applying the regex:

list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)

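But even this approach failed: [['Electronics'], []]. I searched further but did not find a suitable solution. Sorry if this question is completely stupid; I am new to regex, and maybe this is a no-brainer for regular regex users?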

Update:

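For some reason I cannot answer my own question, so I am updating it here. Thank you for the answers, and sorry that the information I gave was incomplete; I rarely ask questions here and usually try to find a solution myself. I do not want to use a database, because this is only a small part of the preprocessing for an ML application written entirely in Python. It is also my MSc project, so there is no production environment. I can therefore live with a slower solution, as long as it works. As far as I can tell, @FailSafe's solution works for me (screenshot of my Jupyter notebook showing the result as a list).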

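But yes, I totally agree with @WiktorStribiżew: in a production setting I would definitely set up a database and let it run overnight. Anyway, thanks for all your help, great people here :-)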

Answer 1

This may not be your final answer, but it creates the list of categories.

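# Sample entry: a string that looks like a nested list
x = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# Drop the leading "[[" and the trailing "]]"
y = x[2:-2]
# Split on commas to get one piece per category
z = y.split(',')

for item in z:
    # Remove the surrounding whitespace and single quotes
    print(item.strip().strip("'"))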
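
The slicing approach above splits on every comma, so it relies on category names never containing one. As a rough sketch of two alternatives, assuming every entry has exactly the shape shown in the question: a regex that captures everything between single quotes, and ast.literal_eval, which parses the string as a Python literal. The names QUOTED, categories_regex and categories_literal are made up for this illustration.

```python
import ast
import re

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# Option 1: regex. [^']+ matches any run of characters that is not a single
# quote, so spaces and '&' are captured too. \w+ matches neither of those,
# which is why multi-word categories are skipped by r"'(\w+)'".
QUOTED = re.compile(r"'([^']+)'")
categories_regex = QUOTED.findall(item)

# Option 2: parse the string as a Python literal (a list containing one list),
# then flatten the outer list.
nested = ast.literal_eval(item)
categories_literal = [category for group in nested for category in group]

print(categories_regex)
print(categories_literal)
```

Both options should yield the same four categories for this sample. The regex is likely the faster of the two, but it would miss a category that contains an apostrophe; ast.literal_eval copes with that case as long as the string is a valid Python literal. To collect unique categories across all entries, adding each result to a set avoids the repeated "not in list" scan.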
