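I have a dataset that I scraped from the web. The "TECHNICAL_SPECIFICATIONS" column contains all of the specifications attributed to the product, in the following format: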
[CONNECTIVITY] Jack: 3.5 mm jack,
[NOISE REDUCTION] Active noise-cancelling: No,
[NOISE REDUCTION] Passive noise-isolating: Yes,
[GENERAL] Weight: 13 g,
[GENERAL] Manufacturer’s guarantee: 1 year,
[GENERAL] Box contents: Bang & Olufsen H3 Headphones,
[OVERVIEW] Colour: Black,
[OVERVIEW] Type: In-ear,
---- (more specifications follow) ----
I want to reformat this data so that every available technical specification is listed for each product.
So far I have tried:
df2 = df['TECHNICAL_SPECIFICATIONS'].str.split(',', expand=True)
# This gave me 40 new columns
df2 = df2[0].str.strip('[CONNECTIVITY]')
df3 = df2[0].str.split(':', expand=True)
# Split the first column into two more columns (reason: I wanted to build key-value pairs, save them as a dictionary, and then add that back)
df3_melt = pd.melt(df3, id_vars=0)
# Did this to melt the frame into one single column of values
#df3["Active noise-cancelling"] = np.where(df2[2].str.contains("Active noise-cancelling: No")== True, "No" ,0)
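None of these pandas string manipulations gives me the desired output. Do I need to use a regular expression?
Does anyone have any resources/blogs/tutorials for solving this problem, or another workaround?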
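
To make the goal concrete, here is a minimal sketch of the regex-based direction I am considering. It assumes every spec has the shape "[SECTION] key: value", that values never contain a "[" character, and that one column per key is the layout I want. The sample row, the pattern SPEC_RE, and the helper parse_specs are hypothetical names made up for illustration, not part of my real code.

```python
import re

import pandas as pd

# Hypothetical sample row that mirrors the format shown in the question.
df = pd.DataFrame({
    "TECHNICAL_SPECIFICATIONS": [
        "[CONNECTIVITY] Jack: 3.5 mm jack,\n"
        "[NOISE REDUCTION] Active noise-cancelling: No,\n"
        "[GENERAL] Weight: 13 g,\n"
        "[OVERVIEW] Colour: Black,"
    ]
})

# One match per spec: "[SECTION] key: value".
# A value ends just before the next "[" or at the end of the string, so commas
# inside a value are kept (assumption: values never contain "[").
SPEC_RE = re.compile(r"\[([^\]]+)\]\s*([^:\[]+):\s*(.*?)\s*,?\s*(?=\[|$)", re.S)


def parse_specs(text):
    """Turn one cell into a {key: value} dict; non-string cells give an empty dict."""
    if not isinstance(text, str):
        return {}
    # The section name is captured but dropped here; it could be prefixed to
    # the key if two sections ever share a key name.
    return {key.strip(): value.strip() for _section, key, value in SPEC_RE.findall(text)}


# Build one column per spec key and attach the columns to the original frame.
specs = pd.DataFrame(df["TECHNICAL_SPECIFICATIONS"].map(parse_specs).tolist(), index=df.index)
result = df.join(specs)
```

Products that lack a given spec would simply get NaN in that column, which seems acceptable for comparing products side by side.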