Question

I have a dataset scraped from the web. The "Technical Specifications" column contains all of the specifications attributed to the product, in the following format:

Specification category in square brackets
Specification
Specification value

[CONNECTIVITY] Jack: 3.5 mm jack, 
[NOISE REDUCTION] Active noise-cancelling: No, 
[NOISE REDUCTION] Passive noise-isolating: Yes, 
[GENERAL] Weight: 13 g, 
[GENERAL] Manufacturer’s guarantee: 1 year, 
[GENERAL] Box contents: Bang & Olufsen H3 Headphones, 
[OVERVIEW] Colour: Black, 
[OVERVIEW] Type: In-ear, 
---- has more features as well ----

I want to reformat the data so that every available technical specification is listed for each product.

What I have tried so far:

import pandas as pd

# Split the specification string on commas; this gave me 40 new columns.
df2 = df['TECHNICAL_SPECIFICATIONS'].str.split(',', expand=True)
# Try to drop the category from the first column. Note that str.strip() treats
# its argument as a set of characters, not as a literal prefix.
first = df2[0].str.strip('[CONNECTIVITY]')
# Split the first column into two columns (I wanted key-value pairs that I
# could save as a dictionary and then add back).
df3 = first.str.split(':', expand=True)
# Melt so that I get one single column of values.
df3_melt = pd.melt(df3, id_vars=0)

#df3["Active noise-cancelling"] = np.where(df2[2].str.contains("Active noise-cancelling: No")== True, "No" ,0)

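None of the pandas string manipulation gave me the right output. Do I need to use a regular expression?

None of these helped me get the desired output. Does anyone have any resources, blogs, or tutorials for solving this problem, or another workaround?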
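
For reference, here is a minimal sketch of the kind of reshaping being asked about, using a regular expression with `Series.str.extractall`. It is only an illustration under stated assumptions: the `PRODUCT` column and the one-row sample frame are hypothetical, only the `TECHNICAL_SPECIFICATIONS` column name comes from the data above, and the pattern assumes that no specification value contains a `[` character.

```python
import pandas as pd

# Hypothetical sample frame; PRODUCT is an assumed column added for illustration.
df = pd.DataFrame({
    "PRODUCT": ["H3 Headphones"],
    "TECHNICAL_SPECIFICATIONS": [
        "[CONNECTIVITY] Jack: 3.5 mm jack, "
        "[NOISE REDUCTION] Active noise-cancelling: No, "
        "[NOISE REDUCTION] Passive noise-isolating: Yes, "
        "[GENERAL] Weight: 13 g, "
        "[OVERVIEW] Colour: Black, "
        "[OVERVIEW] Type: In-ear, "
    ],
})

# "[category] spec: value", where the value runs up to the next "[" or the end.
pattern = r"\[(?P<category>[^\]]+)\]\s*(?P<spec>[^:\[]+):\s*(?P<value>[^\[]*)"

# One row per match, indexed by (original row label, match number).
specs = df["TECHNICAL_SPECIFICATIONS"].str.extractall(pattern)
specs["spec"] = specs["spec"].str.strip()
# Remove the trailing ", " separator and surrounding whitespace from each value.
specs["value"] = specs["value"].str.strip(" ,\n")

# Long format: one row per (original row, specification).
long = (
    specs.reset_index(level="match", drop=True)
         .rename_axis("row")
         .reset_index()
)

# Wide format: one row per product, one column per specification.
# aggfunc="first" keeps the first value if a specification repeats.
wide = long.pivot_table(index="row", columns="spec", values="value", aggfunc="first")

# Attach the new columns to the original frame by row label.
result = df.join(wide)
print(result.drop(columns="TECHNICAL_SPECIFICATIONS").T)
```

Because the value is captured up to the next `[` rather than by splitting on commas, a value that itself contains a comma stays in one piece. The `category` column in `specs` is kept in case the columns need to be grouped by category later.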

Separate or extract text from one column, then add it to other columns based on the value

0 answers: