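I need some help. I am trying to transform a column in a .csv file in which some entries are empty and others contain a list of categories, like this: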
tdaa_matParent,tdaa_matParentQty
[],[]
[],[]
[],[]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[],[]
[Dye Penetrant Solution, BCA_Aluminum],[0.002118882, 1.3458]
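So far I have only managed to binarize the first column (tdaa_matParent); I have not been able to replace the 1s with their corresponding quantity values. This is what I have: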
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

s = materials['tdaa_matParent']
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_)
BCA_Aluminum,Dye Penetrant Solution,tdaa_matParentQty
0,0,[]
0,0,[]
0,0,[]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
0,0,[]
1,1,[1.3458,0.002118882]
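What I actually want is a new set of columns, one per category (i.e. BCA_Aluminum and Dye Penetrant Solution). Wherever a category is present in a row, its cell should hold the corresponding value from the second column (tdaa_matParentQty) instead of 1.
For example: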
BCA_Aluminum,Dye Penetrant Solution
0,0
0,0
0,0
1.3458,0
1.3458,0
1.3458,0
1.3458,0
0,0
1.3458,0.002118882
Answer 0: (score: 0)
This is how I would process the sample data from the question using built-in Python tools:
from collections import OrderedDict

import pandas as pd

# Simple case: the material names are known before the data is processed,
# which allows solving the problem with a single for loop.
# OrderedDict is used to preserve the order of the material names.
base_result = OrderedDict([
    ('BCA_Aluminum', .0),
    ('Dye Penetrant Solution', .0)])
result = list()
with open('1.txt', mode='r', encoding='UTF-8') as file:
    # skip header
    file.readline()
    for line in file:
        # strip the line ending so the last line works even without '\n'
        line = line.strip()
        if not line:
            continue
        # copy base_result so that every row starts from zeros
        base_result_copy = base_result.copy()
        # modify the copy only if there are values in the current line
        if line != '[],[]':
            names, values = line.strip('[]').split('],[')
            for name, value in zip(names.split(', '), values.split(', ')):
                base_result_copy[name] = float(value)
        # append the new row (base or modified) to the result
        result.append(list(base_result_copy.values()))
# turn the list of lists into a pandas DataFrame
result = pd.DataFrame(result, columns=list(base_result.keys()))
print(result)
Output:
BCA_Aluminum Dye Penetrant Solution
0 0.0000 0.000000
1 0.0000 0.000000
2 0.0000 0.000000
3 1.3458 0.000000
4 1.3458 0.000000
5 1.3458 0.000000
6 1.3458 0.000000
7 0.0000 0.000000
8 1.3458 0.002119
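The last row shows 0.002119 instead of 0.002118882 because of how pandas displays floats by default; the original precision is preserved in the actual data in the DataFrame.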
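To check that the rounding is only a display effect, here is a minimal sketch. It rebuilds a small DataFrame by hand (three rows copied from the sample data rather than parsed from the file, which is an assumption made for brevity) and compares the printed value with the stored one. The `display.precision` option is a standard pandas setting.

```python
import pandas as pd

# Hand-built stand-in for the parsed result (assumption: three sample rows only)
result = pd.DataFrame({
    "BCA_Aluminum": [0.0, 1.3458, 1.3458],
    "Dye Penetrant Solution": [0.0, 0.0, 0.002118882],
})

# Default display: floats are shown with 6 decimal places
print(result)

# The stored value keeps its full precision
stored = result.loc[2, "Dye Penetrant Solution"]
print(stored)

# Temporarily widen the display precision for this block only
with pd.option_context("display.precision", 9):
    widened = result.to_string()
print(widened)
```

Using `pd.option_context` keeps the change local; `pd.set_option("display.precision", 9)` would apply it for the whole session.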
Answer 1: (score: 0)
Thanks! I came up with another approach that also works (although it is a bit slower). Any suggestions are welcome :)
df_matParent_with_Qty = pd.DataFrame()
# For each row in the dataframe (index and the row's column info),
for index, row in ass_materials.iterrows():
    # For each row, take every element name (matParent) and its position in the list:
    for i, element in enumerate(row["tdaa_matParent"]):
        # print(i)
        # print(element)
        # Fill the initially empty dataframe: at this row (index) and this element's column,
        # store the value found at the same position in the tdaa_matParentQty list.
        df_matParent_with_Qty.loc[index, element] = row['tdaa_matParentQty'][i]
# Optional: restore rows whose lists were empty and use 0 instead of NaN
df_matParent_with_Qty = df_matParent_with_Qty.reindex(ass_materials.index).fillna(0)
df_matParent_with_Qty.head(10)
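One possible way to avoid the cell-by-cell `.loc` assignment is to build one dict per row and let pandas align the columns. This is only a sketch: it assumes, as the loop above does, that both columns already hold Python lists of equal length per row, and the small `ass_materials` frame here is a hand-made stand-in for the real data. The names `records` and `wide` are illustrative.

```python
import pandas as pd

# Hand-made stand-in for the real data (assumption: cells are Python lists)
ass_materials = pd.DataFrame({
    "tdaa_matParent": [[], ["BCA_Aluminum"], ["Dye Penetrant Solution", "BCA_Aluminum"]],
    "tdaa_matParentQty": [[], [1.3458], [0.002118882, 1.3458]],
})

# One dict per row: material name -> quantity (empty lists give an empty dict)
records = [
    dict(zip(names, quantities))
    for names, quantities in zip(
        ass_materials["tdaa_matParent"], ass_materials["tdaa_matParentQty"]
    )
]

# pandas aligns the dict keys into columns; missing materials become NaN, then 0
wide = pd.DataFrame(records, index=ass_materials.index).fillna(0.0)
print(wide)
```

Because the DataFrame is built in a single call, rows with empty lists are kept and no repeated enlargement of the frame is needed, which tends to be faster than assigning one cell at a time.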