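I'm new to Python. I'm using Pandas to edit a csv file, and I found a function (ty Daniel Himmelstein) that does the job. I was wondering if someone could tell me how to modify the function so that it does not update the last 2 columns in the spreadsheet, which are named 'Start_X' and 'Start_Y'. I need it to leave those cells blank, to be filled in later with new data. Thanks,
Example of the starting spreadsheet: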
AK       MINE  VET                                    X       Y
1016649  0     90;59,180;26,270;39,0;9,270;20,0;17,   482547  1710874
Example of how it needs to be formatted:
AK       MINE  VET  VET_2  X       Y
1016649  0     90   59     482547  1710874
1016649  0     180  26
1016649  0     270  39
1016649  0     0    9
1016649  0     270  20
1016649  0     0    17
Here is the code:
def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
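Answer 0 (score: 0)
This kind of thing is better handled with the csv module. Pandas is great for analyzing and manipulating data, but in this case I would get the file into the right format before loading it into a DataFrame.
You could do it like this: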
import csv

# Where the new data will be stored
data = []

# Open up the csv file
with open('file.csv', 'r') as f:
    # Go through each row
    for i, row in enumerate(csv.reader(f)):
        if i == 0:
            continue
        # Break up the row based on the columns
        ak, mine, *vet, x, y = row
        # Get VET and VET_2
        v12 = [v.split(';') for v in vet]
        # Create new rows with split values of `vet`
        for j, (v1, v2) in enumerate(v12):
            if j == 0:
                new = [ak, mine, v1, v2, x, y]
            else:
                new = [ak, mine, v1, v2, None, None]  # Leave blank spaces after first value
            data.append(new)

# Write out to a new csv file
with open('new_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write header
    writer.writerow(['AK', 'MINE', 'VET', 'VET_2', 'X', 'Y'])
    # Write data
    writer.writerows(data)
With this input, file.csv:
AK,MINE,VET,,,,,,X,Y
1016649,0,90;59,180;26,270;39,0;9,270;20,0;17,482547,1710874
I get the following output, new_file.csv:
AK,MINE,VET,VET_2,X,Y
1016649,0,90,59,482547,1710874
1016649,0,180,26,,
1016649,0,270,39,,
1016649,0,0,9,,
1016649,0,270,20,,
1016649,0,0,17,,
Once the csv file is formatted "properly", loading it into pandas is much easier.
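As a rough sketch of that last step, assuming the output shown above: the text is inlined through io.StringIO only so the snippet runs on its own (in practice you would pass 'new_file.csv' to read_csv), and the nullable "Int64" dtype is my choice so that the blank X and Y cells load as missing values without turning the coordinates into floats.

```python
import io

import pandas as pd

# Contents of new_file.csv as shown above, inlined so the example is self-contained.
csv_text = """AK,MINE,VET,VET_2,X,Y
1016649,0,90,59,482547,1710874
1016649,0,180,26,,
1016649,0,270,39,,
1016649,0,0,9,,
1016649,0,270,20,,
1016649,0,0,17,,
"""

# Blank cells are read as missing; "Int64" (capital I) is pandas' nullable integer dtype.
df = pd.read_csv(io.StringIO(csv_text), dtype={"X": "Int64", "Y": "Int64"})

# Rows that still need coordinates filled in later.
to_fill = df[df["X"].isna()]
print(df)
print(len(to_fill), "rows still need X/Y")
```

Without the dtype argument, pandas would fall back to float64 for X and Y, since NaN has no plain integer representation.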
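Answer 1 (score: 0)
The method you are looking for is DataFrame.stack(), which reshapes the DataFrame from having all the "vets" in one row to having each "vet" in its own row.
Once the shape is right, you can go on to split the data further. This should get you started: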
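import pandas as pd
# df is assumed to be the DataFrame read from the original file; one row per comma-separated "vet"
s = df.VET.str.split(",").apply(pd.Series).stack()
s.index = s.index.droplevel(-1)
# split each "a;b" pair into two columns; dropna() discards the empty piece left by the trailing comma
s = s.apply(lambda x: pd.Series(x.split(";"))).dropna()
# join back on the original index (the two new columns are named 0 and 1)
result = df.drop("VET", axis=1).join(s)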
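For reference, here is a self-contained sketch of the same reshaping on the sample row from the question. It is a variation, not the code above: it swaps stack() for str.split plus DataFrame.explode (which needs pandas 0.25 or later), and then blanks X and Y on every row except the first of each original record, which is what the question asks for. The names tidy, pairs, repeated and orig_row are made up for the illustration, and the sample DataFrame is built by hand instead of being read from a file.

```python
import pandas as pd

# Sample row from the question, built by hand for illustration.
df = pd.DataFrame({
    "AK": [1016649],
    "MINE": [0],
    "VET": ["90;59,180;26,270;39,0;9,270;20,0;17,"],
    "X": [482547],
    "Y": [1710874],
})

# Drop the trailing comma, split on commas, and give each "a;b" pair its own row.
# orig_row remembers which input row each output row came from.
tidy = (
    df.assign(VET=df["VET"].str.strip(",").str.split(","))
      .explode("VET")
      .rename_axis("orig_row")
      .reset_index()
)

# Split each pair into the VET and VET_2 columns.
pairs = tidy["VET"].str.split(";", expand=True)
tidy["VET"] = pairs[0]
tidy["VET_2"] = pairs[1]

# True for every row after the first one of each original record.
repeated = tidy.duplicated(subset="orig_row")

# Blank X and Y on those rows; nullable Int64 keeps the remaining values as integers.
for col in ["X", "Y"]:
    tidy[col] = tidy[col].astype("Int64").mask(repeated)

result = tidy[["AK", "MINE", "VET", "VET_2", "X", "Y"]]
print(result.to_csv(index=False))
```

Missing values are written as empty cells by to_csv, so the blanks survive a round trip through a csv file and can be filled in later.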