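I'm new to Python and I'm trying to clean up some data. I've attached a link to the data file (two tabs: raw data and desired result). Please help!
What I'm trying to do:
Link to the raw data (first tab) and the desired result (second tab): https://www.dropbox.com/s/kjgtwoelq21eetw/Example2.xlsx?dl=0
What I have so far: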
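import numpy as np
import pandas as pd

# Read the first sheet; blank cells become empty strings
data_xls = pd.read_excel("Example2.xlsx", index_col=None).fillna('')
# Skip the leading report rows, then promote the next row to the header
data_xls = data_xls.iloc[22:]
data_xls = data_xls.rename(columns=data_xls.iloc[0]).drop(data_xls.index[0])
# Split the tracking column on '-' (the pieces are not yet merged back into data_xls)
parts = data_xls['Internal Link Tracking (non-promotions) - ENT (c20)'].str.split('-', expand=True)
# The context manager saves the workbook on exit
with pd.ExcelWriter('Output2.xlsx') as writer:
    data_xls.to_excel(writer, sheet_name='O1', index=False)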
Thank you very much for your help! Ty
Answer 0 (score: 0)
Use:
import pandas as pd

# Read the 'Raw data' sheet, skipping the first 23 rows, which are not needed
data_xls = pd.read_excel("Example2.xlsx", sheet_name='Raw data', skiprows=23)
# Names for the new columns, matching the columns in the desired output
dummy_col_names = ['Internal Link Tracking (non','Campaign Name','Creative','Action','Action 2']
# Use str.split with expand=True to get a DataFrame with one column per piece
dummy_df = data_xls['Internal Link Tracking (non-promotions) - ENT (c20)'].str.split('-',expand = True)
# Rename the new columns using the list above
dummy_df.columns = dummy_col_names
# Drop the original column, which is no longer needed
data_xls.drop('Internal Link Tracking (non-promotions) - ENT (c20)', axis=1, inplace=True)
# Concatenate data_xls and dummy_df side by side (axis=1)
data_xls = pd.concat((data_xls,dummy_df),sort=False,axis=1)
# Reorder the columns so they match the order of the desired output
col_names = data_xls.columns.tolist()
data_xls = data_xls[col_names[:1]+dummy_col_names+col_names[1:-5]]
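The reordering step above relies on the tracking column being the second column of the sheet. As a rough sketch of the same idea that does not depend on that position, the split columns can be placed wherever the original column sat by looking up its location first. The small DataFrame below is a made-up stand-in for the 'Raw data' sheet (the `Date` and `Clicks` columns and all values are assumptions for illustration, not taken from the workbook).

```python
import pandas as pd

# Hypothetical stand-in for the 'Raw data' sheet; columns and values are illustrative only
raw = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-02"],
    "Internal Link Tracking (non-promotions) - ENT (c20)": [
        "us-campaign-article1-scrolldown-findoutnow",
        "us-campaign-article1-findoutnow",
    ],
    "Clicks": [10, 4],
})

link_col = "Internal Link Tracking (non-promotions) - ENT (c20)"
new_names = ["Internal Link Tracking (non", "Campaign Name", "Creative", "Action", "Action 2"]

# One column per '-'-separated piece; shorter rows are padded with missing values
parts = raw[link_col].str.split("-", expand=True)
# Name only as many columns as the split actually produced
parts.columns = new_names[: parts.shape[1]]

# Rebuild the frame: columns before the link column, the pieces, then the rest
pos = raw.columns.get_loc(link_col)
result = pd.concat([raw.iloc[:, :pos], parts, raw.iloc[:, pos + 1:]], axis=1)

print(result.columns.tolist())
```

If the real data ever splits into more than five pieces, `new_names` would need extra entries; this sketch does not handle that case.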
Answer 1 (score: 0)
Use pandas to split one column into 2 columns
d = pd.read_csv('file.csv')
col_1
"val1-val2"
"valA-valB"
# n=1 splits only on the first '-'; recent pandas versions require n as a keyword
df = pd.DataFrame(d.col_1.str.split("-", n=1).tolist(), columns=['A', 'B'])
      A     B
0  val1  val2
1  valA  valB
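As a small variation on the approach above, `str.split` with `expand=True` returns a DataFrame directly, so the two pieces can be assigned to new columns without going through `.tolist()`. This is only a sketch: the DataFrame is built in memory instead of read from `file.csv`, and the second value has an extra hypothetical piece (`valC`) to show that `n=1` leaves everything after the first `-` intact.

```python
import pandas as pd

# In-memory stand-in for file.csv; "valA-valB-valC" is a made-up value with two separators
d = pd.DataFrame({"col_1": ["val1-val2", "valA-valB-valC"]})

# Split on the first '-' only and assign the two pieces to new columns
d[["A", "B"]] = d["col_1"].str.split("-", n=1, expand=True)

# "valA-valB-valC" becomes A="valA", B="valB-valC"
print(d)
```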
Answer 2 (score: 0)
Try this:
1.) Delete rows 1-23
df = pd.read_excel('/home/mayankp/Downloads/Example2.xlsx', sheet_name=0, index_col=None, header=None, skiprows=23)
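2.) Split column B into multiple columns using '-' as the delimiter, and 3.) assign column names to the new columns
Both of these steps can be done in one go: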
sub_df = df[1].str.split('-', expand=True).rename(columns = lambda x: "string"+str(x+1))
In [179]: sub_df
Out[179]:
   string1   string2             string3      string4     string5
1       us  campaign            article1   scrolldown  findoutnow
2       us  campaign            article1  scrollright        None
3       us  campaign            article1   findoutnow        None
4       us  campaign  payablesmanagement   findoutnow        None
The above is what the sample looks like after splitting on '-'.
Now drop the original column from df and add the new columns in its place:
df = df.drop(1, axis=1)
df = pd.concat([df,sub_df], axis=1)
4.) Keep the numeric columns
The remaining columns are already intact. No changes are needed.
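To check these steps without the workbook, here is a minimal sketch that runs them on a tiny in-memory frame. The frame mimics a sheet read with `header=None` (integer column labels); its values are assumptions for illustration. It also fills the missing pieces with empty strings, which is an optional extra and not part of the steps above, and prints the dtypes to confirm the numeric columns are untouched.

```python
import pandas as pd

# Hypothetical rows mimicking the sheet read with header=None (integer column labels)
df = pd.DataFrame({
    0: [1, 2],
    1: ["us-campaign-article1-scrolldown-findoutnow", "us-campaign-article1-findoutnow"],
    2: [120, 45],
})

# Split column 1 on '-', name the pieces string1..stringN, and blank out missing pieces
sub_df = (
    df[1].str.split("-", expand=True)
    .rename(columns=lambda x: "string" + str(x + 1))
    .fillna("")
)

# Drop the original column and append the pieces; other columns are left as they were
df = pd.concat([df.drop(1, axis=1), sub_df], axis=1)

print(df.dtypes)
```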
Let me know if this helps.