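I am working on a dataset of roughly 400,000 rows that needs cleaning.
There are two operations to perform:
The resale invoice month is stored as an object (string), for example 'M201705'. I want to create a column named 'Year' that contains only the year, 2017.
Some commercial product codes, which are also objects, end in 'TR'. I want to remove the TR from those products. For example, every 'M23065TR' should become 'M23065'. The same column also contains product names that are already fine, such as 'M340767' or 'M34TR32', and these should stay unchanged.
My attempts are below: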
#First case
for i in range(Ndata.shape[0]):
    Ndata['Year'][i] = str(Ndata['Resale Invoice Month'][i])[1:5]
#A loop takes too much time
#Tried this as well:
Ndata['Year'] = Ndata.str['Resale Invoice Month'][1:5]
#Error: str is not an attribute of DataFrame
for i in range(Ndata.shape[0]):
    if Ndata['Commercial Product Code'][i][-2:] == 'TR':
        Ndata.loc[i, 'Commercial Product Code'] = Ndata.loc[i, 'Commercial Product Code'][:-2]
#Same issue: it is a loop
#I was advised to do this:
idx = Ndata[Ndata['Commercial Product Code'].str[-2:]=='TR']
Ndata.loc[idx, 'Commercial Product Code'] = Ndata[idx]['Commercial Product Code'].str[:-2]
#It doesn't work either
Answer 0 (score: 3)
To take the characters at positions 1 through 4 (0-based) as the year, use Series.str[indices]:
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]
To remove 'TR' from the end of the string, use Series.str.replace. Here $ matches the end of the string:
Ndata['Commercial Product Code'] = Ndata['Commercial Product Code'].str.replace(r'TR$', '', regex=True)
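The two steps above can be sketched end to end on a small, made-up frame. The rows below are hypothetical and only reuse the sample values quoted in the question. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default, in which case 'TR$' would never match.

```python
import pandas as pd

# Hypothetical sample frame built from the values quoted in the question.
Ndata = pd.DataFrame({
    'Resale Invoice Month': ['M201705', 'M201706', 'M201801'],
    'Commercial Product Code': ['M23065TR', 'M340767', 'M34TR32'],
})

# Characters at positions 1..4 (0-based) hold the four-digit year.
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]

# Strip 'TR' only when it sits at the very end of the string.
Ndata['Commercial Product Code'] = (
    Ndata['Commercial Product Code'].str.replace(r'TR$', '', regex=True)
)

print(Ndata)
```

Both operations are vectorized over the whole column, so no Python-level loop over the rows is needed.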
Answer 1 (score: 0)
I believe this is what you want:
# get the 2nd, 3rd, 4th and 5th characters of Ndata[Resale Invoice Month]
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].astype(int)
# remove the last two characters if they are TR
Ndata.loc[Ndata['Commercial Product Code'].str[-2:] == 'TR', 'Commercial Product Code'] = Ndata['Commercial Product Code'].str[:-2]
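As a rough illustration of the boolean-mask approach, the sketch below runs it on a hypothetical three-row frame. It assumes the column has no missing values, and it uses str.endswith to build the mask, which is a small substitution for the str[-2:] == 'TR' comparison and selects the same rows.

```python
import pandas as pd

# Hypothetical sample frame built from the values quoted in the question.
Ndata = pd.DataFrame({
    'Resale Invoice Month': ['M201705', 'M201706', 'M201801'],
    'Commercial Product Code': ['M23065TR', 'M340767', 'M34TR32'],
})

# Slice out the year and convert it to an integer column.
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].astype(int)

# True only for codes whose last two characters are 'TR'.
mask = Ndata['Commercial Product Code'].str.endswith('TR')

# Drop the last two characters on the selected rows only.
Ndata.loc[mask, 'Commercial Product Code'] = (
    Ndata.loc[mask, 'Commercial Product Code'].str[:-2]
)

print(Ndata)
```

The mask is a Series of booleans, not a filtered DataFrame, which is what .loc expects as a row selector.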
Answer 2 (score: 0)
Or a one-liner that uses replace with regex=True:
Ndata['Commercial Product Code'] = Ndata['Commercial Product Code'].replace(r'TR$', '', regex=True)
Now:
print(Ndata)
gives the expected result.
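To make the role of regex=True and of the $ anchor concrete, here is a minimal sketch of how Series.replace behaves on a few hypothetical codes. The standalone 'TR' value is not from the question; it is added only to show whole-value matching.

```python
import pandas as pd

# Hypothetical sample codes; the bare 'TR' is added for illustration.
codes = pd.Series(['M23065TR', 'M340767', 'M34TR32', 'TR'])

# Without regex=True, only values equal to 'TR' as a whole are replaced.
whole = codes.replace('TR', '')

# Unanchored regex: every 'TR' substring is removed, even mid-string.
unanchored = codes.replace('TR', '', regex=True)

# Anchored regex: only a trailing 'TR' is removed.
anchored = codes.replace(r'TR$', '', regex=True)

print(pd.DataFrame({'whole': whole, 'unanchored': unanchored, 'anchored': anchored}))
```

Only the anchored pattern leaves 'M34TR32' intact while still trimming 'M23065TR', which is the behavior the question asks for.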