Question

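I am working on a dataset of about 400,000 rows that needs cleaning.

There are two operations to perform:

The 'Resale Invoice Month' column is an object with values like 'M201705'. I want to create a column named 'Year' that contains only the year, 2017.
Some commercial product codes, which are also objects, end with 'TR'. I want to remove the TR from those products. For example, 'M23065TR' should become 'M23065'. The column also contains product names that are already fine, such as 'M340767' or 'M34TR32', and these should stay unchanged.

My attempts are below: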

#First case
for i in range(Ndata.shape[0]):    
    Ndata['Year'][i] = str(Ndata['Resale Invoice Month'][i])[1:5]
#A loop takes too much time
#Tried that also : 
NData['Year'] = Ndata.str['Resale Invoice Month'][1:5]
#Error : Str is not an attribute of dataframe

for i in range(Ndata.shape[0]):
    if (Ndata['Commercial Product Code'][i][-2:]=='TR')==True:
        Ndata.loc[i,'Commercial Product Code']=Ndata.loc[i,'Commercial Product Code'][:-2]
#Same issue: it is a loop

#I was advised to do this:
idx = Ndata[Ndata['Commercial Product Code'].str[-2:]=='TR']
Ndata.loc[idx, 'Commercial Product Code'] = Ndata[idx]['Commercial Product Code'].str[:-2]
#It doesn't work as well

Answer 1

To extract the year (the characters at positions 1 through 4), use Series.str[indices]:

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]

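To remove 'TR' from the end of a string, use Series.str.replace. Here $ matches the end of the string, and regex=True is required because recent pandas versions treat the pattern as a literal string by default:

Ndata['Commercial Product Code'] = Ndata['Commercial Product Code'].str.replace(r'TR$', '', regex=True)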
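For anyone who wants to check this on a small example first, here is a minimal sketch. The three-row frame is toy data built from the sample values in the question, not the real dataset. It passes regex=True explicitly, since pandas 2.0 and later no longer treat the pattern as a regular expression by default.

```python
import pandas as pd

# Toy data built from the sample values in the question (assumed, not the real dataset).
Ndata = pd.DataFrame({
    'Resale Invoice Month': ['M201705', 'M201612', 'M201801'],
    'Commercial Product Code': ['M23065TR', 'M340767', 'M34TR32'],
})

# The slice 1:5 keeps the characters at positions 1 to 4, i.e. the year as a string.
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]

# 'TR$' only matches a trailing TR, so 'M34TR32' is left alone.
Ndata['Commercial Product Code'] = (
    Ndata['Commercial Product Code'].str.replace(r'TR$', '', regex=True)
)

print(Ndata)
```

Both statements work on the whole column at once, so no Python-level loop over the rows is needed.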

Answer 2

I believe this is what you want:

# get the 2nd, 3rd, 4th and 5th characters of Ndata[Resale Invoice Month]

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].astype(int)

# remove the last two characters if they are TR

Ndata.loc[Ndata['Commercial Product Code'].str[-2:] == 'TR', 'Commercial Product Code'] = Ndata['Commercial Product Code'].str[:-2]
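The same boolean-mask idea can be written with the mask stored in a variable, which may be easier to read and debug. This is only a sketch on toy data taken from the question's examples. The name `mask` is illustrative, and `na=False` is an assumption about how missing codes should be treated (as not ending in 'TR').

```python
import pandas as pd

# Toy data built from the sample values in the question (assumed, not the real dataset).
Ndata = pd.DataFrame({
    'Resale Invoice Month': ['M201705', 'M201612', 'M201801'],
    'Commercial Product Code': ['M23065TR', 'M340767', 'M34TR32'],
})

# Year as an integer column rather than a string.
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5].astype(int)

# True only for codes whose last two characters are 'TR'; missing values count as False.
mask = Ndata['Commercial Product Code'].str.endswith('TR', na=False)

# Strip the last two characters on the selected rows only.
Ndata.loc[mask, 'Commercial Product Code'] = (
    Ndata.loc[mask, 'Commercial Product Code'].str[:-2]
)

print(Ndata)
```

Note that .astype(int) raises an error if any value in the column cannot be parsed as an integer, so it assumes every invoice month follows the 'M201705' pattern.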

Answer 3

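Or a one-liner for the product codes, using Series.replace with regex=True (the pattern is anchored with $ so that only a trailing 'TR' is removed):

Ndata['Commercial Product Code'] = Ndata['Commercial Product Code'].replace(r'TR$', '', regex=True)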

Now:

print(Ndata)

shows the expected result.
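One thing to watch with regex replacement: an unanchored pattern such as 'TR' matches anywhere in the string, so it would also alter codes like 'M34TR32', which the question wants left unchanged. The sketch below compares the two patterns on the sample codes from the question. The variable names are illustrative.

```python
import pandas as pd

# Sample product codes from the question.
codes = pd.Series(['M23065TR', 'M340767', 'M34TR32'])

# Unanchored: removes every occurrence of 'TR', including the one inside 'M34TR32'.
unanchored = codes.replace('TR', '', regex=True)

# Anchored with $: removes 'TR' only when it is at the end of the string.
anchored = codes.replace(r'TR$', '', regex=True)

print(unanchored.tolist())  # ['M23065', 'M340767', 'M3432']
print(anchored.tolist())    # ['M23065', 'M340767', 'M34TR32']
```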

Pandas: optimizing by removing loops
