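I'm working with DataFrames in pandas, using a book distributor as an example.
The warehouse generates .csv files that treat signed (by the author) and unsigned copies of the same title as separate rows, for example: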
TITLE // STOCK
A song of ice and fire // 5
A song of ice and fire (signed) // 1
However, I want one row per title, with an extra column for the signed stock, for example:
TITLE // STOCK // SIGNED STOCK
A song of ice and fire // 5 // 1
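I have managed to read the CSV into a pandas DataFrame and add a blank column, SIGNED STOCK, filled with zeros. I have also cleaned up the data, getting rid of whitespace and NaN values.
However, I can't figure out how to search the rows for titles containing the substring (signed) and then add the stock from those rows to the SIGNED STOCK column of the matching title. Any help is much appreciated! :)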
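import numpy as np
import pandas as pd

IBS_combined = pd.read_csv("IBS_21_05_19.csv", usecols=[3, 12, 21], encoding='latin-1')
IBS_combined.columns = ['Product', 'ISBN', 'Stock']
IBS_combined['Signed Stock'] = 0  # integer zero rather than the string '0', so stock can be added later
# Cells equal to the literal 'Product' become NaN, then rows without a product are dropped
IBS_combined.replace(['Product'], np.nan, inplace=True)
IBS_combined.dropna(subset=['Product'], inplace=True)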
Answer 0: (score: 0)
You can do the following:
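your_string = ' (signed)'  # marker that identifies signed copies
signed = []
for _, row in IBS_combined.iterrows():  # iterrows() yields (index, row) pairs
    if row['Product'].find(your_string) != -1:
        # remember the base title and the stock of the signed copy
        signed.append((row['Product'].replace(your_string, ''), row['Stock']))
Then you can iterate over signed and add the amounts:
for title, stock in signed:
    # use .loc so the assignment changes IBS_combined itself rather than a temporary copy
    IBS_combined.loc[IBS_combined['Product'] == title, 'Signed Stock'] += stock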
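The same idea can also be written without explicit loops. This is only a minimal sketch: it assumes the column names from the question (Product, Stock, Signed Stock), that signed copies are marked with the suffix " (signed)", and it uses a small made-up frame in place of the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV in the question
IBS_combined = pd.DataFrame({
    'Product': ['A song of ice and fire',
                'A song of ice and fire (signed)',
                'another book'],
    'Stock': [5, 1, 10],
})

marker = ' (signed)'  # assumed marker for signed copies
is_signed = IBS_combined['Product'].str.contains(marker, regex=False)

# Signed stock per base title (marker stripped from the title)
signed_stock = (
    IBS_combined[is_signed]
    .assign(Product=lambda d: d['Product'].str.replace(marker, '', regex=False))
    .groupby('Product')['Stock'].sum()
)

# Keep only the unsigned rows and look up each title's signed stock (0 if none)
result = IBS_combined[~is_signed].copy()
result['Signed Stock'] = result['Product'].map(signed_stock).fillna(0).astype(int)
print(result)
```

Note that a title which exists only as a signed copy has no unsigned row to attach to, so this sketch drops it.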
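Answer 1: (score: 0)
You can split the DataFrame into two DataFrames, one holding the signed rows and one holding the unsigned rows, and then merge the results. Below is an example (it assumes that ISBN is the unique key that identifies a book, and that each book has at most one entry for signed stock and at most one for unsigned stock):
Set up sample data that includes ISBN with the following code.
The sample includes a book with only an unsigned stock entry (ISBN 3) and one with only a signed entry (ISBN 4):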
import io
import pandas as pd

# named "data" rather than "str" to avoid shadowing the built-in
data = """ISBN // TITLE // STOCK
1 // A song of ice and fire // 5
1 // A song of ice and fire (signed) // 1
2 // another book // 10
2 // another book (signed) // 2
3 // 2nd book // 3
4 // 3rd book (signed) // 1"""
df = pd.read_csv(io.StringIO(data), sep=' // ', engine='python')
Split the DataFrame into two DataFrames using the mask m below:
df_signed: df[m]
df_unsigned: df[~m]
m = df.TITLE.str.contains(r'\(signed\)')
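To see what such a mask selects, here is a small sketch on a made-up Series of titles (the data is illustrative only). The parentheses are escaped because str.contains treats the pattern as a regular expression by default; passing regex=False is an alternative that matches the literal substring.

```python
import pandas as pd

# Illustrative titles, not taken from a real file
titles = pd.Series(['A song of ice and fire',
                    'A song of ice and fire (signed)',
                    '2nd book'])

# Escaped parentheses match the literal text "(signed)"
m = titles.str.contains(r'\(signed\)')
print(m.tolist())

# regex=False matches the literal substring without any escaping
m_literal = titles.str.contains('(signed)', regex=False)
print(m_literal.tolist())
```

Both masks mark only the second title, so titles[m] gives the signed rows and titles[~m] the unsigned ones.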
Set up df_signed (set ISBN as the index, rename the stock column, and remove the substring '(signed)' from the TITLE column):
df_signed = df[m].set_index('ISBN') \
                 .rename(columns={'STOCK': 'SIGNED_STOCK'}) \
                 .replace(r'\s*\(signed\)', '', regex=True)
print(df_signed)
# TITLE SIGNED_STOCK
#ISBN
#1 A song of ice and fire 1
#2 another book 2
#4 3rd book 1
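Set up df_unsigned and join it with df_signed using DataFrame.combine_first():
# fillna(0) fills the missing stock counts; astype(int) turns the stock columns back into integers
df_new = df[~m].set_index('ISBN') \
               .combine_first(df_signed) \
               .fillna(0) \
               .astype({'STOCK': int, 'SIGNED_STOCK': int}) \
               .reset_index()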
print(df_new)
# ISBN SIGNED_STOCK STOCK TITLE
#0 1 1 5 A song of ice and fire
#1 2 2 10 another book
#2 3 0 3 2nd book
#3 4 1 0 3rd book
Reorder the columns:
cols = ['TITLE', 'ISBN', 'STOCK', 'SIGNED_STOCK']
df_new = df_new[cols]
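For comparison, the split-then-join idea can also be sketched with an outer pd.merge instead of combine_first(). This is an alternative under the same assumptions as above (ISBN identifies a book, at most one signed and one unsigned entry per book), not the method used in the answer, and the sample data is a shortened version of the one above.

```python
import io
import pandas as pd

data = """ISBN // TITLE // STOCK
1 // A song of ice and fire // 5
1 // A song of ice and fire (signed) // 1
3 // 2nd book // 3
4 // 3rd book (signed) // 1"""
df = pd.read_csv(io.StringIO(data), sep=' // ', engine='python')

m = df['TITLE'].str.contains('(signed)', regex=False)

# Signed rows: rename the stock column and strip the marker from the title
signed = (
    df[m]
    .rename(columns={'STOCK': 'SIGNED_STOCK'})
    .assign(TITLE=lambda d: d['TITLE'].str.replace(' (signed)', '', regex=False))
)

# Outer merge keeps books that appear in only one of the two frames
merged = pd.merge(df[~m], signed, on=['ISBN', 'TITLE'], how='outer')
merged = (
    merged.fillna({'STOCK': 0, 'SIGNED_STOCK': 0})
          .astype({'STOCK': int, 'SIGNED_STOCK': int})
          .sort_values('ISBN')
          .reset_index(drop=True)
)
print(merged[['TITLE', 'ISBN', 'STOCK', 'SIGNED_STOCK']])
```

Merging on both ISBN and TITLE avoids duplicate TITLE_x / TITLE_y columns, but it relies on the signed and unsigned titles being identical once the marker is removed.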