I'm working with a DataFrame of about 200k records that looks like this (the real data is replaced with random text):
ID Description
1 Eg.1
2 Desc.2
3 Desc.3
80
aaa
output
500
c
d
e
f
input
100 Desc.100
200 Desc.200
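I've loaded it into a pandas DataFrame and thought I could do something like this:
for x in df['ID']:
    if type(df['ID'][x]) == str:
        df['Description'][x-1] += ' ' + df['ID'][x].values
to append the text that wrongly ended up under ID to the description of the preceding row (the expected result I'm after is shown below):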
ID Description
1 Eg.1
2 Desc.2
3 Desc.3
80 aaa output
500 c d e f input
100 Desc.100
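Only numbers should remain in the ID column, with all the stray text appended to the description of the preceding valid ID. (A further complication: the number of stray text rows under an ID varies between 1 and 10.)
I'm a bit stuck because x in the code above yields the strings found in df['ID'], not row positions. Any ideas on how to do this relatively quickly for 200k+ records?
Thanks!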
Answer 0 (score: 0)
Here is an idea for how to do this in pandas.
I read your example from the clipboard:
import pandas as pd
import numpy as np
df = pd.read_clipboard()
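First, for rows where ID holds a string, I copy that string into Description, because that is where it belongs. I use str(x).isnumeric() so that every cell is treated as a string even if it isn't one. If some cells were imported as numbers and others as strings, calling .isnumeric directly would raise an error on the numeric cells.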
df.loc[df['ID'].apply(lambda x: not str(x).isnumeric()), 'Description'] = df['ID']
Then I clear the ID on just those rows:
df.loc[df['ID'].apply(lambda x: not str(x).isnumeric()), 'ID'] = np.nan
I fill the now-empty IDs with the ID from the row above:
df['ID'] = df['ID'].ffill()
Since the first row of each such group still has an empty Description, I drop those rows and group the rest:
df_result = df.dropna().groupby('ID', sort=False).aggregate(lambda x: ' '.join(x))
print(df_result)
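Because the clipboard step cannot be reproduced, here is a minimal self-contained sketch of the same steps. It assumes the IDs are strings and that empty descriptions are NaN; the sample frame and the helper name merge_stray_text are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; empty descriptions are NaN (an assumption).
df = pd.DataFrame({
    "ID": ["1", "2", "3", "80", "aaa", "output", "500",
           "c", "d", "e", "f", "input", "100", "200"],
    "Description": ["Eg.1", "Desc.2", "Desc.3"] + [np.nan] * 9
                   + ["Desc.100", "Desc.200"],
})

def merge_stray_text(frame):
    """Hypothetical helper: move non-numeric IDs into Description and merge them."""
    out = frame.copy()
    # True where the ID cell is not purely numeric text.
    stray = ~out["ID"].astype(str).str.isnumeric()
    # The stray text belongs in Description.
    out.loc[stray, "Description"] = out.loc[stray, "ID"]
    # Blank those IDs, then inherit the last valid ID above them.
    out["ID"] = out["ID"].mask(stray).ffill()
    # Numeric rows that had no description of their own drop out here.
    out = out.dropna(subset=["Description"])
    return out.groupby("ID", sort=False)["Description"].agg(" ".join).reset_index()

result = merge_stray_text(df)
print(result)
```

Note that grouping by the ID value merges records whenever the same ID occurs more than once in the data.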
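Something to consider: if the broken data lives in a file rather than in a DataFrame, I would probably write code that goes through the file line by line and writes the repaired rows to a corrected file. That way the 200k rows never have to be held in memory all at once, and the whole job becomes easier because you only need to run the fix a single time.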
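A rough sketch of that line-by-line idea follows. The file layout is an assumption (tab-separated, one header line, ID in the first column), and fix_rows is a hypothetical helper; the demo uses in-memory files so it runs as is.

```python
import csv
import io

def fix_rows(rows):
    """Yield (id, description) pairs, folding stray text rows into the previous ID."""
    current_id, parts = None, []
    for row in rows:
        if not row:
            continue  # skip blank lines
        first = row[0].strip()
        if first.isdigit():
            # A new valid record starts: emit the one we were building.
            if current_id is not None:
                yield current_id, " ".join(parts)
            current_id = first
            parts = [cell.strip() for cell in row[1:] if cell.strip()]
        elif current_id is not None:
            # Stray text: it belongs to the most recent valid ID.
            parts.extend(cell.strip() for cell in row if cell.strip())
    if current_id is not None:
        yield current_id, " ".join(parts)

# Demo on an in-memory file; a real run would use open(path, newline="") instead.
raw = io.StringIO(
    "ID\tDescription\n1\tEg.1\n80\t\naaa\noutput\n500\t\nc\nd\n100\tDesc.100\n"
)
reader = csv.reader(raw, delimiter="\t")
header = next(reader)  # assumed header line

fixed = io.StringIO()
writer = csv.writer(fixed, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(fix_rows(reader))
print(fixed.getvalue())
```

Only one record is held in memory at a time, so the size of the file should not matter much.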
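Answer 1 (score: 0)
You can keep only the numeric values in ID by moving the non-numeric ID text into Description. After forward-filling ID, apply a groupby and join the descriptions.
import numpy as np
# Assumes df is the DataFrame from the question and every ID is a string.
df['Description'] = df.apply(lambda x: x['Description'] if x['ID'].isdigit() else x['ID'], axis=1).fillna('')
df['ID'] = df['ID'].apply(lambda x: x if x.isdigit() else np.nan).ffill()
# strip() removes the leading space left by the empty description on the ID row
df = df.groupby('ID', sort=False)['Description'].apply(lambda x: ' '.join(x).strip()).reset_index()
Out: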
ID Description
0 1 Eg.1
1 2 Desc.2
2 3 Desc.3
3 80 aaa output
4 500 c d e f input
5 100 Desc.100
6 200 Desc.200
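One caveat with grouping by the ID value: if the same ID appears more than once, those records get merged into one. A possible variation, sketched here under the assumptions that the first row is a numeric ID and that IDs are strings, groups by a running record number instead. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample where ID 7 appears twice as separate records.
df = pd.DataFrame({
    "ID": ["7", "x", "8", "7", "y", "z"],
    "Description": ["", "", "Desc.8", "", "", ""],
})

is_id = df["ID"].str.isdigit()
# Each numeric row starts a new record; cumsum gives every record its own number.
record = is_id.cumsum()
# Text to collect per row: the original description on ID rows, the stray text otherwise.
text = df["Description"].fillna("").where(is_id, df["ID"])

result = pd.DataFrame({
    "ID": df.loc[is_id, "ID"].to_numpy(),
    # Join only the non-empty pieces so no leading space is produced.
    "Description": text.groupby(record)
                       .agg(lambda s: " ".join(p for p in s if p))
                       .to_numpy(),
})
print(result)
```

Both records with ID 7 survive as separate rows, each with its own joined text.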
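Answer 2 (score: 0)
This uses numpy almost exclusively. Even though the code is longer, it is faster than the pandas groupby approach. Repeated numeric values in the ID column are fine (with the code as written, every numeric row is returned whether or not its ID is duplicated).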
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['1', '2', '3', '80', 'aaa',
                          'output', '500', 'c', 'd',
                          'e', 'f', 'input', '100', '200'],
                   'Description': ['Eg.1', 'Desc.2', 'Desc.3',
                                   '', '', '', '', '', '', '',
                                   '', '', 'Desc.100', 'Desc.200']})
IDs = df.ID.values
# numeric test function for ID column
def isnumeric(s):
    try:
        float(s)
        return 1
    except ValueError:
        return 0
# find the rows which are numeric and mark with 1 (vs 0)
nums = np.frompyfunc(isnumeric, 1, 1)(IDs).astype(int)
# make another array, which marks
# str IDs with a 1 (opposite of nums)
strs = 1 - nums
# make arrays to hold shifted arrays of strs and nums
nums_copy = np.empty_like(nums)
strs_copy = np.empty_like(strs)
# make an array of nums shifted fwd 1
nums_copy[0] = 1
nums_copy[1:] = nums[:-1]
# make an array of strs shifted back 1
strs_copy[-1] = 0
strs_copy[:-1] = strs[1:]
# make arrays to detect where str and num
# ID segments begin and end
str_idx = strs + nums_copy
num_idx = nums + strs_copy
# find indexes of start and end of ID str segments
starts = np.where(str_idx == 2)[0]
ends = np.where(str_idx == 0)[0]
# make a continuous array of IDs which
# were marked as strings
txt = IDs[np.where(strs)[0]]
# split that array into string segments which will
# become a combined string row value
txt_arrs = np.split(txt, np.cumsum(ends - starts)[:-1])
# join the string segment arrays
txt_arrs = [' '.join(x) for x in txt_arrs]
# find the row indexes which will contain combined strings
combo_str_locs = np.where(num_idx == 2)[0][:len(txt_arrs)]
# put the combined strings into the Description column
# at the proper indexes
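# (positional assignment is used here because np.put expects an ndarray
# and may raise a TypeError on a Series in recent pandas versions)
df.iloc[combo_str_locs, df.columns.get_loc('Description')] = txt_arrs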
# slice the original dataframe to retain only numeric
# ID rows
df = df.iloc[np.where(nums == 1)[0]]
# If a new index is desired >> df.reset_index(inplace=True, drop=True)
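One rough way to check the speed claim at your own data size is a small timing harness. This is only a sketch: make_frame, groupby_merge and split_merge are hypothetical helpers, split_merge is a compact np.split variant standing in for the numpy approach rather than the code above, and the synthetic data assumes unique IDs with a numeric first row.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n_records, seed=0):
    """Synthetic data: unique numeric IDs, each followed by 0-3 stray text rows."""
    rng = np.random.default_rng(seed)
    ids, desc = [], []
    for i in range(n_records):
        k = int(rng.integers(0, 4))
        ids.append(str(i))
        desc.append("" if k else f"Desc.{i}")
        ids.extend(["txt"] * k)
        desc.extend([""] * k)
    return pd.DataFrame({"ID": ids, "Description": desc})

def join_nonempty(parts):
    return " ".join(p for p in parts if p)

def groupby_merge(df):
    """pandas route: forward-fill the ID and join the text per ID."""
    is_id = df["ID"].str.isdigit()
    text = df["Description"].where(is_id, df["ID"])
    key = df["ID"].where(is_id).ffill()
    return text.groupby(key, sort=False).agg(join_nonempty).reset_index()

def split_merge(df):
    """numpy route: cut the text array at every numeric row and join each chunk."""
    ids = df["ID"].to_numpy(dtype=object)
    is_id = df["ID"].str.isdigit().to_numpy(dtype=bool)
    text = np.where(is_id, df["Description"].to_numpy(dtype=object), ids)
    starts = np.flatnonzero(is_id)
    chunks = np.split(text, starts[1:])
    return pd.DataFrame({"ID": ids[starts],
                         "Description": [join_nonempty(c) for c in chunks]})

sample = make_frame(10_000)  # raise this to approach the 200k rows in the question
a, b = groupby_merge(sample), split_merge(sample)
assert a["Description"].tolist() == b["Description"].tolist()

for fn in (groupby_merge, split_merge):
    seconds = timeit.timeit(lambda: fn(sample), number=3)
    print(f"{fn.__name__}: {seconds / 3:.3f} s per run")
```

The timings depend on the machine, the pandas version and how many stray rows each record has, so it is worth measuring on a slice of the real data before picking an approach.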
Answer 3 (score: 0)
Another approach is shown below. Input data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['1', '2', '3', '80', 'aaa', 'output', '500', 'c', 'd', 'e', 'f', 'input', '100', '200'],
                   'Description': ['Eg.1', 'Desc.2', 'Desc.3', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Desc.100', 'Desc.200']})
The logic that processes the DataFrame to get the desired result:
df['IsDigit'] = df['ID'].str.isdigit()
df['Group'] = df['IsDigit'].ne(df['IsDigit'].shift()).cumsum()
dfG = df[df['IsDigit'] == False].groupby(['Group'])['ID'].apply(lambda x: ' '.join(x))
df = df.drop(df[df['IsDigit'] == False].index)
df.loc[df['Description'].isna(), 'Description'] = df[df['Description'].isna()].apply(lambda x: dfG[x['Group'] + 1], axis=1)
df = df.drop(columns=['IsDigit', 'Group']).set_index('ID')
It produces the following output:
Description
ID
1 Eg.1
2 Desc.2
3 Desc.3
80 aaa output
500 c d e f input
100 Desc.100
200 Desc.200
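The Group column is the key step in the logic above, so here is a tiny standalone illustration of how ne(shift()).cumsum() numbers consecutive runs. The flags Series is made-up data.

```python
import pandas as pd

flags = pd.Series([True, True, False, False, True, False, True])
# True wherever the value differs from the row above (the first row always counts).
changed = flags.ne(flags.shift())
# A running count of changes numbers each consecutive run: 1, 2, 3, ...
group = changed.cumsum()
print(group.tolist())  # [1, 1, 2, 2, 3, 4, 5]
```

Each run of stray text therefore gets the group number directly after the numeric run that precedes it, which is why the lookup uses Group + 1. The lookup assumes every numeric row with a missing description is immediately followed by text rows; otherwise it would raise a KeyError.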
Hope this helps you and anyone else looking for a similar solution.