I have some data that is pre-populated from another system, and its DataFrame looks like this:
id;value
101;Product_1,,,,,,,,,,,,,,,,,,,,,,,Product_2,,,,,,,,,,,,,,,,,,,,,,, Product_3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan, Product_4,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
102;,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
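I am trying to clean this up so that I remove every value that is blank, that is, wherever two or more commas (,) appear in a row.
Expected output: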
id; value
101; Product_1, Product_2, Product_3, Product_4
102;
The semicolon (;) is the delimiter.
Answer 0 (score: 2)
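First, import the data while specifying the semicolon as the delimiter. Then you can run str.replace() to collapse the commas. You actually have three replacements to perform: remove the literal nan and None tokens along with the spaces, collapse each run of commas into a single ", ", and finally replace any value that is left as nothing but ", " (a row with no products). I have specified a blank "" for that last replace, but for many purposes replacing it with numpy.nan would be more useful.
import pandas as pd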
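df = pd.read_csv(path, sep=';')
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
cleaned = df['value'].str.replace(r'nan|None| ', '', regex=True).str.replace(r',+', ', ', regex=True).replace(', ', '')
# Trim the separator left at either end, and assign the result back (these calls are not in-place)
df['value'] = cleaned.str.strip(', ')
df['value'].str.split(', ')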
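One caveat with the regex approach is that the pattern nan|None also matches inside real names (for example, "Banana" would lose its "nan"). As an alternative, here is a minimal sketch that splits on commas and filters out the placeholder tokens instead. The sample data is a shortened stand-in for the file in the question, and `clean_value` and `PLACEHOLDERS` are hypothetical names introduced only for this illustration.

```python
import io

import pandas as pd

# Shortened stand-in for the question's file (fewer commas, same structure).
raw = (
    "id;value\n"
    "101;Product_1,,,,Product_2,,,, Product_3,nan,nan,nan, Product_4,None,None,None\n"
    "102;,,,,,,nan,nan,nan,None,None,None\n"
)
df = pd.read_csv(io.StringIO(raw), sep=";")

# Tokens treated as "empty" (an assumption based on the sample data).
PLACEHOLDERS = {"", "nan", "None"}


def clean_value(value):
    """Hypothetical helper: keep only the real tokens of a comma-separated string."""
    tokens = [token.strip() for token in str(value).split(",")]
    return ", ".join(token for token in tokens if token not in PLACEHOLDERS)


# Compare whole tokens rather than substrings, so names containing "nan" survive.
df["value"] = df["value"].map(clean_value)
print(df.to_csv(sep=";", index=False))
```

Rows with no products end up as an empty string here; if a missing value is more useful downstream, `df["value"].replace("", numpy.nan)` should convert them afterwards.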