I have CSV data in the following format:
ab aback abandon abate Class
ab NaN abandon NaN A
NaN aback NaN NaN A
NaN aback abandon NaN B
ab NaN NaN abate C
NaN NaN abandon abate C
I want to remove the NaN cells and rearrange the data as:
ab abandon A
aback A
aback abandon B
ab abate C
abandon abate C
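The processed sheet does not need a header. I have tried many threads, such as Remove NaN from pandas series, Missing Data In Pandas Dataframes, and How can I remove Nan from list Python/NumPy, but they all offer column-wise solutions.
Here is the sample file. It has empty cells, and when I display it with a DataFrame, all the empty cells show up as NaN. This is the code: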
import pandas as pd
df = pd.read_csv('C:/Users/ABRAR/Google Drive/Tourism Project/Small_sample.csv', low_memory=False)
print(df)
Answer 0 (score: 3)
df = df.apply(lambda x: sorted(x.values.astype(str)), axis=1)\
.replace('nan','')
df = df.drop(df.index[df.eq('').all(axis=1)]) #drop all null rows
df = df.drop(df.columns[df.eq('').all()],axis=1) #drop all null columns
print(df.head())
Output:
ab aback
14 access
18 accept
23 access
24 able accept
47 accepted
Answer 1 (score: 2)
Maybe I misunderstand your goal, but something like this is easy to do with a bit of Python code.
#!/usr/bin/env python
new_lines = []
with open('data.csv', 'r') as csv:
    # skip the first line
    csv.readline()
    for line in csv.readlines():
        words = line.strip().split()
        new_words = [w for w in words if w != 'NaN']
        new_lines.append(' '.join(new_words))
for l in new_lines:
    print(l)
Answer 2 (score: 0)
pandas
df.dropna(how='all').apply(lambda x: pd.Series(x.dropna().values), 1).fillna('')
0 1
14 access
18 accept
23 access
24 able accept
47 accepted
58 able acceptable
60 access
69 abundance
78 academy
87 access
93 accept
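To see what this one-liner does to the layout shown in the question, here is a minimal sketch. The inline frame is an assumption: it is a hand-built copy of the question's five sample rows, not the real file. It applies the same row-wise `dropna` and then joins each row's remaining cells back into one line.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample rows from the question (assumed, not read from the real file).
df = pd.DataFrame(
    [
        ["ab", np.nan, "abandon", np.nan, "A"],
        [np.nan, "aback", np.nan, np.nan, "A"],
        [np.nan, "aback", "abandon", np.nan, "B"],
        ["ab", np.nan, np.nan, "abate", "C"],
        [np.nan, np.nan, "abandon", "abate", "C"],
    ],
    columns=["ab", "aback", "abandon", "abate", "Class"],
)

# Drop the NaN cells in each row and shift the surviving cells to the left.
compact = (
    df.dropna(how="all")
      .apply(lambda row: pd.Series(row.dropna().values), axis=1)
      .fillna("")
)

# Join the non-empty cells of each row back into one line of text.
lines = [" ".join(v for v in row if v != "") for row in compact.values.tolist()]
print("\n".join(lines))
```

Rows that are shorter than the longest one are padded with empty strings by `fillna('')`, which is why the join step filters them out.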
numpy
v = df.values
i, j = np.where(df.notnull().values)
split_idx = np.where(np.append(False, i[1:] != i[:-1]))[0]
pd.DataFrame(np.split(v[i, j], split_idx), pd.unique(i)).fillna('')
0 1
14 access
18 accept
23 access
24 able accept
47 accepted
58 able acceptable
60 access
69 abundance
78 academy
87 access
93 accept
A head-spinning comprehension that I can barely follow myself
pd.DataFrame(*list(map(
    list,
    zip(*[(v[m], i) for v, m, i in
          zip(df.values, df.notnull().values, df.index)
          if m.any()])
))).fillna('')
0 1
14 access
18 accept
23 access
24 able accept
47 accepted
58 able acceptable
60 access
69 abundance
78 academy
87 access
93 accept
Timing
%timeit df.dropna(how='all').apply(lambda x: pd.Series(x.dropna().values), 1).fillna('')
100 loops, best of 3: 7.21 ms per loop
%%timeit
v = df.values
i, j = np.where(df.notnull().values)
split_idx = np.where(np.append(False, i[1:] != i[:-1]))[0]
pd.DataFrame(np.split(v[i, j], split_idx), pd.unique(i)).fillna('')
1000 loops, best of 3: 1.29 ms per loop
%%timeit
pd.DataFrame(*list(map(
    list,
    zip(*[(v[m], i) for v, m, i in
          zip(df.values, df.notnull().values, df.index)
          if m.any()])
))).fillna('')
1000 loops, best of 3: 1.44 ms per loop
%%timeit
d1 = df.apply(lambda x: sorted(x.values.astype(str)), axis=1).replace('nan','')
d1 = d1.drop(d1.index[d1.eq('').all(axis=1)])
d1.drop(d1.columns[d1.eq('').all()],axis=1)
10 loops, best of 3: 20.1 ms per loop
Answer 3 (score: 0)
Thanks to @Perennial for the suggestion above. In the end, I did the following.
import pandas as pd

new_lines = []
with open('data.csv', 'r') as csv:
    # skip the first line
    csv.readline()
    for line in csv.readlines():
        words = line.strip().split(',')
        # keep only the non-empty cells
        new_words = [w for w in words if w and w.strip()]
        # skip the empty lines
        if len(new_words) != 0:
            new_lines.append(','.join(new_words))
df = pd.DataFrame(new_lines)
df.to_csv('results.csv', sep=',')
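@Scott's solution is elegant, but for reasons I don't understand it always throws a MemoryError exception for me. One more thing: I don't want the row numbers in the result file, and I would appreciate help with that. For now, I used Excel to delete that column :)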
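On the row numbers: they come from the DataFrame index, which `to_csv` writes by default. A minimal sketch of one way to leave them out is shown here. It assumes the cleaned rows are already in a list shaped like `new_lines`; the three sample values are made up from the question's data, and an in-memory buffer stands in for 'results.csv'.

```python
import io

import pandas as pd

# Hypothetical cleaned rows, in the shape produced by the loop above.
new_lines = ["ab,abandon,A", "aback,A", "aback,abandon,B"]

# One cell per column, so pandas does not have to quote a whole line as a single field.
df = pd.DataFrame([line.split(",") for line in new_lines])

buffer = io.StringIO()  # stands in for 'results.csv'
# index=False drops the row numbers; header=False drops the 0, 1, 2 column labels.
df.to_csv(buffer, index=False, header=False)
print(buffer.getvalue())
```

Shorter rows are padded with an empty cell, so they end with a trailing comma. If that is unwanted, writing the strings in `new_lines` straight to a file with plain Python avoids pandas altogether.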
Answer 4 (score: 0)
The following code removes a row if it contains a certain value (in this case, 'Amine'):
import pandas as pd
import numpy as np
data = {'Name': ['Amine', 'Ali', 'Muhammad', 'Kareem',np.nan],
'Year': [2017, 2018,1995,2010,2018]}
df = pd.DataFrame(data)
df[df.Name != 'Amine']
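Specifically: this returns a new DataFrame that includes every row of 'df' whose cell value in the 'Name' column is not equal to 'Amine'.
To remove rows that contain NaN in a certain column, this code will help: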
df[pd.notnull(df.Name)]
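As a small sketch of the same idea, pandas also offers `DataFrame.dropna` with a `subset` argument, which should give the same result as the boolean mask. The data below is the frame defined earlier in this answer; the names `by_mask` and `by_dropna` are only illustrative.

```python
import numpy as np
import pandas as pd

data = {'Name': ['Amine', 'Ali', 'Muhammad', 'Kareem', np.nan],
        'Year': [2017, 2018, 1995, 2010, 2018]}
df = pd.DataFrame(data)

# Boolean-mask form, as in the line above.
by_mask = df[pd.notnull(df.Name)]

# Built-in form: drop rows whose 'Name' cell is missing.
by_dropna = df.dropna(subset=['Name'])

print(by_dropna)
```

Neither form changes `df` itself; assign the result back if the filtered frame should replace the original.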