Remove empty rows and empty brackets [] using Python

Time: 2017-04-10 12:59:14

Tags: python csv pandas dataframe

My CSV file has 10,000 rows. I want to remove the empty brackets [] and the rows that contain only [[]], as shown in the image below:

enter image description here

For example, the first cell of the first column:

[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

needs to become:

[['1', 2364, 2382, 1552, 1585],['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

and rows that contain only empty brackets:

[[]]    [[]]

need to be removed from the file. The result should be:

enter image description here

I tried:

df1 = df.Column_1.str.strip([]).str.split(',', expand=True)

My data is of type str:

print(type(df.loc[0,'Column_1']))
<class 'str'>

print(type(df.loc[0,'Column_2']))
<class 'str'>

EDIT1: After running the following code:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

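the problem was solved. However, I now have a problem with the comma ',' when it appears as a character rather than as a separator.

From the resulting rows, I want to create a new CSV file with the following columns: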

columns =['char', 'left', 'right', 'top', 'down']

for example, corresponding to a row such as:

'1' 2364 2382 1552 1585

so that the CSV file looks like this:

   char  left  top  right  bottom
0   'm'    38  104   2456    2492
1   'i'    40  102   2442     222
2   '.'   203  213    191     198
3   '3'   235  262    131    3333
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

The complete code to get this is:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

cols = ['char','left','right','top','bottom']

df1 = df.positionlrtb.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)

However, with this approach I cannot find any ',' character in my file, and the rows in the new CSV file end up misaligned. Instead of:

',' 1491    1494    172 181 

I get no comma ','. The misalignment is visible in the following two rows:

 '    '     1491    1494    172
181  'r'    1508    1517    159

It should be:

','1491 1494 172 181
 'r'1508 1517 159 ...... and so on

EDIT2

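I am trying to add two more columns named line_number and all_chars_in_same_row.

1) line_number corresponds to the line from which, for example,

'm' 38 104 2456 2492

was extracted (here, line 2).

2) all_chars_in_same_row corresponds to all the (space-separated) characters in the same line. For example, for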

character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

I would get '1' '8' '4' '1' '7' and so on.

More formally, all_chars_in_same_row means: write all the characters of the line given in the line_number column.

char  left  top  right  bottom     line_number  all_chars_in_same_row
0   'm'    38  104   2456    2492   from line 2  'm' '2' '5' 'g'
1   'i'    40  102   2442     222   from line 4
2   '.'   203  213    191     198   from line 6
3   '3'   235  262    131    3333  
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

The related code is:

import pandas as pd

df_data=pd.read_csv('see2.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

x=len(df_data.columns) #get total number of columns 
#get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1]=df_data.index.get_level_values(0) 
 # now set line number and character columns as Index of data frame
df_data.set_index([x+1,x],inplace=True,drop=True)

df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns=cols
df_data.reset_index(inplace=True) #remove mutiindexing
print(df_data[cols])

and the output:

     char  left   top right bottom  from line all_chars_in_same_row
0     '.'   203   213   191    198          0  ['.', '3', 'C']
1     '3'  1758  1775   370    391          0  ['.', '3', 'C']
2     'C'   296   305  1492   1516          0  ['.', '3', 'C']
3     'A'   275   347   147    239          1  ['A', 'M', 'D']
4     'M'  2166  2184   370    391          1  ['A', 'M', 'D']
5     'D'   339   362  1815   1840          1  ['A', 'M', 'D']
6     'A'    73    91   373    394          2  ['A', 'D', 'A']
7     'D'  1395  1415   427    454          2  ['A', 'D', 'A']
8     'A'  1440  1455  2047   2073          2  ['A', 'D', 'A']
9     'D'   454   473   663    685          3  ['D', 'O', '0']
10    'O'  1533  1545   487    541          3  ['D', 'O', '0']
11    '0'   339   360  2137   2163          3  ['D', 'O', '0']
12    'A'   108   129   727    751          4  ['A', 'V', 'I']
13    'V'  1659  1677   490    514          4  ['A', 'V', 'I']
14    'I'   339   360  1860   1885          4  ['A', 'V', 'I']
15    'N'    34    51   949    970          5  ['N', '/', '2']
16    '/'  1890  1904   486    505          5  ['N', '/', '2']
17    '2'  1266  1283  1951   1977          5  ['N', '/', '2']
18    'S'  1368  1401    43     85          6  ['S', 'A', '8']
19    'A'  1344  1361   583    607          6  ['S', 'A', '8']
20    '8'  2207  2217  1492   1515          6  ['S', 'A', '8']
21    'S'  1437  1457   112    138          7  ['S', 'o', 'O']
22    'o'  1548  1580   979   1015          7  ['S', 'o', 'O']
23    'O'  1331  1349   370    391          7  ['S', 'o', 'O']
24    'h'  1686  1703   315    339          8  ['h', 't', 't']
25    't'   169   190  1291   1312          8  ['h', 't', 't']
26    't'   169   190  1291   1312          8  ['h', 't', 't']
27    'N'  1331  1349   370    391          9  ['N', 'C', 'C']
28    'C'   296   305  1492   1516          9  ['N', 'C', 'C']
29    'C'   296   305  1492   1516          9  ['N', 'C', 'C']

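However, I get strange results (the order of letters, digits, columns, headers, and so on). I cannot share the files because they are too long; I tried, but they exceed the maximum number of characters. This line of code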

df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

returns None values:

  0      1      2      3      4     5      6      7      8      9     ...   \
0  'm'     38    104   2456   2492   'i'     40    102   2442   2448  ...    
1  '.'    203    213    191    198   '3'    235    262    131    198  ...    
2  'A'    275    347    147    239   'M'    363    465    145    239  ...    
3  'A'     73     91    373    394   'D'     93    112    373    396  ...    
4  'D'    454    473    663    685   'O'    474    495    664    687  ...    
5  'A'    108    129    727    751   'V'    129    150    727    753  ...    
6  'N'     34     51    949    970   '/'     52     61    948    970  ...    
7  'S'   1368   1401     43     85   'A'   1406   1446     43     85  ...    
8  'S'   1437   1457    112    138   'o'   1458   1476    118    138  ...    
9  'h'   1686   1703    315    339   't'   1706   1715    316    339  ...    
   1821  1822  1823  1824  1825  1826  1827  1828  1829  1830  
0  None  None  None  None  None  None  None  None  None  None  
1  None  None  None  None  None  None  None  None  None  None  
2  None  None  None  None  None  None  None  None  None  None  
3  None  None  None  None  None  None  None  None  None  None  
4  None  None  None  None  None  None  None  None  None  None  
5  None  None  None  None  None  None  None  None  None  None  
6  None  None  None  None  None  None  None  None  None  None  

EDIT3: However, when I add page_number alongside character_position:

df1 = pd.DataFrame({
        "from_line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position)),
        "page_number" : np.repeat(df.index.values,df['page_number'])
})

I get the following error:

 File "/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

4 Answers:

Answer 0 (score: 1)

You can use a list comprehension:

arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

new_arr = [x for x in arr if x]

Or perhaps you prefer list + filter:

new_arr = list(filter(lambda x: x, arr))

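The reason lambda x: x works here is that the lambda tests whether each x in arr is "truthy". More specifically, filter drops every element of arr that is "falsy", such as the empty list []. It is almost like saying "keep everything in arr that 'exists'", so to speak.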
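As a minimal sketch of the truthiness rule described above (the shortened sample list is only illustrative), the built-in filter also accepts None in place of the lambda, which keeps exactly the truthy items:

```python
arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]

# Empty containers are falsy, non-empty ones are truthy.
empty_is_truthy = bool([])
filled_is_truthy = bool(['1', 2364])

# filter(None, ...) keeps only truthy items, which is equivalent
# to filter(lambda x: x, ...).
kept_with_none = list(filter(None, arr))
kept_with_lambda = list(filter(lambda x: x, arr))
```

Note that a truthiness test also drops values such as 0, '' and None, so an explicit len(x) > 0 check may be clearer when the list mixes types.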

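Answer 1 (score: 1)

For lists, you can first use applymap with a list comprehension to remove the [] entries, and then filter the rows by boolean indexing, where the mask checks that all values in the row have a non-zero length (that is, none of them is an empty list).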

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

If rows need to be removed when any value is [[]]:

df1 = df1[~(df1.applymap(len).eq(0)).any(axis=1)]
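The two masks can be compared on a small example. This is only a sketch on a made-up two-column frame whose cells are already Python lists; it uses apply with Series.map instead of applymap, because newer pandas releases deprecate applymap.

```python
import pandas as pd

# Hypothetical data: each cell is a list of lines, some of them empty.
df = pd.DataFrame({
    "Column_1": [[["1", 2364, 2382, 1552, 1585], []], [[]], [["E", 2369, 2381, 1623, 1640]]],
    "Column_2": [[["8", 2369, 2382, 1644, 1668]], [[]], [[]]],
})

# Drop the empty inner lists from every cell.
cleaned = df.apply(lambda col: col.map(lambda cell: [y for y in cell if len(y) > 0]))

# Length of every cleaned cell; 0 means the cell ended up empty.
lengths = cleaned.apply(lambda col: col.map(len))

# Keep rows in which no cell is empty (two equivalent spellings).
keep_all_filled = cleaned[lengths.ne(0).all(axis=1)]
drop_any_empty = cleaned[~lengths.eq(0).any(axis=1)]

# A looser variant: drop a row only when every cell in it is empty.
drop_all_empty = cleaned[~lengths.eq(0).all(axis=1)]
```

Here row 1 is empty in both columns and row 2 is empty only in Column_2, so the first two results keep row 0 only, while the looser variant keeps rows 0 and 2.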

If the values are strings:

df1 = df.replace([r'\[\],', r'\[\[\]\]', ''],['','', np.nan], regex=True)

Then use dropna:

df1 = df1.dropna(how='all')

Or:

df1 = df1.dropna()
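As an alternative sketch for string cells, the text can be parsed with ast.literal_eval instead of being edited with regular expressions. The helper name clean_cell is hypothetical, and the sample strings are shortened from the question.

```python
import ast

def clean_cell(text):
    """Parse a cell such as "[['1', 1, 2, 3, 4], []]" and drop empty sub-lists.

    Returns None when nothing is left, so the caller can drop that row.
    (clean_cell is a hypothetical helper, not part of pandas.)
    """
    parsed = ast.literal_eval(text)
    kept = [item for item in parsed if len(item) > 0]
    return kept or None

cells = [
    "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]",
    "[[]]",
]
cleaned = [clean_cell(c) for c in cells]
```

With pandas, such a helper could be applied to a string column through Series.apply, after which dropna removes the rows that became None.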

EDIT1:

import ast
from itertools import chain

import numpy as np
import pandas as pd

df = pd.read_csv('see2.csv', index_col=0)

df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)

df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
      page_number                                       positionlrtb  \
0  1841729699_001  [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...   
1  1841729699_001   [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]   
2  1841729699_001  [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...   
3  1841729699_001  [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...   
4  1841729699_001  [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...   

                    LineIndex  
0      [[mi, il, mu, il, il]]  
1                      [[.3]]  
2                   [[amsun]]  
3  [[adresse, de, livraison]]  
4                [[document]]

cols = ['char','left','top','right','bottom']

df1 = pd.DataFrame({
        "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
        "b": list(chain.from_iterable(df.positionlrtb))})

df1 = pd.DataFrame(df1.b.values.tolist())    
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)   

print (df1)
     char  left   top  right  bottom
0       m    38   104   2456    2492
1       i    40   102   2442    2448
2       i    40   100   2402    2410
3       l    40   102   2372    2382
4       m    40   102   2312    2358
5       u    40   102   2292    2310
6       i    40   104   2210    2260
7       l    40   104   2180    2208
8       i    40   104   2140    2166
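The following sketch illustrates why parsing with ast.literal_eval keeps a ',' character intact where splitting the text on commas does not. The sample cell is illustrative: the values for ',' come from the question, while the last value for 'r' is made up because the question elides it.

```python
import ast

# Illustrative cell; the final number (170) is a made-up placeholder.
cell = "[[',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 170]]"

# Naive splitting treats the quoted comma as a separator, so the
# ten values turn into eleven fields and every column shifts.
naive_fields = cell.strip('[]').split(',')

# Parsing the literal keeps ',' as a one-character string.
rows = []
for line in ast.literal_eval(cell):
    # Each character is described by five consecutive values.
    for i in range(0, len(line), 5):
        rows.append(tuple(line[i:i + 5]))
```

After this, rows holds one (char, left, top, right, bottom) tuple per character, with ',' preserved as the char of the first tuple.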

EDIT2:

#skip first row
df = pd.read_csv('see2.csv', usecols=[2], names=['character_position'], skiprows=1)
print (df.head())
                                  character_position
0  [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1  [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2  [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3  [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4  [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
#convert to list, remove empty lists
df.character_position = df.character_position.apply(ast.literal_eval)
df.character_position = df.character_position.apply(lambda x: [y for y in x if len(y) > 0])

#new df - http://stackoverflow.com/a/42788093/2901002
df1 = pd.DataFrame({
        "from line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position))})

#keep only the strings from each list; convert to tuple because the values become part of the index
df1['all_chars_in_same_row'] = df1['b'].apply(lambda x: tuple([y for y in x if isinstance(y, str)]))
df1 = df1.set_index(['from line','all_chars_in_same_row'])
#new df from column b
df1 = pd.DataFrame(df1.b.values.tolist(), index=df1.index)   
#Multiindex in columns
df1.columns = [df1.columns % 5, df1.columns // 5]
#reshape
df1 = df1.stack().reset_index(level=2, drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
#convert last columns to int
df1[cols[1:]] = df1[cols[1:]].astype(int)
df1 = df1.reset_index()
#convert tuples to list
df1['all_chars_in_same_row'] = df1['all_chars_in_same_row'].apply(list)
print (df1.head(15))
    from line           all_chars_in_same_row char  left  top  right  bottom
0           0  [m, i, i, l, m, u, i, l, i, l]    m    38  104   2456    2492
1           0  [m, i, i, l, m, u, i, l, i, l]    i    40  102   2442    2448
2           0  [m, i, i, l, m, u, i, l, i, l]    i    40  100   2402    2410
3           0  [m, i, i, l, m, u, i, l, i, l]    l    40  102   2372    2382
4           0  [m, i, i, l, m, u, i, l, i, l]    m    40  102   2312    2358
5           0  [m, i, i, l, m, u, i, l, i, l]    u    40  102   2292    2310
6           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2210    2260
7           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2180    2208
8           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2140    2166
9           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2124    2134
10          1                          [., 3]    .   203  213    191     198
11          1                          [., 3]    3   235  262    131     198
12          2                 [A, M, S, U, N]    A   275  347    147     239
13          2                 [A, M, S, U, N]    M   363  465    145     239
14          2                 [A, M, S, U, N]    S   485  549    145     243
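Regarding the TypeError shown in EDIT3 of the question: the message is consistent with np.repeat receiving the page_number strings as its repeat counts, which must be integers. A possible sketch, assuming a made-up frame with the same column names, repeats the page_number values by the number of inner lists per row instead:

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical frame: page_number holds strings, character_position
# holds already-parsed lists of lines.
df = pd.DataFrame({
    "page_number": ["1841729699_001", "1841729699_002"],
    "character_position": [
        [["m", 38, 104, 2456, 2492], ["i", 40, 102, 2442, 2448]],
        [[".", 203, 213, 191, 198]],
    ],
})

# Integer repeat counts: the number of inner lists in each row.
counts = df["character_position"].str.len().to_numpy()

df1 = pd.DataFrame({
    "from_line": np.repeat(df.index.to_numpy(), counts),
    "page_number": np.repeat(df["page_number"].to_numpy(), counts),
    "b": list(chain.from_iterable(df["character_position"])),
})
```

Every column then has one entry per inner list, so the three arrays have matching lengths.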

Answer 2 (score: 0)

new_list = []
for x in old_list:
    if len(x) > 0:
        new_list.append(x)

Answer 3 (score: 0)

You can do this:

lst = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_lst = [i for i in lst if len(i) > 0]