Get the row index of each extracted character from a csv file

Date: 2017-04-06 16:47:43

Tags: python csv pandas dataframe

I have a column in my csv file (the second column, called character_position) which represents a list of characters and their positions, as follows:

Each line of this column contains a list of character positions. Overall I have 300 lines in this column, each with its own list of character positions.

character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

Each character has four values: left, top, right, bottom. For instance, character '1' has left=1890, top=1904, right=486, bottom=505.
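The five-value layout above can be sketched in plain Python (a minimal illustration; `line` is a shortened copy of the first row):

```python
# one (shortened) line from character_position; the full lists work the same way
line = ['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]

# every consecutive group of five values is one character record
records = [tuple(line[i:i + 5]) for i in range(0, len(line), 5)]
for char, left, top, right, bottom in records:
    print(char, left, top, right, bottom)
```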

My whole csv file is read as follows:

df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])

From this file I created a new csv file with five columns:

column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
   char  left  top  right  bottom
0   'm'    38  104   2456    2492
1   'i'    40  102   2442     222
2   '.'   203  213    191     198
3   '3'   235  262    131    3333
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444
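The `columns % 5` / `columns // 5` step above is the key trick: it builds a two-level column index so that stacking turns every five-value group into its own row. A minimal sketch with a toy one-row frame:

```python
import pandas as pd

# toy frame: one csv line that expanded into two 5-value character groups
df1 = pd.DataFrame([['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448]])

# (col % 5) is the field inside a group, (col // 5) is the group number;
# stacking the group level turns every group into its own row
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = ['char', 'left', 'top', 'right', 'bottom']
print(df1)
```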

I want to add two other columns, called line_number and all_chars_in_same_row: 1) line_number is the line from which a record (for example 'm' 38 104 2456 2492) was extracted, say line 2; 2) all_chars_in_same_row holds all the (space-separated) characters that are in the same row. For instance, given:

character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

I get '1' '8' '4' '1' '7' and so on.

More formally, all_chars_in_same_row means: write all the characters of the row identified by line_number into that column.

char  left  top  right  bottom     line_number  all_chars_in_same_row
0   'm'    38  104   2456    2492   from line 2  'm' '2' '5' 'g'
1   'i'    40  102   2442     222   from line 4
2   '.'   203  213    191     198   from line 6
3   '3'   235  262    131    3333  
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444
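One way to fill all_chars_in_same_row, once a line-number column exists, is a groupby transform. A minimal sketch, assuming a FromLine column that holds each record's source line:

```python
import pandas as pd

df = pd.DataFrame({'char': ["'m'", "'i'", "'.'"],
                   'FromLine': [0, 0, 1]})

# join every character that came from the same source line
df['all_chars_in_same_row'] = df.groupby('FromLine')['char'].transform(' '.join)
print(df)
```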

EDIT1:

import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')

df_data.shape

(50, 3)

df_data.icol(1)   
0     [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1     [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2     [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3     [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4     [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5     [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6     [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7     [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8     [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9     [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10    [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11    [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12    [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13    [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14    [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15    [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16    [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17    [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18    [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19    [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20    [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21    [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22    [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23    [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24    [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25    [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26    [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27    [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28                      [['S', 2369, 2382, 1833, 1866]]
29    [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30    [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31    [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32    [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33    [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34    [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35    [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36                         [['c', 88, 118, 2872, 2902]]
37    [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38    [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39    [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40                                                 [[]]
41    [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42    [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43                                                 [[]]
44    [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45    [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46    [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47    [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48    [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49    [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object

Then for my char.csv I do the following:

    df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
    df = df.replace([r'\[', r'\]'], ['', ''], regex=True)




cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column
for c in cols:
    df1[c] = df1[c].replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('chars.csv')
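As an aside, `str.strip('[]')` only trims brackets at the very start and end of each string, which is why the extra regex replaces are needed for the inner brackets. A small sketch:

```python
import pandas as pd

s = pd.Series(["[['m', 38], ['i', 40]]"])

# strip only removes the outer brackets; the inner ones survive
inner = s.str.strip('[]')[0]
print(inner)   # "'m', 38], ['i', 40"

# a regex replace removes every bracket
clean = s.replace([r'\[', r'\]'], ['', ''], regex=True)[0]
print(clean)   # "'m', 38, 'i', 40"
```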

However, I don't see in your response how you added the columns FromLine and all_chars_in_same_row.

When I execute your line of code:

df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)

I get the following:

df_data[0:10]
  0      1      2      3      4     5      6      7      8      9     ...   \
0  'm'     38    104   2456   2492   'i'     40    102   2442   2448  ...    
1  '.'    203    213    191    198   '3'    235    262    131    198  ...    
2  'A'    275    347    147    239   'M'    363    465    145    239  ...    
3  'A'     73     91    373    394   'D'     93    112    373    396  ...    
4  'D'    454    473    663    685   'O'    474    495    664    687  ...    
5  'A'    108    129    727    751   'V'    129    150    727    753  ...    
6  'N'     34     51    949    970   '/'     52     61    948    970  ...    
7  'S'   1368   1401     43     85   'A'   1406   1446     43     85  ...    
8  'S'   1437   1457    112    138   'o'   1458   1476    118    138  ...    
9  'h'   1686   1703    315    339   't'   1706   1715    316    339  ...    
   1821  1822  1823  1824  1825  1826  1827  1828  1829  1830  
0  None  None  None  None  None  None  None  None  None  None  
1  None  None  None  None  None  None  None  None  None  None  
2  None  None  None  None  None  None  None  None  None  None  
3  None  None  None  None  None  None  None  None  None  None  
4  None  None  None  None  None  None  None  None  None  None  
5  None  None  None  None  None  None  None  None  None  None  
6  None  None  None  None  None  None  None  None  None  None  

Here are the first 10 lines of my csv file:

    character_position
0   [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1   [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2   [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3   [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4   [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5   [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6   [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, 
'3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7   [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8   [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 
151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 
1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9   [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10  [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]

Here is the second csv file:

    char    left    right   top bottom
0   'm' 38  104 2456    2492
1   'i' 40  102 2442    2448
2   'i' 40  100 2402    2410
3   'l' 40  102 2372    2382
4   'm' 40  102 2312    2358
5   'u' 40  102 2292    2310
6   'i' 40  104 2210    2260
7   'l' 40  104 2180    2208
8   'i' 40  104 2140    2166

EDIT1 (continued):

Here is my output for solution 2 (input: the character_position column described above):

    1831    1830    level_2 char    left    top right   bottom  FromLine    all_chars_in_same_row
0   0   character_position  0   character_position                  0   character_position
1   1   'm','i','i','l','m','u','i','l','i','l' 0   'm' 38  104 2456    2492    1   'm','i','i','l','m','u','i','l','i','l'
2   1   'm','i','i','l','m','u','i','l','i','l' 1   'i' 40  102 2442    2448    1   'm','i','i','l','m','u','i','l','i','l'
3   1   'm','i','i','l','m','u','i','l','i','l' 2   'i' 40  100 2402    2410    1   'm','i','i','l','m','u','i','l','i','l'

I think the problem comes from the fact that I have data like [[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]] in my file, so:

The empty `[]` entries break the column order. I noticed this when I tried to omit all the empty `[]`, because my csv then looks wrong: in the char column I get `['a` rather than `'a'`, and for values `8794]` rather than `8794`, or `[5345` rather than `5345`. So I processed the csv as follows:

    df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
    df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
    cols = ['char','left','right','top','bottom']
    df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
    df1.columns = [df1.columns % 5, df1.columns // 5]
    df1 = df1.stack().reset_index(drop=True)
    df1.columns = cols
    # strip any leftover brackets from every column
    for c in cols:
        df1[c] = df1[c].replace([r'\[', r'\]'], ['', ''], regex=True)
    df1.to_csv('char.csv')


Then I noticed the following:

Look at line 1221, column B: it is empty (it is where a `[]` was replaced), and columns B and C end up switched because of the empty char. How can I solve that? I also have empty lines:

3831    '6' 296 314 3204    3231
3832                    
3833    '1' 480 492 3202    3229

Line 3832 should be removed.
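A sketch of one way to drop fully empty rows like 3832, under the assumption that the empty row really contains only empty strings:

```python
import numpy as np
import pandas as pd

# toy version of rows 3831-3833: the middle row is entirely empty
df1 = pd.DataFrame({'char': ["'6'", '', "'1'"],
                    'left': ['296', '', '480']})

# treat empty strings as missing, then drop rows that are missing everywhere
df1 = df1.replace('', np.nan).dropna(how='all').reset_index(drop=True)
print(df1)
```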

in order to get a clean table without empty rows.

**EDIT2:**

In order to solve the problem of empty rows and [] in list_characters.csv

[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]] and [[]] [[]]

I did the following:

    import numpy as np

    df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

    df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

    df1 = df.replace([r'\[\],', r'\[\[\]\]', ''], ['', '', np.nan], regex=True)

    df1 = df1.dropna()
Then:

import ast

df = pd.read_csv('character_position.csv', index_col=0)

df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)

df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
      page_number                                       positionlrtb  \
0  1841729699_001  [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...   
1  1841729699_001   [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]   
2  1841729699_001  [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...   
3  1841729699_001  [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...   
4  1841729699_001  [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...   

                    LineIndex  
0      [[mi, il, mu, il, il]]  
1                      [[.3]]  
2                   [[amsun]]  
3  [[adresse, de, livraison]]  
4                [[document]]
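The literal_eval-plus-filter step can be checked on a single problematic cell (a minimal sketch; `raw` mimics one stored value):

```python
import ast

# one stored cell with an empty [] in the middle
raw = "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]"

# parse the string back into a Python list and drop the empty sublists
parsed = [group for group in ast.literal_eval(raw) if group]
print(parsed)
```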

import numpy as np
from itertools import chain

cols = ['char','left','top','right','bottom']

df1 = pd.DataFrame({
        "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
        "b": list(chain.from_iterable(df.positionlrtb))})

df1 = pd.DataFrame(df1.b.values.tolist())    
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)   
print (df1)
     char  left   top  right  bottom
0       m    38   104   2456    2492
1       i    40   102   2442    2448
2       i    40   100   2402    2410
3       l    40   102   2372    2382
4       m    40   102   2312    2358
5       u    40   102   2292    2310
6       i    40   104   2210    2260
7       l    40   104   2180    2208
8       i    40   104   2140    2166

However:

df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

returns None values
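The None values come from `expand=True` itself: with rows of different lengths, it creates as many columns as the longest row needs and pads the shorter rows with missing values. A tiny reproduction:

```python
import pandas as pd

# two lines with different numbers of characters
s = pd.Series(["'a', 1, 2, 3, 4", "'b', 1, 2, 3, 4, 'c', 5, 6, 7, 8"])

# expand=True builds as many columns as the longest row needs
# and pads the shorter rows with missing values
out = s.str.split(', ', expand=True)
print(out.shape)       # (2, 10)
print(out.iloc[0, 5])  # padded with a missing value
```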

1 Answer:

Answer 0 (score: 1)

After creating the required dataframe, do not drop the index after stacking; it preserves your line numbers. Since this is a multi-level index, take the first level: that is your line number.

df_data['LineIndex'] = df_data.index.get_level_values(0)

Then you can group by the LineIndex column and collect all the characters for each common LineIndex. This is built as a dictionary. Convert this dictionary to a dataframe, and finally merge it into the actual data.


Solution 1

import pandas as pd

df_data=pd.read_csv('list_characters.csv' , header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack() # don't remove the index, it holds the line each record came from
print(df_data)

df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column

cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols #assign the new column names

#create a new dictionary
#it contains the line number as key and all the characters from that line as value

DictChar= {k: list(v) for k,v in df_data.groupby("FromLine")["char"]}

#convert the dictionary to a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine', 'char']

#merge the dataframes on column 'FromLine'
df_final = df_data.merge(df_chars, on='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns = cols
print(df_final)

Solution 2

Personally, I like this solution better than the first one. See the inline comments for more details.

import pandas as pd

df_data=pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

x=len(df_data.columns) #get total number of columns 
#get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1]=df_data.index.get_level_values(0) 
 # now set line number and character columns as Index of data frame
df_data.set_index([x+1,x],inplace=True,drop=True)

df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns=cols
df_data.reset_index(inplace=True) #remove mutiindexing
print(df_data[cols])

Output

      char  left   top right bottom  from line all_chars_in_same_row
0     '.'   203   213   191    198          0  ['.', '3', 'C']
1     '3'  1758  1775   370    391          0  ['.', '3', 'C']
2     'C'   296   305  1492   1516          0  ['.', '3', 'C']
3     'A'   275   347   147    239          1  ['A', 'M', 'D']
4     'M'  2166  2184   370    391          1  ['A', 'M', 'D']
5     'D'   339   362  1815   1840          1  ['A', 'M', 'D']
6     'A'    73    91   373    394          2  ['A', 'D', 'A']
7     'D'  1395  1415   427    454          2  ['A', 'D', 'A']
8     'A'  1440  1455  2047   2073          2  ['A', 'D', 'A']
9     'D'   454   473   663    685          3  ['D', 'O', '0']
10    'O'  1533  1545   487    541          3  ['D', 'O', '0']
11    '0'   339   360  2137   2163          3  ['D', 'O', '0']
12    'A'   108   129   727    751          4  ['A', 'V', 'I']
13    'V'  1659  1677   490    514          4  ['A', 'V', 'I']
14    'I'   339   360  1860   1885          4  ['A', 'V', 'I']
15    'N'    34    51   949    970          5  ['N', '/', '2']
16    '/'  1890  1904   486    505          5  ['N', '/', '2']
17    '2'  1266  1283  1951   1977          5  ['N', '/', '2']
18    'S'  1368  1401    43     85          6  ['S', 'A', '8']
19    'A'  1344  1361   583    607          6  ['S', 'A', '8']
20    '8'  2207  2217  1492   1515          6  ['S', 'A', '8']
21    'S'  1437  1457   112    138          7  ['S', 'o', 'O']
22    'o'  1548  1580   979   1015          7  ['S', 'o', 'O']
23    'O'  1331  1349   370    391          7  ['S', 'o', 'O']
24    'h'  1686  1703   315    339          8  ['h', 't', 't']
25    't'   169   190  1291   1312          8  ['h', 't', 't']
26    't'   169   190  1291   1312          8  ['h', 't', 't']
27    'N'  1331  1349   370    391          9  ['N', 'C', 'C']
28    'C'   296   305  1492   1516          9  ['N', 'C', 'C']
29    'C'   296   305  1492   1516          9  ['N', 'C', 'C']