我的csv文件中有一个列(第二列名为second_column),表示字符列表及其位置如下:
名为character_position
此列的每一行都包含一个character_position列表。总体而言,此列中有300行,每行包含字符位置列表
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
每个角色都有值:左,上,右,下。例如
字符'1'
已离开= 1890,顶部= 1904,右= 486,底部= 505。
我想创建另一个包含五列的csv文件:
column 1: character, column 2 : left , column 3 : top, column 4 : right, column 5 : bottom.
以下是我的所作所为:
import pandas as pd :
character= []
left_position = []
right_position = []
top_position=[]
bottom_position []
temp_df = pd.DataFrame({'character': position_char, 'left': left_position, 'top': top_position, 'right' : right_position,'bottom': bottom_position})
temp_df.to_csv('character_position.csv')
我的文件包含一列中的characters_position列,可以按如下方式访问:
import csv
with open('list_characters.csv') as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
for row in readCSV:
content = list(row[i] for i in second_column)
如何将文件second_column
的{{1}}的内容提取并拆分为具有五列的新csv文件list_characters.csv
?
我的第一行character_position如下:
character_position.csv
我的第二行character_position如下:
[['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
以下是l访问字符位置列的方法:
[['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
当我尝试:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position])
df
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
.. ...
70 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
71 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
72 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
73 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
74 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
75 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
76 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
77 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
78 [['S', 2369, 2382, 1833, 1866]]
79 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
80 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
81 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
82 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
83 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
84 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
85 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
86 [['c', 88, 118, 2872, 2902]]
87 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
88 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
89 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
90 [[]]
91 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
92 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
93 [[]]
94 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
95 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
96 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
97 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
98 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
99 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
然后如你所说的那样申请:
df['character_position'][1]
"[['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]"
df['character_position'][2]
"[['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]"
我得到这个错误:
df=pd.DataFrame(np.array(list(chain.from_iterable( df['character_position'][2]))).reshape(-1,5),
columns=cols)
为了检查我做了这个:
Traceback (most recent call last):
File "/usr/lib/python3.5/code.py", line 91, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
ValueError: cannot reshape array of size 127 into shape (5)
然后:
Y=df['character_position'][2]
"[['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]"
但我们应该得到
Y[0] gives '['
因为它是一个嵌套列表。
我认为我们需要一个迭代该列整行的循环:
['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]
执行以下代码后:
df.ix[:len(df)]
df=pd.DataFrame(np.array(list(chain.from_iterable( df['character_position'].ix[:len(df)]))).reshape(-1,5),
columns=cols)
我遇到了这个错误:
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
代码一直运行到df1.columns = cols
Traceback (most recent call last):
File "/usr/lib/python3.5/code.py", line 91, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3054, in astype
raise_on_error=raise_on_error, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3189, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3056, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 461, in astype
values=values, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 504, in _astype
values = _astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/local/lib/python3.5/dist-packages/pandas/types/cast.py", line 534, in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/lib.pyx", line 980, in pandas.lib.astype_intsafe (pandas/lib.c:17409)
File "pandas/src/util.pxd", line 93, in util.set_value_at_unsafe (pandas/lib.c:72104)
ValueError: invalid literal for int() with base 10: "['E'"
答案 0 :(得分:1)
我认为您需要先使用strip
和split
将strings
转换为lists
。
然后使用链来改变价值并重塑形状。如有必要,最后将列转换为int。
from itertools import chain
#temporaly display content
with pd.option_context('display.max_colwidth', 120):
print (df)
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 222]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 3333]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 3334]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 33]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 444]]
df.character_position = df.character_position.str.strip('[]').str.split(', ')
cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame(np.array(list(chain.from_iterable(df.character_position))).reshape(-1,5),
columns=cols)
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
MultiIndex
的另一个解决方案,并按stack
重新整形:
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444