I have a dataset, and I need to restructure some of its data into a new layout.
My dataset looks like this (stored in a file named train1.txt):
2342728,2414939,2397722,2386848,2398737,2367906,2384003,2399896,2359702,2414293,2411228,2416802,2322710,2387437,2397274,2344681,2396522,2386676,2413824,2328225,2413833,2335374,2328594,497966,2384001,2372746,2386538,2348518,2380037,2374364,2352054,2377990,2367915,2412520,2348070,2356469,2353541,2413446,2391930,2366968,2364762,2347618,2396550,2370538,2393212,2364244,2387901,4752,2343855,2331890,2341328,2413686,2359209,2342027,2414843,2378401,2367772,2357576,2416791,2398673,2415237,2383922,2371110,2365017,2406357,2383444,2385709,2392694,2378109,2394742,2318516,2354062,2380081,2395546,2328407,2396727,2316901,2400923,2360206,971,2350695,2341332,2357275,2369945,2325241,2408952,2322395,2415137,2372785,2382132,2323580,2368945,2413009,2348581,2365287,2408766,2382349,2355549,2406839,2374616,2344619,2362449,2380907,2327352,2347183,2384375,2368019,2365927,2370027,2343649,2415694,2335035,2389182,2354073,2363977,2346358,2373500,2411328,2348913,2372324,2368727,2323717,2409571,2403981,2353188,2343362,285721,2376836,2368107,2404464,2417233,2382750,2366329,675,2360991,2341475,2346242,2391969,2345287,2321367,2416019,2343732,2384793,2347111,2332212,138,2342178,2405886,2372686,2365963,2342468
I need to convert it to the following layout (the new file should be saved as train.txt):
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….
My Python version is 2.7.13 and my operating system is Ubuntu 14.04 LTS. Thank you very much for your help.
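Answer 0 (score: 1)
I would suggest using regular expressions (regex). That may be a bit of overkill here, but regex is very powerful and worth knowing in the long run.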
import re

def return_no_commas(string):
    # \d+ matches each run of one or more digits, i.e. one whole number
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)

numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what each part does.
import re
simply imports the regular expression module. The regex I wrote is
regex = r'\d+'
The leading "r" makes this a raw string literal, so the backslash reaches the regex engine unchanged. The pattern matches any digit (the "\d" part) repeated one or more times (the "+" part), so each match is one complete number. We then print all of the matches.
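I saved your numbers in a string called numbers, but you could just as easily read them from a file and use that content instead.
You will get something like this: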
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
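The answer prints the matches rather than saving them, so here is a minimal sketch of the same regex idea applied to files. The function name numbers_to_lines and its return value are assumptions made for illustration, not part of the original answer; the code is intended to run on both Python 2.7 and Python 3.

```python
import re

def numbers_to_lines(src_path, dst_path):
    """Copy every run of digits in src_path to dst_path, one per line."""
    # Read the whole input file into a single string
    with open(src_path) as src:
        text = src.read()
    # Each match is one complete number; separators are ignored
    found = re.findall(r'\d+', text)
    # Write the numbers back out, one per line
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(found) + '\n')
    # Return the count so the caller can sanity-check the result
    return len(found)
```

Calling numbers_to_lines("train1.txt", "train.txt") should then produce the file the question asks for, assuming train1.txt is in the current directory.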
Answer 1 (score: 0)
It looks to me as if your original data is separated by commas, whereas you want it separated by newlines (\n). That is easy to do.
def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert rfilename != wfilename
    # read the whole input file into a string
    with open(rfilename, "r") as rfile:
        rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace, and drop empty items
    lyst = [s.strip() for s in lyst if s.strip()]
    wstryng = "\n".join(lyst)
    # write() takes a single string; `with` closes the file for us
    with open(wfilename, "w") as wfile:
        wfile.write(wstryng)

convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
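To see what the split, strip and join steps do without touching any files, here is a small sketch on an in-memory string. The sample string raw is made up for illustration.

```python
# A made-up sample with stray spaces and a line break between items
raw = "1, 2,3 ,\n4"

# Split on commas, then trim whitespace (including newlines) from each piece
items = [s.strip() for s in raw.split(",")]

# Join the cleaned pieces with newlines, one number per line
result = "\n".join(items)
print(result)
```

The stripping step matters because a real file may wrap across several lines, so some pieces would otherwise keep a leading newline or space.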
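Answer 2 (score: 0)
Since others have already posted answers, I will add one that uses numpy.
If you can use numpy, it is simple:
import numpy as np
data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of a numpy array:
data.tolist()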
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]
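The answer loads the numbers but does not write train.txt, so here is a hedged sketch of one way to finish the job with numpy. The helper name csv_row_to_lines is an assumption for illustration, and the sketch assumes the input is clean: comma-separated integers with no missing fields.

```python
import numpy as np

def csv_row_to_lines(src_path, dst_path):
    """Load comma-separated integers and save them one per line."""
    data = np.genfromtxt(src_path, dtype=int, delimiter=',')
    # Flatten to 1-D so a single value or a 2-D table is handled too
    data = np.atleast_1d(data).ravel()
    # fmt='%d' writes plain integers; a 1-D array gives one value per line
    np.savetxt(dst_path, data, fmt='%d')
    return data
```

With the files from the question this would be called as csv_row_to_lines('train1.txt', 'train.txt'). If the file has rows of unequal length, np.genfromtxt may raise an error, in which case the plain string-splitting approach is the safer choice.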