Question

I have a dataset, and I need to restructure some of its data into a new format.

My dataset looks like this (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the following format (the new file should be saved as train.txt):

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….

My Python version is 2.7.13 and my operating system is Ubuntu 14.04 LTS. Thank you very much for your help.

Answer 1

I would suggest using regular expressions (regex). It may be a bit of overkill here, but regex is a very powerful tool to know in the long run.

import re

def return_no_commas(string):
    # \d+ matches one or more digits; \d* would also match empty strings
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)


numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what it all does.

import re

just imports the regular expression module. The regex I wrote is

regex = r'\d+'

The leading "r" makes this a raw string, so the backslash reaches the regex engine unchanged. The pattern looks for any digit (that is the "\d" part) repeated one or more times (that is the "+" part). A "*" would allow zero repetitions, so it would also match the empty string between the numbers and print blank lines. Then we print out all the matches.

I saved your numbers in a string called numbers, but you could just as easily read them in from a file and use that content.

This prints each number on its own line.
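Since the question asks for a file rather than printed output, here is a minimal sketch of the same regex idea applied to files. The helper names extract_numbers and rewrite_one_per_line are made up for illustration, and the sketch assumes the file contains only non-negative integers and separators. It uses a "+" quantifier so that no empty matches are returned.

```python
import re


def extract_numbers(text):
    """Return every run of ASCII digits in text, in order of appearance."""
    # [0-9]+ requires at least one digit, so no empty strings are returned
    return re.findall(r'[0-9]+', text)


def rewrite_one_per_line(src_path, dst_path):
    """Read src_path, write each number on its own line to dst_path.

    Returns how many numbers were written.
    """
    with open(src_path) as src:
        numbers = extract_numbers(src.read())
    with open(dst_path, 'w') as dst:
        # one number per line, each line terminated by a newline
        dst.write(''.join(number + '\n' for number in numbers))
    return len(numbers)
```

For the files in the question the call would be rewrite_one_per_line("train1.txt", "train.txt"). Because the pattern only looks at digits, stray spaces or doubled commas in the input do not matter, but negative or decimal values would need a different pattern.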

Answer 2

It looks to me like your original data is separated by commas, whereas you want it separated by newlines (\n). That is easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert(rfilename != wfilename)
    # open two files, one in read-mode, the other in write-mode;
    # the with-statement closes both even if an error occurs
    with open(rfilename, "r") as rfile, open(wfilename, "w") as wfile:
        # read the file into a string
        rstryng = rfile.read()

        lyst = rstryng.split(",")
        # EXAMPLE:
        #     rstryng == "1,2,3,4"
        #     lyst    == ["1", "2", "3", "4"]

        # remove leading and trailing whitespace
        lyst = [s.strip() for s in lyst]

        wstryng = "\n".join(lyst)
        # write() takes a single string; writelines() expects a list of lines
        wfile.write(wstryng)


convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
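
If you would rather not check the output by eye, something like the following could do it for you. This is only a sketch: find_bad_lines is a hypothetical helper, and it assumes every line of train.txt should hold exactly one non-negative integer.

```python
def find_bad_lines(text):
    """Return (line_number, content) pairs for every line that is not
    made up purely of digits. Line numbers start at 1."""
    bad = []
    for number, line in enumerate(text.splitlines(), 1):
        # str.isdigit() is False for empty strings and for anything
        # containing commas, spaces between digits, or signs
        if not line.strip().isdigit():
            bad.append((number, line))
    return bad
```

You would read train.txt into a string and pass it in; an empty list means every line holds a single number. Empty lines are reported too, which helps catch doubled commas in the input.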

Answer 3

Since others have already added answers, I will add one that uses numpy. If you are able to use numpy, it is simple:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list rather than a numpy array, then

data.tolist()

[2342728,
 2414939,
 2397722,
 2386848,
 2398737,
 2367906,
 2384003,
 2399896,
 ....
]
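
The snippet above only loads the numbers. To also produce train.txt, one option is np.savetxt, which writes a 1-D array with one element per line. The following is a sketch rather than a tested recipe for your exact file; the function name comma_file_to_lines is made up for illustration.

```python
import numpy as np


def comma_file_to_lines(src_path, dst_path):
    """Load comma-separated integers from src_path and write them to
    dst_path, one per line."""
    data = np.genfromtxt(src_path, dtype=int, delimiter=',')
    # genfromtxt returns a 0-d array for a single value; savetxt needs 1-D or 2-D
    data = np.atleast_1d(data)
    # fmt='%d' writes plain integers instead of the default float notation
    np.savetxt(dst_path, data, fmt='%d')
```

For the files in the question the call would be comma_file_to_lines('train1.txt', 'train.txt').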

How can I rebuild and change the structure of a dataset using Python?
