Question

I have a .txt file that is currently formatted like this:

John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
...

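The first column never has any missing values.

I am trying to convert this to a .csv file using Python. I know how to do it when every row has data for all columns, but my .txt is missing data in some columns. How can I convert it to a .csv while making sure each column holds the same type of data? Thanks :)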

Answer 1

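Split on commas. You know the pattern should be word, word, int (I assume), and a string of the form www.word.word.

If there is only one word at the front instead of two, add a comma after the first word.
If the number is missing, add a comma after the second word.
And so on...

Say you get the line "Susie,www.regexr.com": you know a word and a number are missing, so add 2 commas after the first word.

It is essentially a bunch of if statements or a switch-case statement.

There is probably a more elegant way to do this, but my brain is fried from dealing with server and phone issues all morning.

This is untested, and I hope I don't embarrass myself: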

    import re

    # Assumption: URLs look like www.word.word and numbers are plain digits
    URL_RE = re.compile(r'www\.\S+\.\S+')
    NUM_RE = re.compile(r'\d+')

    # read_line is one line read from the .txt file (sample value shown here)
    read_line = 'Anita,35,www.website.com'
    split_line = read_line.strip().split(',')
    num_elements = len(split_line)  # compute this only once
    if num_elements == 3:  # one element is missing; work out which one
        if URL_RE.fullmatch(split_line[2]):  # the last element is a URL
            if NUM_RE.fullmatch(split_line[1]):  # the middle element is the number
                # only the string column is missing, so add an empty field for it
                read_line = split_line[0] + ',,' + split_line[1] + ',' + split_line[2]
            else:
                # the number is missing, so add an empty field before the URL
                read_line = split_line[0] + ',' + split_line[1] + ',,' + split_line[2]
        else:
            # the last element is not a URL, so add an empty field in its place
            read_line = split_line[0] + ',' + split_line[1] + ',' + split_line[2] + ','
    elif num_elements == 2:  # two elements are missing; the first is always present
        if URL_RE.fullmatch(split_line[1]):  # the second element is a URL
            # insert commas for the missing string and number
            read_line = split_line[0] + ',,,' + split_line[1]
        elif NUM_RE.fullmatch(split_line[1]):  # the second element is a number
            # insert commas for the missing string and URL
            read_line = split_line[0] + ',,' + split_line[1] + ','
        else:
            # insert commas for the missing number and URL
            read_line = split_line[0] + ',' + split_line[1] + ',,'
    elif num_elements == 1:
        # only the first column is present
        read_line = split_line[0] + ',,,'

Answer 2

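I thought about your problem, and I can only offer a half-baked solution, because your CSV file does not use something like ,, to mark missing data.

Your current CSV file looks like this: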

John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com

If you can find a way to change your file to this format:

John,bread,17,www.google.com
Emily,apples,24,
Anita,,35,www.website.com
Charles,banana,,www.stackoverflow.com
Susie,french fries,31,www.regexr.com

then you can use the following solution. For reference, I put your input into a text file:

In [1]: import pandas as pd
In [2]: population = pd.read_csv('input_to_csv.txt', header=None, dtype=str)  # no header row; keep values as text
In [3]: mod_population = population.fillna("NaN")
In [4]: mod_population.to_csv('output_to_csv.csv', index=False, header=False)
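
For anyone without pandas, roughly the same fill step can be sketched with the standard csv module. This is a swapped-in technique, not what the answer above used, and it assumes the file has already been padded with empty fields as in the second listing. The function name `fill_missing` and the `placeholder` parameter are made up for illustration.

```python
import csv
import io

# Sample text, already padded with empty fields for the missing values.
padded = (
    "John,bread,17,www.google.com\n"
    "Emily,apples,24,\n"
    "Anita,,35,www.website.com\n"
    "Charles,banana,,www.stackoverflow.com\n"
    "Susie,french fries,31,www.regexr.com\n"
)


def fill_missing(text, placeholder="NaN"):
    """Replace every empty field with a placeholder and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([field if field else placeholder for field in row])
    return out.getvalue()


print(fill_missing(padded))
```

To work on real files, the two `io.StringIO` objects could be swapped for files opened with `newline=''`.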

Answer 3

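If some uniformity can be assumed, I would suggest regex checks. For example, build a list of regex patterns, since each piece of data appears to have a different shape.

If the second column you read in matches only letters and spaces, it is probably the food. If it matches a number instead, you should assume the food is missing. If it matches a URL, you are missing both. You will want to be thorough with your test cases, but if the real data resembles your sample, you have 3 fairly distinct cases: string, integer, and URL. That should make writing the regexes relatively simple. Importing re and using re.search should let you test each regex without much overhead.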
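
A minimal sketch of that idea, assuming the real data looks like the sample rows in the question. The column labels (`name`, `food`, `number`, `url`), the patterns, and the helper `align_row` are all hypothetical names chosen for illustration. It uses `fullmatch` rather than `re.search` so that a value such as a URL containing digits is not mistaken for a number.

```python
import re

# Assumed patterns, based only on the sample rows in the question.
PATTERNS = [
    ("food", re.compile(r"[A-Za-z ]+")),     # letters and spaces only
    ("number", re.compile(r"\d+")),          # plain integer
    ("url", re.compile(r"www\.\S+\.\S+")),   # www.word.word
]
COLUMNS = ["name"] + [label for label, _ in PATTERNS]


def align_row(line):
    """Return a 4-item list with each value placed in its matching column."""
    fields = [f.strip() for f in line.strip().split(",")]
    row = {"name": fields[0]}  # the first column is never missing
    for value in fields[1:]:
        for label, pattern in PATTERNS:
            # Take the first unfilled column whose pattern matches the whole value.
            if label not in row and pattern.fullmatch(value):
                row[label] = value
                break
    # Columns that never matched are left as empty strings.
    return [row.get(col, "") for col in COLUMNS]


sample = [
    "John,bread,17,www.google.com",
    "Emily,apples,24,",
    "Anita,35,www.website.com",
    "Charles,banana,www.stackoverflow.com",
    "Susie,french fries,31,www.regexr.com",
]
for line in sample:
    print(",".join(align_row(line)))
```

Each aligned row could also be handed to `csv.writer(...).writerow(...)` to produce the final .csv, which takes care of quoting if a value ever contains a comma.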

Converting a .txt to .csv when some rows are missing data in some columns (python)
