Finding the most restrictive data type to which a string can be cast (Python)

Time: 2016-12-26 17:00:23

Tags: python

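I want to read text files such as CSV or tab-separated values, but only the columns that can be converted to numbers. For example, if a column contains only strings, I don't want to read it. This is because I want to avoid numpy arrays with mixed data types (I would be interested to know why I should avoid them, but that is not what this question is about).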
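As a rough illustration of the goal, here is a minimal sketch using only the standard library. The helper names `is_numeric` and `numeric_columns` and the sample data are made up for this example; it assumes the first row is a header and treats a column as numeric only if every cell converts with `float`.

```python
import csv
import io


def is_numeric(cell):
    """Return True if the cell text can be converted to a float."""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def numeric_columns(text, delimiter=','):
    """Return {header: [float, ...]} for the columns whose cells are all numeric."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    result = {}
    for index, name in enumerate(header):
        cells = [row[index] for row in body]
        # Keep the column only if it has data and every cell is numeric.
        if cells and all(is_numeric(cell) for cell in cells):
            result[name] = [float(cell) for cell in cells]
    return result


# Hypothetical sample: 'label' holds strings and is therefore skipped.
sample = "id,value,label\n1,2.5,a\n2,3.5,b\n"
print(numeric_columns(sample))
```

This sketch converts everything to float, so it does not distinguish int from float or handle complex numbers; that distinction is what the rest of the question is about.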

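Various questions come very close to mine (see 1, 2, 3). However, apart from 1, they focus on converting the string rather than on checking which data type it could be converted to, which is why I mostly followed 1 to get the desired result.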

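Numpy's genfromtxt already does something like this (through the "dtype" argument). I suppose I could use it (with "dtype" set to None) and then just check the data type of each column.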
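A minimal sketch of that idea, assuming a comma-separated file with a header row (the sample data is invented, and an in-memory `io.StringIO` stands in for a real file):

```python
import io

import numpy as np

# Hypothetical sample data: two numeric columns and one string column.
sample = "id,value,label\n1,2.5,a\n2,3.5,b\n"

# dtype=None asks genfromtxt to infer a type for each column;
# names=True takes the field names from the first line.
table = np.genfromtxt(io.StringIO(sample), delimiter=',', dtype=None,
                      names=True, encoding='utf-8')

# dtype.kind is 'i' or 'u' for integers and 'f' for floats.
numeric_names = [name for name in table.dtype.names
                 if table.dtype[name].kind in 'iuf']

# Stack the numeric columns into one homogeneous float array.
numeric = np.column_stack([table[name].astype(float) for name in numeric_names])
print(numeric_names)
print(numeric)
```

The check on `kind` deliberately leaves out complex columns (kind 'c'), because converting those with `astype(float)` would drop the imaginary part.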

Here is what I have so far:

def data_type(string_to_test):
    """
    Checks to which data type a string can be cast. The main goal is to convert strings to floats.
    The hierarchy for upcasting goes like this: int->float->complex->string. Thus, int is the most restrictive.
    :param string_to_test: A string of characters that hypothetically represents a number
    :return: The most restrictive data type to which the string could be cast.
    """
    # Do this to also convert floats that use a comma instead of a dot:
    string_to_test = string_to_test.replace(',', '.')
    # First, try to switch from string to int:
    try:
        # This will yield True if string_to_test represents an int, or a float that is equal to an int (e.g.: '1.0').
        if int(float(string_to_test)) == float(string_to_test):
            return int
        else:
            int(string_to_test)
            return float
    except ValueError:
        # If it doesn't work, try switching from string to float:
        try:
            float(string_to_test)
            return float
        except ValueError:
            # Happens with complex numbers and types (e.g.: float(4 + 3j), or float(float64)).
            # If this still doesn't work, try switching from string to complex:
            try:
                # Make sure spaces between operators don't cause any problems (e.g.: '1 + 4j' will not work,
                # while '1+4j' will).
                complex(string_to_test.replace(' ', ''))
                return complex
            # If none of the above worked, the string is said not to represent any other data types (remember this
            # function is supposed to be used on data that is read from files, so checking only for those types should
            # be exhaustive enough).
            except ValueError:
                return str

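My biggest problem with this is that I find it rather ugly, and that there may be cases I haven't thought of. So my question is: "Can it be done in a better way?"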

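Also, I would like to know when it is better to return a string naming the data type rather than the class itself (e.g. returning 'complex' as a string instead of the complex class). For example, I know that I can use either one (the string or the class) when calling the astype method on numpy arrays.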

Thanks in advance!

1 answer:

Answer 0: (score: 1)

Same logic, less "ugly" presentation:

def data_type(string_to_test, types=(int, float, complex)):
    # Strip spaces so that e.g. '1 + 4j' parses as a complex number.
    string_to_test = string_to_test.replace(' ', '')
    for typ in types:
        try:
            value = typ(string_to_test)
        except ValueError:
            pass
        else:
            break
    else:
        # None of the candidate types accepted the string.
        typ = str
    # special cases:
    if typ is float and int in types and value == int(value):
        typ = int
    if typ is int and bool in types and value == bool(value):
        typ = bool
    return typ

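This also lends itself to easier extension, since you can pass a different hierarchy of types. Note that, similarly to your rule "boiling down" float to int, I also boil int down further to bool if bool is among the desired types (by default it is not, since you did not specify it in the question, but it could be).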
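To make the effect of the types argument concrete, here is a small sketch that calls the function with a few hierarchies. The function is repeated only so the block runs on its own, and the sample strings are arbitrary.

```python
def data_type(string_to_test, types=(int, float, complex)):
    # Copy of the function above, repeated so this block is self-contained.
    string_to_test = string_to_test.replace(' ', '')
    for typ in types:
        try:
            value = typ(string_to_test)
        except ValueError:
            pass
        else:
            break
    else:
        typ = str
    if typ is float and int in types and value == int(value):
        typ = int
    if typ is int and bool in types and value == bool(value):
        typ = bool
    return typ


# Default hierarchy: int -> float -> complex, falling back to str.
for text in ('1', '1.0', '1.5', '1 + 4j', 'abc'):
    print(repr(text), data_type(text).__name__)

# Without int in the hierarchy, '1' is reported as float.
print(data_type('1', types=(float, complex)).__name__)

# With bool appended, '1' boils down to bool while '2' stays int.
with_bool = (int, float, complex, bool)
print(data_type('1', types=with_bool).__name__)
print(data_type('2', types=with_bool).__name__)

# Caveat: bool() accepts any string, so once bool is in the tuple,
# non-numeric text is reported as bool instead of str.
print(data_type('abc', types=with_bool).__name__)
```

The last call suggests that if bool is wanted, it may be safer to treat it as a flag for the special-case step rather than as a member of the tuple that the loop tries; that variation is not part of the answer as written.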

I would keep the resulting type object, on the principle of not discarding information when you don't need to (if you need a string, you can always access .__name__).
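A short sketch of why the class is the more convenient return value: going from the class to its name is a single attribute access, while going back from a name to the class needs a lookup. The lookup through `builtins` shown here is just one possible approach and only covers built-in type names.

```python
import builtins

# Stand-in for the class a type-detection function would return for '1+4j'.
detected = complex

# Class -> string: always available through __name__.
name = detected.__name__
print(name)

# The class can also be called directly to perform the conversion.
converted = detected('1+4j')
print(converted)

# String -> class: requires a lookup (here limited to built-in types).
recovered = getattr(builtins, name)
print(recovered is complex)
```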