Question

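Is there an option in pandas' read_csv function that automatically converts every item of an object dtype column to str?

For example, I get the following when trying to read a CSV file: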

mydata = pandas.read_csv(myfile, sep="|", header=None)

C:\...\pandas\io\parsers.py:1159: DtypeWarning: Columns (6,635) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

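Is there a way to (i) suppress the warning from being printed, while (ii) capturing the warning message in a string, from which I can extract the specific columns, e.g. 6 and 635 in this case (so that I can fix the dtype afterwards)? Alternatively, can I specify that whenever a column has mixed types, read_csv should convert the values in that column to str?

I am using Python 3.4.2 and Pandas 0.15.2.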

Answer 1

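DtypeWarning is a Warning, so it can be caught and acted upon. To catch it, wrap the read in a warnings.catch_warnings block. The warning message and the affected columns can then be extracted with a regular expression, and .astype(target_type) sets the correct column type: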

import re
import pandas 
import warnings

myfile = 'your_input_file_here.txt'
target_type = str  # The desired output type

with warnings.catch_warnings(record=True) as ws:
    warnings.simplefilter("always")

    mydata = pandas.read_csv(myfile, sep="|", header=None)
    print("Warnings raised:", ws)
    # A mixed-types warning names specific columns; convert those to the target type
    for w in ws:
        s = str(w.message)
        print("Warning message:", s)
        match = re.search(r"Columns \(([0-9,]+)\) have mixed types\.", s)
        if match:
            columns = match.group(1).split(',') # Get columns as a list
            columns = [int(c) for c in columns]
            print("Applying %s dtype to columns:" % target_type, columns)
            mydata.iloc[:,columns] = mydata.iloc[:,columns].astype(target_type)

The result should be the same DataFrame, with the problematic columns set to str type. Note that string columns in a Pandas DataFrame are reported as object.
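The extraction step in the code above can be tried in isolation, without a large input file. The following is only a minimal sketch: the message text is copied from the question, UserWarning is a stand-in for pandas' DtypeWarning, and the helper name mixed_type_columns is hypothetical.

```python
import re
import warnings

# Message text copied from the question; UserWarning stands in for pandas'
# DtypeWarning so the sketch runs without a large input file.
MESSAGE = ("Columns (6,635) have mixed types. "
           "Specify dtype option on import or set low_memory=False.")
PATTERN = re.compile(r"Columns \(([0-9,]+)\) have mixed types\.")


def mixed_type_columns(recorded):
    """Return the column positions named in any recorded mixed-types warning."""
    columns = []
    for w in recorded:
        match = PATTERN.search(str(w.message))
        if match:
            columns.extend(int(c) for c in match.group(1).split(","))
    return columns


# record=True collects warnings in a list rather than printing them;
# "always" makes sure none of them are filtered out.
with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    warnings.warn(MESSAGE, UserWarning)

columns = mixed_type_columns(recorded)
print(columns)  # [6, 635]
```

Because the warnings are recorded rather than shown, this also covers point (i) of the question: nothing is printed unless the code prints it explicitly.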

Answer 2

As the warning message itself states, the simplest way to avoid pd.read_csv returning mixed dtypes is to set low_memory=False:

df = pd.read_csv(..., low_memory=False)
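The warning text names two remedies, low_memory=False and the dtype option. A minimal sketch of both follows, assuming a small hypothetical pipe-separated sample held in memory (the data and variable names are not from the question, and a sample this small would not trigger the warning by itself):

```python
import io
import pandas as pd

# Hypothetical sample; column 1 mixes numbers and text.
sample = "1|10|a\n2|x|b\n3|30|c\n"

# Remedy 1 from the warning text: do not parse the file in chunks, so each
# column's type is inferred from all of its rows at once.
df_whole = pd.read_csv(io.StringIO(sample), sep="|", header=None, low_memory=False)

# Remedy 2 from the warning text: state the dtype of the problem column.
# With header=None the dict keys are the column positions.
df_typed = pd.read_csv(io.StringIO(sample), sep="|", header=None, dtype={1: str})

column_values = df_typed[1].tolist()
print(column_values)  # ['10', 'x', '30']
```

The dtype route needs the column positions in advance, which is where the columns listed in the warning message become useful.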

However, when concatenating multiple DataFrames with pd.concat, the luxury of setting low_memory=False is not available.
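One way to read this remark (an assumption, since the answer gives no example): each DataFrame infers its column types independently, so a column can be numeric in one part and text in another, and the concatenated column then mixes both even if every part was read with low_memory=False. A minimal sketch with hypothetical two-part data, converting the column after the concat:

```python
import io
import pandas as pd

# Hypothetical two-part input; column 1 is purely numeric in the first part only.
part_a = pd.read_csv(io.StringIO("1|10\n2|20\n"), sep="|", header=None, low_memory=False)
part_b = pd.read_csv(io.StringIO("3|x\n4|40\n"), sep="|", header=None, low_memory=False)

combined = pd.concat([part_a, part_b], ignore_index=True)

# Column 1 now holds both numbers and strings.
types_before = {type(v).__name__ for v in combined[1]}
print(types_before)

# Converting after the concat makes the column uniform again.
combined[1] = combined[1].astype(str)
values_after = combined[1].tolist()
print(values_after)  # ['10', '20', 'x', '40']
```

Passing the same dtype mapping to every read_csv call before concatenating would be another way to keep the parts consistent.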

pandas read_csv: convert mixed-type columns to string