From a folder

Time: 2019-01-20 07:18:46

Tags: python pandas dataframe unicode

There are 2 CSV files in the same location: 1- candidates.csv 2- Store.csv

When I import the candidates.csv file with this code, it is imported successfully:

data=pandas.read_csv("C:\\Users\\Nupur\\Desktop\\Ankit\\candidates.csv")

But when I import the Store.csv file with the same code, I get an error:

data=pandas.read_csv("C:\\Users\\Nupur\\Desktop\\Ankit\\Store.csv")

Error:

  

UnicodeDecodeError                        Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 9: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
----> 1 data = pandas.read_csv("C:\\Users\\Nupur\\Desktop\\Ankit\\Store.csv")

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677
--> 678         return _read(filepath_or_buffer, kwds)
    679
    680     parser_f.__name__ = name

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    444
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1034                 raise ValueError('skipfooter not supported for iteration')
   1035
-> 1036         ret = self._engine.read(nrows)
   1037
   1038         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 9: invalid start byte
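The last line of the traceback is the informative one: the byte 0xf6 can never start a UTF-8 sequence, so the file is almost certainly not UTF-8. Below is a minimal sketch, using only the standard library, of how one might locate such a byte in raw data. The helper name `first_bad_byte` and the sample bytes are hypothetical and are not taken from Store.csv.

```python
def first_bad_byte(raw: bytes, encoding: str = "utf-8"):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte within `raw`
        return exc.start, raw[exc.start]
    return None


# Hypothetical data: a city name saved as Latin-1, where "o with diaeresis"
# is stored as the single byte 0xf6
sample = "id,city\n1,G\u00f6teborg\n".encode("latin-1")
bad = first_bad_byte(sample)
print(bad)
```

For a real file, the same helper could be applied to the bytes returned by `open(path, 'rb').read()`.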

3 answers:

Answer 0 (score: 1)

Try using this:

data=pandas.read_csv("C:\\Users\\Nupur\\Desktop\\Ankit\\Store.csv",encoding = "ISO-8859-1")
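A possible reason this works: ISO-8859-1 (Latin-1) assigns a character to every byte value from 0x00 to 0xff, so decoding never raises, and 0xf6 maps to "ö". The sketch below illustrates this with a made-up byte string (not data from Store.csv). It also shows that cp1252, the usual Windows variant, differs from ISO-8859-1 in the 0x80-0x9f range, so a file that decodes without error may still show a few wrong characters.

```python
raw = b"K\xf6ln"  # hypothetical field value containing the byte 0xf6

# UTF-8 rejects 0xf6 as a start byte
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both single-byte encodings decode 0xf6 to the same character
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# They disagree on 0x80: a control character in ISO-8859-1,
# the euro sign in cp1252
x80_latin1 = b"\x80".decode("iso-8859-1")
x80_cp1252 = b"\x80".decode("cp1252")

print(utf8_ok, ascii(as_latin1), ascii(x80_latin1), ascii(x80_cp1252))
```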

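Answer 1 (score: 1)

If you get an encoding error because the file's encoding is not the default one mentioned in the pd.read_csv() documentation, you can install chardet and then run the following to find the file's encoding: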

import chardet    
rawdata = open('D:\\path\\file.csv', 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print(charenc)

This will give you the file's encoding.

Once you have the encoding, you can read the file as:

pd.read_csv('D:\\path\\file.csv',encoding = 'encoding you found')

pd.read_csv(r'D:\path\file.csv',encoding = 'encoding you found')

You can find the list of all encodings here

Hope you find this useful.
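A variation on the detection step above, offered as a sketch rather than a definitive recipe: it reads only the first part of the file, closes the file handle, and also returns chardet's confidence, since detection is a heuristic guess. The function name `guess_encoding` and the `sample_size` default are assumptions made for illustration. The guarded import only keeps the sketch importable when chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch then returns (None, 0.0)


def guess_encoding(path, sample_size=100_000):
    """Guess a file's encoding from its first `sample_size` bytes.

    Returns (encoding, confidence). The encoding is None when chardet is
    not installed or cannot make a guess.
    """
    if chardet is None:
        return None, 0.0
    with open(path, "rb") as fh:  # binary mode: we want the raw bytes
        sample = fh.read(sample_size)
    result = chardet.detect(sample)
    return result["encoding"], result["confidence"]
```

A low confidence value is a hint to check the decoded text by eye before trusting the guess. A non-None encoding can be passed on as `pd.read_csv(path, encoding=encoding)`.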

Answer 2 (score: 0)

Have you tried this?

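data=pandas.read_csv("C:\\Users\\Nupur\\Desktop\\Ankit\\Store.csv", encoding='utf-8')

If that does not work, the file is in a different encoding. I would suggest trying a few encodings that are common on Windows: encoding='iso-8859-1', encoding='cp1252' or encoding='latin1'.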
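The idea of trying a few encodings in turn can be sketched with the standard library alone. The helper name `read_text_with_fallback` and the order of candidates are assumptions for illustration. ISO-8859-1 goes last because it accepts every byte and therefore never fails, which would hide a better match.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes the file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings} could decode {path!r}")
```

The returned encoding name can then be passed on, for example as `pandas.read_csv(path, encoding=enc)`. A successful decode only means the bytes are valid in that encoding, not that the characters are the intended ones, so it is worth checking a few non-ASCII values by eye.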

Alternatively, try adding r in front of the file name so that it is treated as a raw string, in which backslashes are not treated specially.
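A short illustration of what the raw-string prefix changes, using the path from the question plus a hypothetical `C:\temp` path. Note that this only affects how the path literal is written; it does not change how the file's contents are decoded.

```python
# Doubled backslashes and a raw string produce the same path
escaped = "C:\\Users\\Nupur\\Desktop\\Ankit\\Store.csv"
raw = r"C:\Users\Nupur\Desktop\Ankit\Store.csv"
same = escaped == raw

# In a normal literal, "\t" is a tab character; in a raw string it stays
# as a backslash followed by "t" (hypothetical path for illustration)
normal_len = len("C:\temp")  # C, :, TAB, e, m, p
raw_len = len(r"C:\temp")    # C, :, \, t, e, m, p

print(same, normal_len, raw_len)
```

Written as a normal literal with single backslashes, the path from the question would not even compile, because `\U` starts a Unicode escape.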