Question

We have several large .tsv files and are trying to extract data from them. We pass the list of columns as an argument, but we keep getting ValueError: Duplicate names are not allowed.

However, if the value is passed directly as names=['coming_from','article','referrer_type','n'], there is no problem.

Here is my code:

import datetime,json,pyodbc
import pandas as pd

class LoadData:
    def __init__(self,colname):
        self.colsname=colname

    def _read_file_extract_col_data_convert_data_into_json(self):
        _filepath_name = r"E:\PythonScripts\PyCharm\files\source\clickstream-enwiki-2018-12.tsv"
        col_names='['+self.colsname+']'
        for chunk in (pd.read_csv(_filepath_name, delimiter="\t", header=0,names=col_names, 
        chunksize=1,mangle_dupe_cols=True)):
            json_chunk = chunk.to_json(orient="records", force_ascii=True, default_handler=None)
            print(json_chunk)

list=[]
collist="'coming_from'","'article'","'referrer_type'","'n'"


p1=LoadData(','.join(collist))
p1._read_file_extract_col_data_convert_data_into_json()

Error:

Traceback (most recent call last):
  File "E:/PythonScripts/PyCharm/PythonScripts_withPyCharm/DataIngestionScripts/File_To_JSON.py", line 54, in <module>
    p1._read_file_extract_col_data_convert_data_into_json()
  File "E:/PythonScripts/PyCharm/PythonScripts_withPyCharm/DataIngestionScripts/File_To_JSON.py", line 17, in _read_file_extract_col_data_convert_data_into_json
    for chunk in (pd.read_csv(_filepath_name, delimiter="\t", header=0,names=col_names, chunksize=1,mangle_dupe_cols=True)):
  File "C:\python_customize_install_location\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\python_customize_install_location\lib\site-packages\pandas\io\parsers.py", line 449, in _read
    _validate_names(kwds.get("names", None))
  File "C:\python_customize_install_location\lib\site-packages\pandas\io\parsers.py", line 415, in _validate_names
    raise ValueError("Duplicate names are not allowed.")
ValueError: Duplicate names are not allowed.

Answer 1

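The first problem with your code is how the list of column names is handled. It starts out as a single string. Even after you run col_names='['+self.colsname+']', it is still a single string (merely wrapped in square brackets), whereas it should be a list of column names.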
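A minimal sketch of the difference, using only plain Python. It rebuilds the string the same way the question's code does and compares it with a real list. The idea that the duplicate check boils down to comparing len(names) with len(set(names)) is an assumption about the pandas version in the traceback, not something confirmed here; the variable names as_string, as_list and parsed are made up for illustration.

```python
# Rebuild the value exactly as the question's code does: one long string.
collist = "'coming_from'", "'article'", "'referrer_type'", "'n'"
as_string = "[" + ",".join(collist) + "]"

# What read_csv needs instead: a real list with one entry per column.
as_list = ["coming_from", "article", "referrer_type", "n"]

# A string is a sequence of characters, and many of them repeat
# (quotes, commas, letters), so a uniqueness check on it fails.
print(type(as_string).__name__)
print(len(set(as_string)) < len(as_string))

# The list has four distinct entries, so the same check passes.
print(len(set(as_list)) == len(as_list))

# If the names really must arrive as one string, split it back into a list.
parsed = [part.strip().strip("'") for part in ",".join(collist).split(",")]
print(parsed)
```

If the column names can be passed around as a list from the start, the last step is unnecessary.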

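The second issue is that you call read_csv with header=0 together with the names=... parameter, which means:

row 0 does contain column names,
but you override them with your own names (the list).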
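The effect of combining the two parameters can be sketched with a small in-memory file. The TSV content and the variable names below are made up for illustration; io.StringIO stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical in-memory TSV standing in for the real file.
tsv = "aa\tbb\tcc\tdd\na1\ta2\ta3\ta4\nb1\tb2\tb3\tb4\n"
names = ["coming_from", "article", "referrer_type", "n"]

# header=0 alone: row 0 supplies the column names.
from_file = pd.read_csv(io.StringIO(tsv), sep="\t", header=0)

# header=0 plus names: row 0 is still consumed as the header,
# but its labels are replaced by the list passed in names.
renamed = pd.read_csv(io.StringIO(tsv), sep="\t", header=0, names=names)

print(list(from_file.columns))
print(list(renamed.columns))
print(len(renamed))  # number of data rows; the header row is not counted
```

In both calls the first line of the file is treated as a header rather than as data; only the labels differ.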

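In my experience, the list of columns passed as names should have the same length as the actual number of data columns; otherwise various "side effects" can occur.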

My suggestion is to change your code to something like this:

import pandas as pd

class LoadData:
    def __init__(self, colnames):
        # colnames must be a real list of strings, not one joined string
        self.colnames = colnames

    def _read_file_extract_col_data_convert_data_into_json(self):
        _filepath_name = r"Input.tsv"
        # header=0 consumes the file's own header row; names replaces its labels
        reader = pd.read_csv(_filepath_name, delimiter="\t",
                             header=0, names=self.colnames, chunksize=1)
        for i, chunk in enumerate(reader):
            print(f'chunk {i}:')
            print(chunk)
            json_chunk = chunk.to_json(orient="records")
            print(json_chunk)

collist = ['coming_from', 'article', 'referrer_type', 'n']
p1 = LoadData(collist)
p1._read_file_extract_col_data_convert_data_into_json()

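Please note:

I passed my own input file name as _filepath_name. In your version, change it to your file name.
I added some extra trace printouts; remove them in the final version.
Passing mangle_dupe_cols makes no sense here, because you override the existing column list with a new list that contains no duplicates.
To keep the code concise, I dropped default_handler and force_ascii from the to_json call, because their default values are just None and True.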

I prepared the input file (Input.tsv) as:

aa  bb  cc  dd
a1  a2  a3  a4
b1  b2  b3  b4
c1  c2  c3  c4

The result I got is:

chunk 0:
  coming_from article referrer_type   n
0          a1      a2            a3  a4
[{"coming_from":"a1","article":"a2","referrer_type":"a3","n":"a4"}]
chunk 1:
  coming_from article referrer_type   n
1          b1      b2            b3  b4
[{"coming_from":"b1","article":"b2","referrer_type":"b3","n":"b4"}]
chunk 2:
  coming_from article referrer_type   n
2          c1      c2            c3  c4
[{"coming_from":"c1","article":"c2","referrer_type":"c3","n":"c4"}]

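Another reason your code may fail is that your input file has more columns than the names list. In that case:

the list passed in names covers only the final columns,
but the initial columns have no names (and will be omitted by to_json).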
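One way to catch such a mismatch early is to compare the number of names with the number of fields in the file's first row before handing anything to read_csv. This is only a sketch using the standard csv module; the helper count_columns and the five-column sample are hypothetical and not part of the original code.

```python
import csv
import io

def count_columns(handle, delimiter="\t"):
    """Return the number of fields in the first row of a delimited file."""
    reader = csv.reader(handle, delimiter=delimiter)
    first_row = next(reader, [])  # empty list if the file has no rows
    return len(first_row)

# Hypothetical in-memory sample with five columns; for a real file,
# open it with open(path, newline="") and pass the handle instead.
sample = io.StringIO("aa\tbb\tcc\tdd\tee\na1\ta2\ta3\ta4\ta5\n")
names = ["coming_from", "article", "referrer_type", "n"]

n_cols = count_columns(sample)
if n_cols != len(names):
    print(f"file has {n_cols} columns but {len(names)} names were given")
```

Only the first row is read, so the check stays cheap even for large files.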

Passing a list of columns as a parameter to pandas read_csv
