Cleaning up messy CSV files with Python, the way saving in Excel does

Asked: 2018-01-23 14:38:25

Tags: python pandas csv encoding utf-8

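I'm new to Python and I'm trying to read loads (100s) of CSV files into one dataframe. However, the CSV files are very messy, using multiple delimiters etc. I've searched this site, but nothing I found works. I've tried readlines and pd.read_csv with lots of options, but all I get is errors or empty dataframes. When I open a CSV in Excel it looks fine, and when I save it as a UTF-8 CSV everything works. However, doing this for every file is very tedious, even with a macro. Is there a way to replicate this process in Python code, for example with in2csv? Below I've provided part of a CSV file I need to use, and part of a CSV saved from Excel (which works). To me the main difference looks like whitespace versus comma delimiters, but changing that in pd.read_csv did not help. Many thanks in advance!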

Messy CSV:

"Device name:UU-WGB-JV_1  Device type:SUN2000  Device address:IP Address=62.72.193.88   Device No.=2  Date:2018-01-23 08:51:23  "   
"Generated On"  "Device Status" "Energy Yield of Current Day (kWh)" "Inv. efficiency"(%)    "Total Energy Yield (kWh)"  "Input Power (kW)"  "Active Power (kW)" "Reactive Power (kVar)" "Power Factor"  "Grid Frequency (Hz)"   "Grid A Current (A)"    "Grid B Current (A)"    "Grid C Current (A)"    "Grid A Phase Voltage (V)"  "Grid B Phase Voltage (V)"  "Grid C Phase Voltage (V)"  "PV1 Input Current (A)" "PV2 Input Current (A)" "PV3 Input Current (A)" "PV4 Input Current (A)" "PV5 Input Current (A)" "PV6 Input Current (A)" "PV1 Input Voltage (V)" "PV2 Input Voltage (V)" "PV3 Input Voltage (V)" "PV4 Input Voltage (V)" "PV5 Input Voltage (V)" "PV6 Input Voltage (V)" "Cabinet Temperature (℃)"   
"2017-12-22 00:00:00    "   "Idle: No irradiation"  "0.00"  "0.00"  "45803.34"  "0.000" "0.000" "0.000" "0.000" "0.00"  "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"
"2017-12-22 00:15:00    "   "Idle: No irradiation"  "0.00"  "0.00"  "45803.34"  "0.000" "0.000" "0.000" "0.000" "0.00"  "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"

Good CSV:

Device name:UU-CB_1  Device type:SUN2000  Device address:IP Address=62.140.137.136   Device No.=1  Date:2018-01-22 13:31:51  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Generated On,Device Status,Energy Yield of Current Day (kWh),Inv. efficiency(%),Total Energy Yield (kWh),Input Power (kW),Active Power (kW),Reactive Power (kVar),Power Factor,Grid Frequency (Hz),Grid A Current (A),Grid B Current (A),Grid C Current (A),Grid A Phase Voltage (V),Grid B Phase Voltage (V),Grid C Phase Voltage (V),PV1 Input Current (A),PV2 Input Current (A),PV3 Input Current (A),PV4 Input Current (A),PV5 Input Current (A),PV6 Input Current (A),PV1 Input Voltage (V),PV2 Input Voltage (V),PV3 Input Voltage (V),PV4 Input Voltage (V),PV5 Input Voltage (V),PV6 Input Voltage (V),Cabinet Temperature (℃)
"2017-11-01 00:00:00    ",Idle: No irradiation,0,-,36670.07,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

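4 Answers:

Answer 0 (score: 1)

It seems the first line (the title line) is unrecoverable as CSV, since it contains spaces and unquoted fields. It could be fixed with a specific regex, but I would just skip it.

The remaining lines aren't CSV either: they contain quoted tokens separated by whitespace, which is a piece of cake for shlex.split: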

import csv
import shlex

with open("input.csv") as f:
    title = next(f)  # discard the device-info title line

    with open("output.csv", "w", newline="", encoding="utf-8") as fw:
        cw = csv.writer(fw, delimiter=";")  # may be changed to ","
        # shlex.split breaks each line on whitespace but keeps quoted tokens whole
        cw.writerows(shlex.split(l) for l in f)

Output:

Generated On;Device Status;Energy Yield of Current Day (kWh);Inv. efficiency(%);Total Energy Yield (kWh);Input Power (kW);Active Power (kW);Reactive Power (kVar);Power Factor;Grid Frequency (Hz);Grid A Current (A);Grid B Current (A);Grid C Current (A);Grid A Phase Voltage (V);Grid B Phase Voltage (V);Grid C Phase Voltage (V);PV1 Input Current (A);PV2 Input Current (A);PV3 Input Current (A);PV4 Input Current (A);PV5 Input Current (A);PV6 Input Current (A);PV1 Input Voltage (V);PV2 Input Voltage (V);PV3 Input Voltage (V);PV4 Input Voltage (V);PV5 Input Voltage (V);PV6 Input Voltage (V);Cabinet Temperature (℃)
2017-12-22 00:00:00    ;Idle: No irradiation;0.00;0.00;45803.34;0.000;0.000;0.000;0.000;0.00;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0
2017-12-22 00:15:00    ;Idle: No irradiation;0.00;0.00;45803.34;0.000;0.000;0.000;0.000;0.00;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0

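The file now opens correctly in Excel (note that, depending on the version and regional settings, Excel expects either a comma or a semicolon as the default separator).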
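The title line that the code above discards still carries device metadata. As a rough sketch of the "specific regex" idea, and assuming the fields are always separated by two or more spaces as in the sample, it could be parsed like this (`parse_title` is a hypothetical helper, not part of the original answer):

```python
import re


def parse_title(line):
    """Split the device-info title line into a dict of key/value pairs."""
    # Drop surrounding whitespace and quotes, then split on runs of 2+ spaces.
    # Assumption: fields never contain two consecutive spaces themselves.
    fields = re.split(r"\s{2,}", line.strip().strip('"').strip())
    info = {}
    for field in fields:
        # The first ':' or '=' separates the key from the value.
        parts = re.split(r"[:=]", field, maxsplit=1)
        if len(parts) == 2:
            info[parts[0].strip()] = parts[1].strip()
    return info


sample = '"Device name:UU-WGB-JV_1  Device type:SUN2000  Device address:IP Address=62.72.193.88   Device No.=2  Date:2018-01-23 08:51:23  "   \n'
info = parse_title(sample)
```

Here `info["Device name"]` holds the device name, which could be added as a column when many files are combined.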


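Answer 1 (score: 0)

Here is a more intuitive way for beginners to handle large CSV files. It lets you process groups of rows, or chunks, one at a time.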
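A minimal sketch of what chunked reading can look like with pandas, assuming the `chunksize` argument of `pd.read_csv` is what this answer has in mind. The in-memory CSV text and the row counting are placeholders for a real file and real per-chunk processing:

```python
import io

import pandas as pd

# Stand-in for a large file on disk; a real path could be passed instead.
csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# holding at most `chunksize` rows each, instead of one big DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)  # replace with the real per-chunk processing
```

Only one chunk is held in memory at a time, which is the point of the approach for files too large to load whole.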


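You can read more at http://pandas.pydata.org/

Pandas is a high-performance data analysis library for large data.

Answer 2 (score: 0)

You can read the CSV file as strings and then use a regular expression to handle the splitting. Typically, field separators are commas, semicolons or tabs, and lines end with \n, so reading could look like this: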

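import re

data = []
with open("yourfile.csv", "r") as csvfile:
    for line in csvfile:
        # Split on comma, semicolon, tab or newline; the trailing newline
        # leaves an empty last element, which [:-1] drops.
        data.append(re.split(r"[,;\t\n]", line)[:-1])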

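Now data is a list of lists containing your data, which should be easy to convert to a dataframe or anything else. I included \n in the split because in my tests the line ending was still part of the line. This is only an example; you will no doubt want to turn it into a function and adapt it to your use case.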
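As a sketch of the conversion step, assuming the first inner list is the header row: the `data` literal below is a hypothetical, shortened stand-in for what the loop would produce, with illustrative values.

```python
import pandas as pd

# Hypothetical result of the loop above: header row first, then data rows.
data = [
    ["Generated On", "Device Status", "Total Energy Yield (kWh)"],
    ["2017-11-01 00:00:00", "Idle: No irradiation", "36670.07"],
    ["2017-11-01 00:15:00", "Idle: No irradiation", "36670.07"],
]

# Use the first row as column names and the rest as records.
df = pd.DataFrame(data[1:], columns=data[0])

# Every value is still a string at this point, so numeric columns
# have to be converted explicitly.
df["Total Energy Yield (kWh)"] = pd.to_numeric(df["Total Energy Yield (kWh)"])
```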

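Answer 3 (score: 0)

So, this was mainly an encoding problem. I used an .exe called cpconverter to change the encoding from unicode (1200) to utf-8. Now pd.read_csv works when I pass sep='\t'. It would be nicer if I could change the encoding with a Python script (or use the original encoding), but for now it works. Thanks for all the effort and help!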
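The re-encoding step can also be done in Python. Windows code page 1200 is UTF-16 (little endian), so, assuming the files start with a byte order mark, a sketch could look like this (`reencode` and the file names are hypothetical; the temporary file only makes the example self-contained):

```python
import os
import tempfile


def reencode(src_path, dst_path, src_encoding="utf-16", dst_encoding="utf-8"):
    """Rewrite a text file in another encoding, line by line."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Demo: create a small UTF-16 file, convert it, and read back the raw bytes.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "raw.csv")
dst = os.path.join(tmp_dir, "clean.csv")
with open(src, "w", encoding="utf-16", newline="") as f:
    f.write('"Generated On"\t"Device Status"\r\n')

reencode(src, dst)

with open(dst, "rb") as f:
    converted = f.read()
```

If the files have no byte order mark, `src_encoding="utf-16-le"` would be the value to try instead.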

Edit: passing encoding='utf-16' to pd.read_csv now solves everything. I don't know how I missed that, but the original encoding was apparently utf-16.
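Putting this together for many files, a sketch could look like the following. The `encoding='utf-16'` and `sep='\t'` arguments come from this answer; `skiprows=1` (to drop the device title line), the `source_file` column and the generated demo files are assumptions made for illustration:

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny UTF-16, tab-separated stand-ins for the real exports.
tmp_dir = tempfile.mkdtemp()
for n in (1, 2):
    path = os.path.join(tmp_dir, f"device_{n}.csv")
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(f'"Device name:demo_{n}  "\r\n')
        f.write('"Generated On"\t"Total Energy Yield (kWh)"\r\n')
        f.write('"2017-12-22 00:00:00    "\t"45803.34"\r\n')

frames = []
for path in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
    # skiprows=1 skips the title line so the second line becomes the header.
    df = pd.read_csv(path, sep="\t", encoding="utf-16", skiprows=1)
    df["source_file"] = os.path.basename(path)  # remember where each row came from
    frames.append(df)

# Stack all files into one DataFrame with a fresh index.
combined = pd.concat(frames, ignore_index=True)

# The timestamps carry trailing spaces inside the quotes, so strip before parsing.
combined["Generated On"] = pd.to_datetime(combined["Generated On"].str.strip())
```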