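I'm trying to move some of my processing work from R to Python. In R, I use read.table() to read really messy CSV files, and it automatically splits the records into the correct format. For example: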
391788,"HP Deskjet 3050 scanner always seems to break","<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p>
<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p>
","windows-7 printer hp"
is correctly split into 4 columns. A single record can span many lines, and there are commas all over the place. In R I just do:
read.table(infile, header = FALSE, nrows=chunksize, sep=",", stringsAsFactors=FALSE)
Is there anything in Python that can do the same thing?
Thanks!
Answer 0 (score: 4)
You can use the csv module.
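from csv import reader

# newline="" leaves line breaks inside quoted fields for the csv module to handle
with open("C:/text.txt", "r", newline="") as infile:
    csv_reader = reader(infile, quotechar="\"")
    for row in csv_reader:
        print(row)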
['391788', 'HP Deskjet 3050 scanner always seems to break', "<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p>\n\n<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p>\n", 'windows-7 printer hp']
Length of the output row = 4
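To get closer to the nrows=chunksize behaviour from the R call, the reader can be consumed a few records at a time. The following is only a minimal sketch: the SAMPLE data and the read_chunks helper are made up for illustration and are not part of the csv module.

```python
import csv
import io
from itertools import islice

# Hypothetical sample: two records, the first one spanning three physical lines.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>First paragraph, with a comma.</p>\n<p>Update: second paragraph.</p>\n",'
    '"windows-7 printer hp"\n'
    '391789,"Another title","<p>Single line body</p>","python csv"\n'
)


def read_chunks(handle, chunksize):
    """Yield lists of up to `chunksize` parsed records from an open text handle."""
    rows = csv.reader(handle, quotechar='"')
    while True:
        # islice pulls at most `chunksize` records (not physical lines).
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


# io.StringIO stands in for a real file opened with newline="".
chunks = list(read_chunks(io.StringIO(SAMPLE, newline=""), chunksize=1))
for chunk in chunks:
    for row in chunk:
        print(len(row), row[0])
```

Because the reader counts records rather than physical lines, a record that spans several lines still ends up in a single chunk entry.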
Answer 1 (score: 2)
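The pandas module also provides many R-like functions and data structures, including read_csv. The advantage here is that the data is read in as a pandas DataFrame, which is a bit easier to manage than a standard Python list or dict (especially if you're already used to R). Here is an example: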
>>> from pandas import read_csv
>>> ugly = read_csv("ugly.csv",header=None)
>>> ugly
        0                                              1  \
0  391788  HP Deskjet 3050 scanner always seems to break

                                                   2                     3
0  <p>I'm running a Windows 7 64 blah blah blah.....  windows-7 printer hp
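read_csv also accepts nrows and chunksize arguments, which seem to be the nearest match for nrows=chunksize in the R call. Here is a small sketch, assuming pandas is installed; the SAMPLE string is invented for illustration and stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample: two records, the first one spanning three physical lines.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>First paragraph, with a comma.</p>\n<p>Update: second paragraph.</p>\n",'
    '"windows-7 printer hp"\n'
    '391789,"Another title","<p>Single line body</p>","python csv"\n'
)

# nrows=1 parses only the first record, even though it spans several lines.
first = pd.read_csv(io.StringIO(SAMPLE), header=None, nrows=1)
print(first.shape)

# chunksize=1 returns an iterator that yields one DataFrame per record.
chunk_shapes = [
    chunk.shape
    for chunk in pd.read_csv(io.StringIO(SAMPLE), header=None, chunksize=1)
]
print(chunk_shapes)
```

With a real file, the io.StringIO(SAMPLE) argument would simply be replaced by the file path.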