Question

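TL;DR - I pulled tweets into CSV files. Some of them contain random invisible escape characters. Because of this, the CSV reader "incorrectly" detects the end of a line and stops reading. What can I do to prevent this?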

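This is an image of the line in Notepad++. The line is much longer than this, but the program treats the black/white "SUB" as the end of the file and moves on to the next file in my directory. http://imgur.com/HfBzS6D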

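I'm trying to read files of about 300,000 rows each.

I suspected that something was being lost (only some things were being logged, so it was hard to notice at first), so I decided to count the rows while reading one of my files.

The program stops after the third of the rows below. These are rows 2968 to 2970; out of roughly 300,000 rows, the last one processed is 2970.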

Processing Row : 2968
['@HillaryClinton', '16196000', 'NA', 'Brian', 'brianwatkins74', '44', '25', 'Virgina Beach, VA', '0', '0', 'Sun Nov 01 02:17:53 +0000 2015', '@HillaryClinton Trump has MORE followers now!!!! :) Sweet!!!!!!!!!!!!!!!!!!', '6.61E+17', '6.60642E+17', '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'NA', 'NA', 'NA', '1339835893', 'NA']
Processing Row : 2969
['@HillaryClinton', '617660000', 'NA', 'Logan Jackson', 'mischivishlj', '989', '1379', '', '0', '0', 'Sun Nov 01 02:17:46 +0000 2015', 'RT @BookOfTamara: @HillaryClinton @CBSNews this tweet about an article your intern read really shows your dedication to the issue', '6.61E+17', '6.60642E+17', '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'NA', 'NA', 'NA', '1339835893', 'NA']
Processing Row : 2970
['@HillaryClinton', '1224300000', 'NA', '\x80\xb3\xda\x18', 'gingasenki', '5', '1', '', '0', '0', 'Sun Nov 01 02:17:43 +0000 2015', '.@HillaryClinton No TPP for Japan nor the World! #\xfd']

I've checked the whole file in Excel / Notepad++ and can't see anything out of place.
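One way to check for characters that an editor may not show clearly is to scan the file in binary mode, where every byte is returned exactly as stored, and report the lines that contain byte 0x1A (the control character Notepad++ displays as "SUB"). This is only a minimal sketch: the function name `find_sub_bytes` is hypothetical, and the line numbers it reports are physical lines in the file, which can differ from CSV row numbers if any field contains an embedded newline.

```python
def find_sub_bytes(path):
    """Return (line_number, count) pairs for lines containing byte 0x1A."""
    hits = []
    # Binary mode: bytes are returned exactly as stored on disk.
    with open(path, 'rb') as f:
        for line_number, line in enumerate(f, 1):
            count = line.count(b'\x1a')
            if count:
                hits.append((line_number, count))
    return hits
```

Calling `find_sub_bytes(curFile)` returns an empty list if the file contains no SUB bytes; otherwise each entry points at a line worth inspecting.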

The code that processes the file is as follows:

import csv

with open(curFile, 'r') as csvFile :
    fileReader = csv.reader(csvFile, delimiter=',')

    header = fileReader.next()

    ## Get the index of the userID_str column (once, from the header)

    userID_str_index = header.index("userID_str")

    rowCount = 0

    for row in fileReader :

        ## Obtain the userID_str of the current row

        userID_str = row[userID_str_index]

        listOfIDs[candidateCount].append(userID_str)

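All I removed were a few if/then checks that I used to find out which row causes the stop.

All this should do is go through all 300,000 rows and store the current row's userID in a list of lists.

Edit1: Here is the last row Python processed, so you can see how Python itself prints it rather than how it appears in my CSV file.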

Processing Row : 2970
['@HillaryClinton', '1224300000', 'NA', '\x80\xb3\xda\x18', 'gingasenki', '5', '1', '', '0', '0', 'Sun Nov 01 02:17:43 +0000 2015', '.@HillaryClinton No TPP for Japan nor the World! #\xfd']

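EDIT2: To clarify, this does not raise an error. It simply stops processing the file, as if it had reached the end! I had it print every row along with the number of rows processed so far. According to that output, it reaches this row (the second one printed above), #2970, and then goes no further. I have confirmed that there are 298,000+ more rows to go.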
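A likely explanation, assuming this is Python 2 on Windows (the `fileReader.next()` call and the use of Notepad++ suggest so): a file opened in text mode (`'r'`) is read through the C runtime, which treats byte 0x1A (the "SUB" / Ctrl-Z character) as end of file. Opening the file in binary mode avoids that, and the Python 2 `csv` module expects binary mode in any case. The sketch below is not a confirmed fix; the helper names `open_csv_for_reading` and `count_rows` are hypothetical, and the `latin-1` encoding on the Python 3 branch is an assumption chosen only because it can decode every byte value.

```python
import csv
import sys

def open_csv_for_reading(path):
    """Open a CSV file so that a 0x1A byte is not taken as end of file."""
    if sys.version_info[0] < 3:
        # Python 2: the csv module expects files opened in binary mode.
        return open(path, 'rb')
    # Python 3: csv expects text mode with newline=''; latin-1 is an
    # assumed encoding, chosen because it can decode any byte value.
    return open(path, 'r', newline='', encoding='latin-1')

def count_rows(path):
    """Count the data rows (excluding the header) that csv.reader yields."""
    with open_csv_for_reading(path) as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        next(reader)  # skip the header; works on Python 2 and 3
        return sum(1 for _ in reader)
```

If the SUB byte is indeed the cause, `count_rows(curFile)` should report the full row count instead of stopping at row 2970.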

CSV reader not reading entire file (Python)
