Question

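I am trying to use Python to read and analyze a large CSV file (11.5 GB), and then build some visuals around it in Power BI. However, every time I run a command, or even change the data frame in Power BI, each change takes roughly 20-30 minutes.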

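One of the column headers is DeviceID. I would like to split the large CSV into multiple CSV files so that each file holds the data belonging to one unique DeviceID value.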

Currently, the data frame in the single Full.csv file looks like this:

DeviceID    AreaName     Longitude    Latitude
12311       Dubai        55.55431     25.45631
12311       Dubai        55.55432     25.45634
12311       Dubai        55.55433     25.45637
12311       Dubai        55.55431     25.45621
12309       Dubai        55.55427     25.45627
12309       Dubai        55.55436     25.45655
12412       Dubai        55.55441     25.45657
12412       Dubai        55.55442     25.45656

After running the code, the single Full.csv file should produce 3 CSV files: 12311.csv, 12309.csv, and 12412.csv, looking like this:

DeviceID    AreaName     Longitude    Latitude
12311       Dubai        55.55431     25.45631
12311       Dubai        55.55432     25.45634
12311       Dubai        55.55433     25.45637
12311       Dubai        55.55431     25.45621

AND

DeviceID    AreaName     Longitude    Latitude
12309       Dubai        55.55427     25.45627
12309       Dubai        55.55436     25.45655

AND

DeviceID    AreaName     Longitude    Latitude
12412       Dubai        55.55441     25.45657
12412       Dubai        55.55442     25.45656

I have read that the best way to handle large files in Python is to use the pandasql module. Can I achieve the above using pandasql?

Thanks

Answer 1

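One of the column headers is DeviceID. I would like to split the large CSV into multiple CSV files so that each file holds the data belonging to one unique DeviceID value.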

I don't think this will speed up your processing in Power BI. Are you doing the calculations in Power Query or in Power BI itself?

But anyway, you can create a list of the unique DeviceID values:

import pandas as pd

df = pd.read_csv('Full.csv')
uniquelist = list(df['DeviceID'].unique())

Then split the data by this list and save each part to its own CSV file:

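for device_id in uniquelist:
    # Select the rows for this device and write them to <DeviceID>.csv
    subset = df.loc[df['DeviceID'] == device_id]
    subset.to_csv(f'{device_id}.csv', index=False)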
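
The loop above filters the whole frame once per device, which can get slow when there are many distinct IDs. A possible alternative, sketched here under the assumption that the frame already fits in memory, is to let `groupby` partition the rows in a single pass. The function name `split_by_device` and the `out_dir` parameter are made up for illustration.

```python
import os

import pandas as pd


def split_by_device(df, out_dir="."):
    """Write one CSV per DeviceID using a single groupby pass.

    Hypothetical helper: returns the list of file paths it wrote.
    """
    written = []
    for device_id, group in df.groupby("DeviceID"):
        path = os.path.join(out_dir, f"{device_id}.csv")
        group.to_csv(path, index=False)  # index=False keeps the original columns only
        written.append(path)
    return written
```

With the frame from the answer, `split_by_device(df)` would write the per-device files into the current directory. It does not solve the memory problem of loading an 11.5 GB file in one go.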

Answer 2

If Python is not mandatory, you can use Miller (https://github.com/johnkerl/miller).

Starting from

DeviceID,AreaName,Longitude,Latitude
12311,Dubai,55.55431,25.45631
12311,Dubai,55.55432,25.45634
12311,Dubai,55.55433,25.45637
12311,Dubai,55.55431,25.45621
12309,Dubai,55.55427,25.45627
12309,Dubai,55.55436,25.45655
12412,Dubai,55.55441,25.45657
12412,Dubai,55.55442,25.45656

and running

mlr --csv --from input.csv put -q 'tee > $DeviceID.".csv", $*'

you will get these three files

#12311.csv
DeviceID,AreaName,Longitude,Latitude
12311,Dubai,55.55431,25.45631
12311,Dubai,55.55432,25.45634
12311,Dubai,55.55433,25.45637
12311,Dubai,55.55431,25.45621

#12412.csv
DeviceID,AreaName,Longitude,Latitude
12412,Dubai,55.55441,25.45657
12412,Dubai,55.55442,25.45656

#12309.csv
DeviceID,AreaName,Longitude,Latitude
12309,Dubai,55.55427,25.45627
12309,Dubai,55.55436,25.45655

Answer 3

First, can you read it in chunks, or do you need the whole data frame? Reading in chunks would help a lot.

import pandas as pd

filename = 'Full.csv'
row_count = 1000
for chunk in pd.read_csv(filename, chunksize=row_count):
    print(chunk.head()) # process it
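
One way the chunked loop could be used for the split asked about in the question is to group each chunk by DeviceID and append the rows to that device's file, so the full CSV is never held in memory at once. This is only a sketch. The function name `split_csv_in_chunks`, its parameters, and the default chunk size are assumptions, not something from the question.

```python
import os

import pandas as pd


def split_csv_in_chunks(src_path, out_dir, key="DeviceID", chunksize=100_000):
    """Stream src_path in chunks and append each key's rows to <key>.csv.

    Hypothetical helper: returns the set of key values it encountered.
    """
    seen = set()  # keys whose output file already has a header row
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        for value, group in chunk.groupby(key):
            out_path = os.path.join(out_dir, f"{value}.csv")
            # Append to the file; write the header only the first time a key appears.
            group.to_csv(out_path, mode="a", index=False, header=value not in seen)
            seen.add(value)
    return seen
```

Because the files are opened in append mode, the output directory should be empty before a run. Otherwise rows from an earlier run would be kept and duplicated.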

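Have you considered loading the CSV into an SQL database? It would speed things up. You would be able to index columns, do basic aggregations via SQL, and pull the subsamples you need into Pandas with a simple pd.read_sql for more complex processing. You would also be able to do calculations faster with an SQL db. Second, how much RAM do you have?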
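
As a rough illustration of the database idea, the sketch below copies the CSV into a SQLite file chunk by chunk, indexes DeviceID, and reads one device back with pd.read_sql. SQLite is chosen only because it ships with Python. The table name `readings`, the index name, and both function names are made up for this example.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_csv_into_sqlite(csv_path, db_path, table="readings", chunksize=100_000):
    """Copy the CSV into a SQLite table chunk by chunk, then index DeviceID."""
    with closing(sqlite3.connect(db_path)) as conn:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table, conn, if_exists="append", index=False)
        # The index is what makes per-device lookups fast afterwards.
        conn.execute(f"CREATE INDEX IF NOT EXISTS idx_device ON {table} (DeviceID)")
        conn.commit()


def read_device(db_path, device_id, table="readings"):
    """Pull the rows for a single device back into a DataFrame."""
    query = f"SELECT * FROM {table} WHERE DeviceID = ?"
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql(query, conn, params=(int(device_id),))
```

After a one-time `load_csv_into_sqlite('Full.csv', 'full.db')`, a call such as `read_device('full.db', 12311)` would return only that device's rows, which could then be written out with `to_csv` if separate files are still wanted.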

Answer 4

import pandas as pd
full=pd.read_csv('path of the file')
f12311=full[full['DeviceID']==12311]
f12309=full[full['DeviceID']==12309]
f12412=full[full['DeviceID']==12412]
f12311.to_excel('path where to save the file')
f12309.to_excel('path where to save the file')
f12412.to_excel('path where to save the file')

Note: just make sure the dtype of the "DeviceID" column is "int64". If it is not an int, you can convert it with the following code:

full['DeviceID']=full['DeviceID'].astype('int64')

Read a large CSV and split it into smaller chunks
