Python pandas: extracting data from a file

Time: 2015-01-13 07:06:38

Tags: python pandas text-files

I have read all of the pandas documentation, but I think I need a practical example to understand it.

I have this .TXT file containing all of my SQL data.

  

    INSERT INTO jos_users VALUES('4065','lel lel','joel', 'chazaa@frame.com','d0c9f71c7bc8c9','Membre','0','0','2', '2013-01-31 17:15:29','2014-12-10 11:29:13','','{}');
    INSERT INTO jos_users VALUES('4066','jame lea','jamal', 'jamal.stan@frame.com','d0c9f71c7774c9','Membre','0','0','2', '2012-11-31 08:15:29','2012-12-10 12:29:13','','{}');

(about 17,000 lines). My .txt file does not contain any column names.

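What I want to achieve:

  1. Create the columns myself
  2. Rearrange the content by column (for example, I want to select column 1 and display it)

My code right now, which displays garbage: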

    import pandas as pd
    import matplotlib.pyplot as plt
    
    pd.set_option('display.mpl_style', 'default') 
    plt.rcParams['figure.figsize'] = (15, 5)
    
    
    df = pd.read_csv('2.txt', sep=',', na_values=['g'], error_bad_lines=False)
    
    print df
    

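2 answers:

Answer 0: (score: 0)

OK, here is a sample script I just knocked up. I am not saying this is the most efficient way to clean up a SQL script; ideally, if you have access to the original database, you should be able to export it as a CSV.

Anyway, the following opens the text file and removes the INSERT INTO prefix, the opening and closing parentheses, the quote characters (not necessary, but I prefer this style) and any extraneous spaces.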

    In [91]:

    import pandas as pd

    with open(r'c:\data\clean.csv', 'wt') as clean, open(r'c:\data\temp sql.txt', 'rt') as f:
        for line in f:
            l = line.strip()
            if not l:
                continue  # skip blank lines
            # drop the INSERT prefix, with or without a space before the parenthesis
            l = l.replace('INSERT INTO jos_users VALUES', '').lstrip(' (')
            # drop the closing parenthesis/semicolon and the quote characters
            l = l.replace(');', '').replace("'", '')
            # drop stray spaces around the commas
            l = ','.join(field.strip() for field in l.split(','))
            clean.write(l + '\n')
    # both files are closed automatically when the with block ends

    # read the file back in, there is no header so you need to specify this
    df = pd.read_csv(r'c:\data\clean.csv', header=None)
    df
    Out[91]:
         0         1      2                     3               4       5   6   \
    0  4065   lel lel   joel      chazaa@frame.com  d0c9f71c7bc8c9  Membre   0   
    1  4066  jame lea  jamal  jamal.stan@frame.com  d0c9f71c7774c9  Membre   0   

       7   8                    9                    10  11  12  
    0   0   2  2013-01-31 17:15:29  2014-12-10 11:29:13 NaN  {}  
    1   0   2  2012-11-31 08:15:29  2012-12-10 12:29:13 NaN  {}  
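
The question also asked how to create the columns and how to pick one of them, which the output above does not cover. Here is a minimal sketch of one way to do it. The column names are my own guesses for illustration (they are not given in the question), and two inline rows stand in for clean.csv so the snippet runs on its own:

```python
import io

import pandas as pd

# Two cleaned rows standing in for clean.csv (same shape as the output above).
sample = io.StringIO(
    "4065,lel lel,joel,chazaa@frame.com,d0c9f71c7bc8c9,Membre,0,0,2,"
    "2013-01-31 17:15:29,2014-12-10 11:29:13,,{}\n"
    "4066,jame lea,jamal,jamal.stan@frame.com,d0c9f71c7774c9,Membre,0,0,2,"
    "2012-11-31 08:15:29,2012-12-10 12:29:13,,{}\n"
)

# Hypothetical column names; replace them with whatever the table really uses.
columns = [
    "id", "name", "username", "email", "password", "usertype", "block",
    "send_email", "gid", "register_date", "last_visit_date", "activation",
    "params",
]

# names= assigns the labels while reading; header=None says the file has none.
df = pd.read_csv(sample, header=None, names=columns)

# Select one column by label, or by position with iloc.
names = df["name"]
second_column = df.iloc[:, 1]

# Reorder or subset the columns by passing a list of labels.
subset = df[["email", "username", "id"]]

print(names)
print(subset)
```

With the real file you would pass the path to clean.csv instead of `sample`. If the frame has already been read with `header=None`, assigning `df.columns = columns` afterwards should have the same effect.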

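Answer 1: (score: 0)

Edit: The following approach is much slower than writing the altered data to a file and then reading it into a DataFrame with read_csv(). For a 34,000-line file it takes about 23 minutes vs. ~3 seconds.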

    import pandas as pd
    import numpy as np
    import re

    pd.set_option('display.width', 1000)
    # Pre-allocate all the space needed by your DataFrame:
    df = pd.DataFrame(index=np.arange(18000), columns=np.arange(13))

    pattern = r""" #Find all single quoted sequences:
        '          #Match a single quote, followed by...
        (          #(start a capture group)
          [^']*    #not a single quote, 0 or more times, followed by...
        )          #(end the capture group)
        '          #a single quote
    """

    regex = re.compile(pattern, flags=re.X)

    with open('data.txt') as f:
        for i, line in enumerate(f):
            # findall() returns a list of all the strings that matched the pattern's capture group
            data = regex.findall(line)

            if data:
                df.iloc[i] = data  # insert data at row i

    print(df)

    --output:--
             0         1      2                     3               4       5    6    7    8                    9                    10   11   12
    0      4065   lel lel   joel      chazaa@frame.com  d0c9f71c7bc8c9  Membre    0    0    2  2013-01-31 17:15:29  2014-12-10 11:29:13        {}
    1      4066  jame lea  jamal  jamal.stan@frame.com  d0c9f71c7774c9  Membre    0    0    2  2012-11-31 08:15:29  2012-12-10 12:29:13        {}
    ...
    ...
    1798    NaN       NaN    NaN                   NaN             NaN     NaN  NaN  NaN  NaN                  NaN                  NaN  NaN  NaN
    1799    NaN       NaN    NaN                   NaN             NaN     NaN  NaN  NaN  NaN                  NaN                  NaN  NaN  NaN

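re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

https://docs.python.org/2/library/re.html#re.findall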
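
To make the findall() behaviour concrete, here is a small sketch that applies a one-line equivalent of the pattern to the first INSERT statement from the question. It assumes no value contains an escaped single quote; this simple pattern would split such a value in the wrong place.

```python
import re

# Compact form of the verbose pattern: capture everything between two single quotes.
regex = re.compile(r"'([^']*)'")

line = (
    "INSERT INTO jos_users VALUES('4065','lel lel','joel', "
    "'chazaa@frame.com','d0c9f71c7bc8c9','Membre','0','0','2', "
    "'2013-01-31 17:15:29','2014-12-10 11:29:13','','{}');"
)

# With exactly one capture group, findall() returns a flat list of the group contents.
fields = regex.findall(line)

print(len(fields))   # 13
print(fields[:4])    # ['4065', 'lel lel', 'joel', 'chazaa@frame.com']
```

The empty `''` value is kept as an empty string, so each line still yields 13 fields.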

Writing the altered data to a file, and then reading it with read_csv():

    import pandas as pd
    import numpy as np
    import re

    pd.set_option('display.width', 1000)

    pattern = r"""
        '          #Match a single quote, followed by...
        (          #start a capture group.
          [^']*    #not a quote, 0 or more times, followed by...
        )          #end capture group.
        '          #a single quote
    """

    regex = re.compile(pattern, flags=re.X)

    # data2.txt: the two insert statements in the OP, repeated 17,000 times
    with open('data2.txt') as fin, open('data.csv', 'w') as fout:
        for line in fin:
            data = regex.findall(line)

            if data:
                # Python 3 print(); on Python 2 add: from __future__ import print_function
                print(*data, file=fout, sep=',')

    df = pd.read_csv(
        'data.csv',
        sep=',',
        header=None,
        names=np.arange(13),  # column names: 0 - 12
    )

    print(df)
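
If the intermediate CSV file is not wanted, another option is to collect the parsed rows in a plain Python list and build the DataFrame once at the end. This avoids the row-by-row `df.iloc[i] = data` assignments used in the slow version. The following is only a rough sketch: it assumes every INSERT line carries exactly 13 quoted fields, the helper name `parse_insert_lines` and the inline sample are made up for illustration, and I have not timed it against the read_csv() route.

```python
import io
import re

import pandas as pd

FIELD = re.compile(r"'([^']*)'")


def parse_insert_lines(lines, n_fields=13):
    """Return a DataFrame with one row per line that has n_fields quoted values."""
    rows = []
    for line in lines:
        data = FIELD.findall(line)
        if len(data) == n_fields:  # skip blank or malformed lines
            rows.append(data)
    # Build the frame in one go instead of filling it row by row.
    return pd.DataFrame(rows, columns=range(n_fields))


# Inline stand-in for the real text file (any iterable of lines works).
sample = io.StringIO(
    "INSERT INTO jos_users VALUES('4065','lel lel','joel', 'chazaa@frame.com',"
    "'d0c9f71c7bc8c9','Membre','0','0','2', '2013-01-31 17:15:29',"
    "'2014-12-10 11:29:13','','{}');\n"
    "\n"
    "INSERT INTO jos_users VALUES('4066','jame lea','jamal', 'jamal.stan@frame.com',"
    "'d0c9f71c7774c9','Membre','0','0','2', '2012-11-31 08:15:29',"
    "'2012-12-10 12:29:13','','{}');\n"
)

df = parse_insert_lines(sample)
print(df.shape)
```

Unlike read_csv(), this leaves every value as a string, so numeric columns would need an explicit conversion afterwards (for example with `pd.to_numeric`).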