I've read through the Pandas documentation, but I think I need a practical example to really understand it.
I have this .TXT file containing all my sql data:
INSERT INTO jos_users VALUES('4065','lel lel','joel', 'chazaa @ frame.com','d0c9f71c7bc8c9','Membre','0','0','2', '2013-01-31 17:15:29','2014-12-10 11:29:13','','{}');
INSERT INTO jos_users VALUES('4066','jame lea','jamal', 'jamal.stan@frame.com','d0c9f71c7774c9','Membre','0','0','2', '2012-11-31 08:15:29','2012-12-10 12:29:13','','{}');
(about 17,000 rows of these), and there are no column names anywhere in my .txt file.
What I'm trying to achieve: get each INSERT statement into a pandas DataFrame as one row, with one column per value.
My code right now, which just prints garbage:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)
df = pd.read_csv('2.txt', sep=',', na_values=['g'], error_bad_lines=False)
print df
Answer 0 (score: 0)
OK, here is a sample script I just knocked up. I'm not saying this is the most efficient way to clean up a SQL script; ideally, if you have access to the original database, you should be able to export it as a csv directly.
Anyway, what the following does is open the text file and remove the INSERT INTO statement, the opening and closing parentheses, the quote characters (not strictly necessary, but I prefer that style) and any extraneous whitespace.
In [91]:
with open(r'c:\data\clean.csv', 'wt') as clean:
    with open(r'c:\data\temp sql.txt', 'rt') as f:
        for line in f:
            if len(line) > 0:
                l = line.replace('INSERT INTO jos_users VALUES (', '')
                l = l.replace(", '", ",'")
                l = l.replace("'", '')
                l = l.replace(');', '')
                clean.write(l)
# not strictly necessary, the with blocks above already close both files
clean.close()
f.close()
# read the file back in, there is no header so you need to specify this
df = pd.read_csv(r'c:\data\clean.csv', header=None)
df
Out[91]:
0 1 2 3 4 5 6 \
0 4065 lel lel joel chazaa@frame.com d0c9f71c7bc8c9 Membre 0
1 4066 jame lea jamal jamal.stan@frame.com d0c9f71c7774c9 Membre 0
7 8 9 10 11 12
0 0 2 2013-01-31 17:15:29 2014-12-10 11:29:13 NaN {}
1 0 2 2012-11-31 08:15:29 2012-12-10 12:29:13 NaN {}
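The two timestamp columns (9 and 10 in the output above) come back as plain strings. A minimal follow-up sketch, assuming a reasonably recent pandas, to turn them into real datetimes; errors='coerce' is used because the sample data contains an impossible date ('2012-11-31'), which then becomes NaT instead of raising an error:
import pandas as pd  #hypothetical follow-up to the session above
df[9] = pd.to_datetime(df[9], errors='coerce')    #'2013-01-31 17:15:29' -> Timestamp
df[10] = pd.to_datetime(df[10], errors='coerce')  #'2012-11-31 08:15:29' is invalid -> NaT
print(df.dtypes)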
Answer 1 (score: 0)
EDIT: The approach below is much slower than writing the changed data out to a file and then reading it into a DataFrame with read_csv(). For a 34,000-line file it takes about 23 minutes versus ~3 seconds.
import pandas as pd
import numpy as np
import re
pd.set_option('display.width', 1000)
#Pre-allocate all the space needed by your DataFrame:
df = pd.DataFrame(index=np.arange(18000), columns=np.arange(13))
pattern = r""" #Find all single quoted sequences:
' #Match a single quote, followed by...
( #(start a capture group)
[^']* #not a single quote, 0 or more times, followed by...
) #(end the capture group)
' #a single quote
"""
regex = re.compile(pattern, flags=re.X)
f = open('data.txt')
for i, line in enumerate(f):
    data = re.findall(regex, line)  #findall() returns a list of all the strings that matched the pattern's capture group
    if data:
        df.iloc[i] = data  #insert data at row i
f.close()
print(df)
--output:--
0 1 2 3 4 5 6 7 8 9 10 11 12
0 4065 lel lel joel chazaa@frame.com d0c9f71c7bc8c9 Membre 0 0 2 2013-01-31 17:15:29 2014-12-10 11:29:13 {}
1 4066 jame lea jamal jamal.stan@frame.com d0c9f71c7774c9 Membre 0 0 2 2012-11-31 08:15:29 2012-12-10 12:29:13 {}
...
...
1798 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1799 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
https://docs.python.org/2/library/re.html#re.findall
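As a quick illustration (a minimal sketch using a shortened version of one of the INSERT lines from the question), the single capture group means findall() returns only the text between the quotes:
import re
pattern = r"'([^']*)'"  #compact form of the verbose pattern above
line = "INSERT INTO jos_users VALUES('4066','jame lea','jamal','jamal.stan@frame.com');"
print(re.findall(pattern, line))
#['4066', 'jame lea', 'jamal', 'jamal.stan@frame.com']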
Writing the changed data to a file, then reading it back in with read_csv():
import pandas as pd
import numpy as np
import re
import time
pd.set_option('display.width', 1000)
pattern = r"""
    '       #Match a single quote, followed by...
    (       #start a capture group.
    [^']*   #not a quote, 0 or more times, followed by...
    )       #end capture group.
    '       #a single quote
    """
regex = re.compile(pattern, flags=re.X)
fin = open('data2.txt')  #The two insert statements in the OP, repeated 17,000 times
fout = open('data.csv', 'w')
results = {}
for line in fin:
    data = re.findall(regex, line)
    if data:
        print(*data, file=fout, sep=',')  #write one comma-separated row per INSERT statement
fin.close()
fout.close()
df = pd.read_csv(
    'data.csv',
    sep=',',
    header=None,
    names=np.arange(13),  #column names: 0 - 12
)
print(df)
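The unused import time above suggests the ~23 minutes vs ~3 seconds figures in the edit were taken with a simple wall-clock measurement. A minimal sketch of how such a comparison could be made; the two function names passed in are hypothetical placeholders for the approaches shown above:
import time

def time_it(label, func):
    #run func once and report wall-clock time; func stands in for either
    #the row-by-row df.iloc approach or the write-then-read_csv approach
    start = time.time()
    func()
    print('{}: {:.1f} seconds'.format(label, time.time() - start))

#time_it('iloc approach', load_with_iloc)          #hypothetical wrapper functions
#time_it('read_csv approach', load_with_read_csv)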