Python 2.7: unable to use DictWriter after using regex

Date: 2016-02-04 06:56:11

Tags: python regex

I have tried many approaches based on ideas from Stack Overflow:

How to write header row with csv.DictWriter?

Writing a Python list of lists to a csv file

csv.DictWriter -- TypeError: __init__() takes at least 3 arguments (4 given)

Python: tuple indices must be integers, not str when selecting from mysql table

https://docs.python.org/2/library/csv.html

python csv write only certain fieldnames, not all

Python 2.6 text processing and

Why is DictWriter not Writing all rows in my Dictreader instance?

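I tried mapping the reader and writer fieldnames, as well as special header parameters.

I built a second-level test from some great multi-column SO posts:

The code is as follows: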

import csv
import re
t = re.compile('<\*(.*?)\*>')
headers = ['a', 'b', 'd', 'g']
with open('in2.csv', 'rb') as csvfile:
    with open('out2.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['d'] = re.findall(t, row['d'])
            print(row['a'], row['b'], row['d'], row['g'])
            writer.writerow(row)

The input data is:

a, b, c, d, e, f, g, h 

<* number 1 *>, <* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *> 

<* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>, <* number 9 *> 

The output data is:

['a', 'b', 'd', 'g' ] 

('<* number 1 *>', '<* number 2 *>', ' number 4 ', <* number 7 *>) 

('<* number 2 *>', '<* number 3 *>', ' number 5 ', <* number 8 *>) 

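Exactly as required.

But when I use a rougher data set that contains whitespace, double quotes, and a mix of upper- and lowercase letters, printing works at the row level, but writing does not fully work.

Previously I was able (I know I am in epic-fail mode here) to actually write one row of the challenging data, but not, in that instance, a header and multiple rows. Pretty lame that I cannot get past this hurdle with all the talented posts I have read.

All four columns fail, with either a KeyError or "TypeError: tuple indices must be integers, not str".

I obviously do not understand what Python needs in order to accomplish this.

At a high level: read in a text file with seven observations/columns. Write out only four of the columns; apply a regular expression to one column. Make sure each newly formed row is written out, not the original row.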
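The steps described above can be sketched roughly as follows. This is a minimal illustration, not the asker's script: it is written for Python 3 and uses in-memory `io.StringIO` objects in place of files (in Python 2.7 the files would be opened in 'rb'/'wb' mode as in the code above). The function name `filter_and_extract` and the sample row are hypothetical.

```python
import csv
import io
import re

TAG = re.compile(r'<\*(.*?)\*>')  # captures the text between <* and *>

def filter_and_extract(infile, outfile, keep, regex_column):
    """Copy only the `keep` columns, replacing `regex_column` with its tagged text."""
    reader = csv.DictReader(infile)
    # extrasaction='ignore' silently drops every column not listed in `keep`.
    writer = csv.DictWriter(outfile, fieldnames=keep, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        # findall returns a list; join it so the cell holds a single string.
        row[regex_column] = ':'.join(TAG.findall(row[regex_column]))
        writer.writerow(row)

# Hypothetical seven-column sample, mirroring the small test above.
sample = io.StringIO(
    'a,b,c,d,e,f,g\n'
    '<* number 1 *>,<* number 2 *>,<* number 3 *>,<* number 4 *>,x,y,<* number 7 *>\n'
)
out = io.StringIO()
filter_and_extract(sample, out, ['a', 'b', 'd', 'g'], 'd')
result = out.getvalue()
```

Because the row dictionary is updated in place before `writerow` is called, the written row is the newly formed one rather than the original.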

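I may need something like a friendlier global temporary table to read a row, update the row, and then write the row to the file.

Maybe I am asking too much of Python's architecture to coordinate DictReader and DictWriter to read the data, filter down to four columns, update the fourth column with a regular expression, and then write out the file with the updated four-tuples.

At this point I do not have time to investigate parsers. I would like to go into more detail eventually, since parsers look handy with each Python release (2.7 now, 3.x later).

Again, apologies for the complexity of the approach and my lack of understanding of Python fundamentals. In R terms, my shortcoming would be akin to understanding coding at the S4 level, not just the S3 level.

Here is data closer to what fails, sorry - I needed to show how the header is set up, how the file rows are formatted with individual double quotes plus quotes around the whole row, and how the dates are formatted but not quoted: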

    stuff_type|stuff_date|stuff_text
""cool stuff"|01-25-2015|""the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star"""
""cool stuff"|05-13-2014|""the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star"""
""great big stuff"|12-7-2014|"the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star"""
""nice stuff"|2-22-2013|""the text stuff <*to test a fourth ,*> to find a way to extract all text that is <*included in doubly special tags*> less than star and greater than star"""

stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star
cool stuff,5/13/2014,the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star
great big stuff,12/7/2014,the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star
nice stuff,2/22/2013,the text stuff <*to test a fourth *> to find a way to extract all text that is <*included in really special tags*> less or greater than star

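I was going to retest, but a Spyder update crashed my Python console this morning. Ugghh. With vanilla Python, the test data above fails with the following code... it never reaches the write step... it cannot even print here... it may need csv.QUOTE_NONE in the dialect.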

import csv
import re 
t = re.compile('<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
with open('C:/Temp/in3.csv', 'rb') as csvfile:
    with open('C:/Temp/out3.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = re.findall(t, row['stuff_text'])
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)

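Error:

Cannot get the snipping-tool picture in here.... sorry

KeyError: 'stuff_text'

OK: it may be in the quoting and separation of the columns. The data above without quotes prints without a KeyError and now writes to the file correctly. I may have to clean the quote characters out of the file first, before applying the regex. Any ideas would be appreciated.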
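A small sketch of how the column separator alone can produce this kind of KeyError. It assumes Python 3 and a shortened, hypothetical two-line sample. With the default comma dialect, a pipe-separated header is read as one long field name, so 'stuff_text' is never a key.

```python
import csv
import io

# Hypothetical pipe-separated sample with the same header as the failing file.
data = 'stuff_type|stuff_date|stuff_text\ncool stuff|01-25-2015|some text\n'

# The default dialect splits on commas, so the whole header is one field name.
comma_reader = csv.DictReader(io.StringIO(data))
comma_fields = comma_reader.fieldnames

# Passing the pipe delimiter yields the three expected keys.
pipe_reader = csv.DictReader(io.StringIO(data), delimiter='|')
pipe_fields = pipe_reader.fieldnames
first_row = next(pipe_reader)
```

Printing `reader.fieldnames` before the loop is a quick way to check which keys the reader actually sees.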

Good question @Andrea Corbellini

If I manually remove the quotes, the code above produces the following output:

stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,"['to test', 'included in special tags']"
cool stuff,5/13/2014,"['to test a second', 'included in extra special tags']"
great big stuff,12/7/2014,"['to test a third', 'included in very special tags']"
nice stuff,2/22/2013,"['to test a fourth ', 'included in really special tags']"

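This is the output I want. So, thank you for the "lazy" question; I should have included this second output as a follow-up.

Again, without removing the multiple sets of quotes, I get KeyError: 'stuff_type'. I am sorry, I tried to insert an image from a screenshot of Python with the error, but have not figured out how to do that on SO. I used the image section above, but that seems to point to a file that might be uploaded to SO? Not inserted?

With @monkut's excellent input on using ":".join, things are literally getting better.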

{['stuff_type', 'stuff_date', 'stuff_text']
('cool stuff', '1/25/2015', 'to test:included in special tags')
('cool stuff', '5/13/2014', 'to test a second:included in extra special tags')
('great big stuff', '12/7/2014', 'to test a third:included in very special tags')
('nice stuff', '2/22/2013', 'to test a fourth :included in really special tags')}

import csv
import re 
t = re.compile('<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
with open('C:/Python/in3.txt', 'rb') as csvfile:
    with open('C:/Python/out5.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile, dialect='piper')
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)

The error traceback is as follows:

runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
['stuff_type', 'stuff_date', 'stuff_text']
('""cool stuff"', '01-25-2015', 'to test')
Traceback (most recent call last):

  File "<ipython-input-3-832ce30e0de3>", line 1, in <module>
    runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')

  File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py", line 20, in <module>
    row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))

  File "C:\Users\Methody\Anaconda\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)

TypeError: expected string or buffer

I will find a more robust way to clean up and remove the quotes before running the regex findall. Possibly something like row = string.remove(quotes and whitespace).
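One possible way to do that cleanup, as a sketch under stated assumptions rather than a confirmed fix. It is written for Python 3, the helper name `clean` and the sample rows are hypothetical, and it assumes the TypeError comes from a row with fewer fields than the header, for which DictReader fills the missing values with None.

```python
import csv
import io
import re

TAG = re.compile(r'<\*(.*?)\*>')

def clean(value):
    """Return value without surrounding quotes/whitespace; None becomes ''."""
    # DictReader fills missing fields with None (its default restval),
    # and re.findall raises a TypeError when handed None.
    return (value or '').strip(' "')

# Hypothetical sample: one quoted row like the real data, one short row.
raw = (
    'stuff_type|stuff_date|stuff_text\n'
    '""cool stuff"|01-25-2015|""the text <*to test*> and <*more tags*> here"""\n'
    'short row\n'
)
reader = csv.DictReader(io.StringIO(raw), delimiter='|', quoting=csv.QUOTE_NONE)
rows = []
for row in reader:
    # Strip stray quotes from every field before applying the regex.
    cleaned = {key: clean(val) for key, val in row.items()}
    cleaned['stuff_text'] = ':'.join(TAG.findall(cleaned['stuff_text']))
    rows.append(cleaned)
```

With `csv.QUOTE_NONE` the reader treats quote characters as ordinary text, so they have to be stripped by hand as done here.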

1 Answer:

Answer 0: (score: 1)

I think findall returns a list, and since DictWriter wants a single string value, that may be what is messing things up.

row['d'] = re.findall(t, row['d'])

You can use .join to turn the result into a single string value:

row['d'] = ":".join(re.findall(t, row['d']))

Here the values are joined with ":". However, as you mentioned, you may need to clean up the values a bit more...
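To make the difference concrete, here is a small sketch (Python 3, in-memory buffer; the helper `write_cell` is hypothetical). The csv writer calls str() on non-string values, so a list ends up in the file as its quoted repr, while the joined string is written as plain text.

```python
import csv
import io
import re

t = re.compile(r'<\*(.*?)\*>')
text = 'the text stuff <*to test*> to find <*included in special tags*> text'

def write_cell(value):
    """Write one row with a single column 'd' and return the CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['d'])
    writer.writerow({'d': value})
    return buf.getvalue().strip()

# The list is stringified and quoted because its repr contains a comma.
as_list = write_cell(t.findall(text))
# The joined value is written as an ordinary unquoted field.
as_joined = write_cell(':'.join(t.findall(text)))
```

The first form matches the quoted `"['to test', ...]"` cells shown in the question's output; the second gives one clean value per cell.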

You mentioned having a problem using the compiled regex object. Here is an example of how to use a compiled regex object:

import re
t = re.compile('<\*(.*?)\*>')
text= ('''cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that'''
       ''' is <*included in special tags*> less than star and greater than star''')
result = t.findall(text)

This should return the following in result:

['to test', 'included in special tags']