Python noob here:
I have a text file that looks like this:
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}
etc.
I want to find the unique character names and count the number of occurrences of each. As shown above: everything on each line up to the string 'Character': should be ignored, so that only the character name is considered.
So far, after trying many other approaches (including regex), I have this code, but it does not produce the desired result (it prints and counts everything):
import re
from collections import Counter
import tkFileDialog
filename = tkFileDialog.askopenfilename()
f = open(filename, "r")
lines = f.readlines()
f.close()
cnt = Counter()
for line in lines:
    cnt[line.split("'Character':", 2)] +=1
print cnt
print sum(cnt.values())
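A likely reason this counts everything instead of just the names: str.split returns a list, which doesn't work as a Counter key the way a single string would, and the piece after 'Character': still carries quotes and the closing brace. A minimal self-contained sketch of a fix that keeps the split approach (the sample lines below stand in for the real file):

```python
from collections import Counter

# sample lines standing in for the real file contents
lines = [
    "{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}",
    "{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}",
    "{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}",
]

cnt = Counter()
for line in lines:
    # split returns a list; take the piece after the key,
    # then strip the surrounding space, quote and brace characters
    tail = line.split("'Character':", 1)[1]
    cnt[tail.strip(" '}")] += 1

print(cnt)  # Counter({'Jonathan': 2, 'Joyce': 1})
```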
The ideal output would look like this:
Jonathan: 3
Joyce: 2
Will: 1
Nancy: 1
Steve: 1
Any help or hints would be greatly appreciated!
EDIT: The text file above was generated from an .xmi file containing hard-to-read information. As I mentioned in a comment on one of the answers below: this was my first attempt at visually representing the combined information I need. Apart from a text file, I'm not sure whether there is a better way to represent this kind of data. Should I create a new .xmi file for it?
So, as requested, here is the code that generates the text file from the .xmi file:
# coding: utf-8
import xml.etree.cElementTree as ET
from xml.etree.ElementTree import (Element, ElementTree, SubElement, Comment, tostring)
ET.register_namespace("pos","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore")
ET.register_namespace("tcas","http:///uima/tcas.ecore")
ET.register_namespace("xmi","http://www.omg.org/XMI")
ET.register_namespace("cas","http:///uima/cas.ecore")
ET.register_namespace("tweet","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore")
ET.register_namespace("morph","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore")
ET.register_namespace("dependency","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore")
ET.register_namespace("type5","http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore")
ET.register_namespace("type6","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore")
ET.register_namespace("type2","http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore")
ET.register_namespace("type3","http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore")
ET.register_namespace("type4","http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore")
ET.register_namespace("type","http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore")
ET.register_namespace("constituent","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore")
ET.register_namespace("chunk","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore")
ET.register_namespace("custom","http:///webanno/custom.ecore")
def sofa(annotation):
    f = open(annotation)
    tree = ET.ElementTree(file=f)
    root = tree.getroot()
    node = root.find("{http:///uima/cas.ecore}Sofa")  # we remove cas:View
    return node.attrib['sofaString']
path ="valhalla.xmi"
with open(path, 'r', encoding="utf-8") as filename:
    tree = ET.ElementTree(file=filename)
    root = tree.getroot()
ns = {'emospan': 'http:///webanno/custom.ecore',
      'id': 'http://www.omg.org/XMI',
      'relspan': 'http:///webanno/custom.ecore',
      'sentence': 'http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore',
      'annotator': 'http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore'}
my_id = '{http://www.omg.org/XMI}id'
top = Element('corpus', encoding="utf-8")
text = sofa(path).replace("\n"," ")
def stimcount():
    with open('results.txt', 'w') as f:
        for rel_node in root.findall("emospan:CharacterRelation", ns):
            if rel_node.attrib['Relation'] == "Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion", ns):
                    if span_node.attrib[my_id] == source:
                        print(span_node.attrib['Emotion'])
                    if span_node.attrib[my_id] == target:
                        print(span_node.attrib)
                        print(span_node.attrib, file=f)
Answer 0 (score: 2)
Here is a regex solution:
file_stuff = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""
import re
from collections import Counter
r = re.compile(r"(?<='Character':\s')\w+(?=')")
# EDIT: use r"(?<='Character':\s')(.+)(?=')" to match names with quotes...
# or other characters, as pointed out in comments.
print(Counter(r.findall(file_stuff)))
# Counter({'Jonathan': 3, 'Joyce': 2, 'Will': 1, 'Nancy': 1, 'Steve': 1})
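Applied to a file rather than an in-memory string, the same pattern can be used line by line, so the whole file never has to be loaded at once (the file name and sample contents below are placeholders):

```python
import re
from collections import Counter

# write a small sample file so the sketch is self-contained;
# in practice 'sample.txt' would be the existing results file
sample = """{'id': '18836', 'Character': 'Jonathan'}
{'id': '18828', 'Character': 'Joyce'}
{'id': '19012', 'Character': 'Jonathan'}"""
with open('sample.txt', 'w') as f:
    f.write(sample)

r = re.compile(r"(?<='Character':\s')\w+(?=')")

cnt = Counter()
with open('sample.txt') as f:
    for line in f:          # stream the file line by line
        cnt.update(r.findall(line))

print(cnt)  # Counter({'Jonathan': 2, 'Joyce': 1})
```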
Answer 1 (score: 0)
Use the ast and collections modules.
For example:
import ast
from collections import defaultdict

d = defaultdict(int)
with open(filename) as infile:
    for line in infile:
        val = ast.literal_eval(line)
        d[val["Character"]] += 1
print(d)
Output:
defaultdict(<type 'int'>, {'Will': 1, 'Steve': 1, 'Jonathan': 3, 'Nancy': 1, 'Joyce': 2})
Answer 2 (score: 0)
Your original text file is rather unfortunate, because it seems to contain string representations of Python dictionaries written out as text, one per line!
This is a very poor way to generate a text data file. You should change the code that generates it to produce a different format, such as a csv or json file, instead of naively writing string representations to a text file. If you use csv or json, there are already written and tested libraries to help you parse the content and extract each element easily.
If you still need to, you can use ast.literal_eval to actually evaluate the code on each line:
import ast
import collections
with open(filename) as infile:
    print(collections.Counter(ast.literal_eval(line)['Character'] for line in infile))
EDIT: Now that you have added the file-generation example, I suggest you use a different format, such as json:
import json

def stimcount():
    results = []
    for rel_node in root.findall("emospan:CharacterRelation", ns):
        if rel_node.attrib['Relation'] == "Stimulus":
            source = rel_node.attrib['Governor']
            target = rel_node.attrib['Dependent']
            for span_node in root.findall("emospan:CharacterEmotion", ns):
                if span_node.attrib[my_id] == source:
                    print(span_node.attrib['Emotion'])
                if span_node.attrib[my_id] == target:
                    print(span_node.attrib)
                    results.append(span_node.attrib)
    with open('results.txt', 'w') as f:
        json.dump(results, f)
Then the code to read the data back can be as simple as:
with open('results.txt') as f:
    results = json.load(f)
r = collections.Counter(d['Character'] for d in results)
for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))
Another option is to use the csv format. It lets you specify the list of interesting columns and ignore the rest:
import csv

def stimcount():
    with open('results.txt', 'w') as f:
        cf = csv.DictWriter(f, ['begin', 'end', 'Character'], extrasaction='ignore')
        cf.writeheader()
        for rel_node in root.findall("emospan:CharacterRelation", ns):
            if rel_node.attrib['Relation'] == "Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion", ns):
                    if span_node.attrib[my_id] == source:
                        print(span_node.attrib['Emotion'])
                    if span_node.attrib[my_id] == target:
                        print(span_node.attrib)
                        cf.writerow(span_node.attrib)
And then read it back easily:
with open('results.txt') as f:
    cf = csv.DictReader(f)
    r = collections.Counter(d['Character'] for d in cf)
for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))
Answer 3 (score: 0)
If you like, there is also a pandas solution...
txt = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""
import pandas as pd
from io import StringIO

# replace the StringIO stuff with your file path
df = pd.read_table(StringIO(txt), sep="'Character': '", header=None, usecols=[1], engine='python')
1
0 Jonathan'}
1 Jonathan'}
2 Joyce'}
3 Joyce'}
4 Will'}
5 Jonathan'}
6 Nancy'}
7 Steve'}
df = df[1].str.split('\'', expand=True)
0 1
0 Jonathan }
1 Jonathan }
2 Joyce }
3 Joyce }
4 Will }
5 Jonathan }
6 Nancy }
7 Steve }
df.groupby(0).count()
1
0
Jonathan 3
Joyce 2
Nancy 1
Steve 1
Will 1
The idea is to read the file as two columns, using 'Character': ' as the sep, and to import only the second column (usecols). Then split again on ' and keep the first part. The rest is an ordinary groupby / count.