How can I count the unique words that follow a specific string on each line of a text file?

Time: 2018-07-25 19:01:54

Tags: python regex file count unique

Python noob here:

I have a text file that looks like this:

{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'} 
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'} 
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'} 
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'} 
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'} 

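I want to find the unique character names and count how many times each one occurs. In other words: ignore everything on each line up to the string 'Character':, so that only the character name is considered.

After trying many other approaches (including regex), this is the code I have so far, but it does not give the desired result (it prints and counts everything):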

import re
from collections import Counter
import tkFileDialog

filename = tkFileDialog.askopenfilename()

f = open(filename, "r")

lines = f.readlines()

f.close()


cnt = Counter()

for line in lines:
    cnt[line.split("'Character':", 2)] +=1

print cnt
print sum(cnt.values())

The ideal output would be:

Jonathan: 3
Joyce: 2
Will: 1
Nancy: 1
Steve: 1

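Any help or hints would be greatly appreciated!

EDIT: The text file above was generated from an .xmi file that contains information that is hard to read. As I mentioned in a comment on one of the answers below, this is my first attempt at presenting the combined information I need in a readable form. I am not sure whether there is a better way to represent this kind of data than a text file. Should I create a new .xmi file for it?

So, as requested, here is the code that generates the text file from the .xmi file: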

# coding: utf-8

# In[ ]:

import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9
from xml.etree.ElementTree import (Element, ElementTree, SubElement, Comment, tostring)

ET.register_namespace("pos","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore")
ET.register_namespace("tcas","http:///uima/tcas.ecore")
ET.register_namespace("xmi","http://www.omg.org/XMI")
ET.register_namespace("cas","http:///uima/cas.ecore")
ET.register_namespace("tweet","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore")
ET.register_namespace("morph","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore")
ET.register_namespace("dependency","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore")
ET.register_namespace("type5","http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore")
ET.register_namespace("type6","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore")
ET.register_namespace("type2","http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore")
ET.register_namespace("type3","http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore")
ET.register_namespace("type4","http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore")
ET.register_namespace("type","http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore")
ET.register_namespace("constituent","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore")
ET.register_namespace("chunk","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore")
ET.register_namespace("custom","http:///webanno/custom.ecore")

def sofa(annotation):
    f = open(annotation)
    tree = ET.ElementTree(file=f)
    root = tree.getroot()

    node = root.find("{http:///uima/cas.ecore}Sofa") # we remove cas:View
    return node.attrib['sofaString']

path ="valhalla.xmi"
with open(path, 'r', encoding="utf-8") as filename:
    tree = ET.ElementTree(file=filename)
    root = tree.getroot()

ns = {'emospan': 'http:///webanno/custom.ecore', 
      'id':'http://www.omg.org/XMI',
      'relspan': 'http:///webanno/custom.ecore',
      'sentence': 'http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore',
      'annotator': "http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore"}
my_id = '{http://www.omg.org/XMI}id'


top = Element('corpus', encoding="utf-8") 
text = sofa(path).replace("\n"," ")

def stimcount():
    with open('results.txt', 'w') as f:
        for rel_node in root.findall("emospan:CharacterRelation",ns):
            if rel_node.attrib['Relation']=="Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion",ns):
                    if span_node.attrib[my_id]==source:

                        print(span_node.attrib['Emotion'])

                    if span_node.attrib[my_id]==target:
                        print(span_node.attrib)
                        print(span_node.attrib, file=f)

4 Answers:

Answer 0: (score: 2)

Here is a regex solution:

file_stuff = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""

import re
from collections import Counter

# Raw strings keep the regex escapes (\s, \w) from being read as string escapes.
r = re.compile(r"(?<='Character':\s')\w+(?=')")
# EDIT: use r"(?<='Character':\s')(.+)(?=')" to match names that contain quotes
# or other characters, as pointed out in the comments.
print(Counter(r.findall(file_stuff)))
# Counter({'Jonathan': 3, 'Joyce': 2, 'Will': 1, 'Nancy': 1, 'Steve': 1})
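
The same idea can be applied line by line and printed in the format the question asks for. This is only a minimal sketch: the helper name `count_characters` and the capturing-group pattern are assumptions made for illustration rather than part of the answer above, and the sample lines are copied from the question.

```python
import re
from collections import Counter

# Capture whatever sits between the quotes that follow 'Character':
# (illustrative pattern, a variant of the lookbehind regex above).
NAME_RE = re.compile(r"'Character': '([^']+)'")


def count_characters(lines):
    """Count the 'Character' value found on each line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = NAME_RE.search(line)
        if match:  # skip blank or malformed lines
            counts[match.group(1)] += 1
    return counts


sample = [
    "{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'} ",
    "{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'} ",
    "{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'} ",
]

counts = count_characters(sample)
# most_common() yields (name, count) pairs, highest count first.
for name, n in counts.most_common():
    print("{}: {}".format(name, n))
```

With a real file, an open file object could be passed instead of the list, since iterating over a file yields its lines.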

Answer 1: (score: 0)

Use the ast and collections modules.

For example:

import ast
from collections import defaultdict

d = defaultdict(int)
with open(filename) as infile:
    for line in infile:
        val = ast.literal_eval(line)
        d[val["Character"]] += 1
print(d)

Output:

defaultdict(<type 'int'>, {'Will': 1, 'Steve': 1, 'Jonathan': 3, 'Nancy': 1, 'Joyce': 2})
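
One thing to watch with this approach is that `ast.literal_eval` raises an error on an empty line, so a file with blank lines or a trailing newline needs a guard. The following is a sketch under that assumption; the function name `count_from_lines` and the shortened sample records are hypothetical.

```python
import ast
from collections import Counter


def count_from_lines(lines):
    """Parse each non-blank line as a dict literal and tally 'Character'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # literal_eval cannot parse an empty string
            continue
        record = ast.literal_eval(line)
        counts[record["Character"]] += 1
    return counts


# Shortened, made-up records in the same shape as the question's file.
sample_lines = [
    "{'sofa': '12', 'Character': 'Jonathan'} \n",
    "\n",
    "{'sofa': '12', 'Character': 'Joyce'}\n",
    "{'sofa': '12', 'Character': 'Jonathan'}\n",
]

result = count_from_lines(sample_lines)
for name, n in result.most_common():
    print("{}: {}".format(name, n))
```

Using `Counter` instead of `defaultdict(int)` also gives `most_common()`, which sorts the names by frequency for printing.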

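Answer 2: (score: 0)

Your original text file is rather unfortunate, because it appears to contain the text representations of Python dictionaries, one per line!

This is a very poor way to generate a text data file. You should change the code that generates the file so that it produces another format, such as a csv or json file, instead of naively writing string representations to a text file. If you use csv or json, there are already written and tested libraries that help you parse the content and extract each element easily.

If you still need to, you can use ast.literal_eval to evaluate each line as a Python literal: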

import ast
import collections
with open(filename) as infile:
     print(collections.Counter(ast.literal_eval(line)['Character'] for line in infile))

EDIT: Now that you have added the code that generates the file, I suggest using another format, such as json:

import json

def stimcount():
    results = []
    for rel_node in root.findall("emospan:CharacterRelation",ns):
        if rel_node.attrib['Relation']=="Stimulus":
            source = rel_node.attrib['Governor']
            target = rel_node.attrib['Dependent']
            for span_node in root.findall("emospan:CharacterEmotion",ns):
                if span_node.attrib[my_id]==source:

                    print(span_node.attrib['Emotion'])

                if span_node.attrib[my_id]==target:
                    print(span_node.attrib)
                    results.append(span_node.attrib)

    with open('results.txt', 'w') as f:
        json.dump(results, f)

The code that reads the data back can then be as simple as:

import collections
import json

with open('results.txt') as f:
    results = json.load(f)
r = collections.Counter(d['Character'] for d in results)
for n, (ch, number) in enumerate(r.items()): 
    print('{} - {}, {}'.format(n, ch, number))
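
To see the write and read steps together without touching the .xmi file, here is a small round trip through an in-memory buffer. It is only a sketch: the `records` list is made-up data standing in for the `span_node.attrib` dictionaries, and `io.StringIO` stands in for `results.txt`.

```python
import collections
import io
import json

# Made-up records shaped like the attribute dictionaries in the question.
records = [
    {"begin": "27", "end": "30", "Character": "Jonathan"},
    {"begin": "31", "end": "37", "Character": "Joyce"},
    {"begin": "155", "end": "158", "Character": "Jonathan"},
]

buffer = io.StringIO()      # stands in for open('results.txt', 'w')
json.dump(records, buffer)  # one JSON array holding every record

buffer.seek(0)              # rewind, as if reopening the file for reading
loaded = json.load(buffer)

tally = collections.Counter(d["Character"] for d in loaded)
for name, n in tally.most_common():
    print("{}: {}".format(name, n))
```

Because the whole list is written as a single JSON document, the reader gets back real dictionaries and no string parsing is needed.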

Another option is the csv format. It lets you specify the list of columns you are interested in and ignore the rest:

import csv

def stimcount():
    with open('results.txt', 'w') as f:
        cf = csv.DictWriter(f, ['begin', 'end', 'Character'], extrasaction='ignore')
        cf.writeheader()
        for rel_node in root.findall("emospan:CharacterRelation",ns):
            if rel_node.attrib['Relation']=="Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion",ns):
                    if span_node.attrib[my_id]==source:

                        print(span_node.attrib['Emotion'])

                    if span_node.attrib[my_id]==target:
                        print(span_node.attrib)
                        cf.writerow(span_node.attrib)

Reading it back is then easy:

import collections
import csv

with open('results.txt') as f:
    cf = csv.DictReader(f)
    r = collections.Counter(d['Character'] for d in cf)
    for n, (ch, number) in enumerate(r.items()): 
        print('{} - {}, {}'.format(n, ch, number))
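
The effect of `extrasaction='ignore'` can be checked on a couple of rows in memory. This is a sketch under assumptions: the `rows` list is sample data copied from the question's file, and `io.StringIO` replaces the real `results.txt`.

```python
import collections
import csv
import io

# Rows with more keys than the columns we want to keep.
rows = [
    {"{http://www.omg.org/XMI}id": "18836", "sofa": "12",
     "begin": "27", "end": "30", "Character": "Jonathan"},
    {"{http://www.omg.org/XMI}id": "18828", "sofa": "12",
     "begin": "31", "end": "37", "Character": "Joyce"},
    {"{http://www.omg.org/XMI}id": "19012", "sofa": "12",
     "begin": "155", "end": "158", "Character": "Jonathan"},
]

buffer = io.StringIO(newline="")  # stands in for the csv file
writer = csv.DictWriter(buffer, ["begin", "end", "Character"],
                        extrasaction="ignore")  # drop keys not listed
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
reader = csv.DictReader(buffer)  # column names come from the header row
csv_tally = collections.Counter(row["Character"] for row in reader)
for name, n in csv_tally.most_common():
    print("{}: {}".format(name, n))
```

Without `extrasaction="ignore"`, `DictWriter` raises a `ValueError` when a row contains keys that are not in the field list.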

Answer 3: (score: 0)

If you like, there is also a pandas solution...

txt = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""

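from io import StringIO

import pandas as pd

# replace the StringIO-stuff by your file-path
# a multi-character sep is handled by the python parser engine
df = pd.read_table(StringIO(txt), sep="'Character': '", header=None, usecols=[1], engine="python")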
            1
0  Jonathan'}
1  Jonathan'}
2     Joyce'}
3     Joyce'}
4      Will'}
5  Jonathan'}
6     Nancy'}
7     Steve'}

df = df[1].str.split('\'', expand=True)
          0  1
0  Jonathan  }
1  Jonathan  }
2     Joyce  }
3     Joyce  }
4      Will  }
5  Jonathan  }
6     Nancy  }
7     Steve  }

df.groupby(0).count()
          1
0          
Jonathan  3
Joyce     2
Nancy     1
Steve     1
Will      1

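The idea is to read the file as two columns separated by the sep string 'Character': ', and to import only the second column (usecols).
Then split again on '.
The rest is an ordinary groupby / count.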