Question

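"data_science_assignment.txt" contains three articles from the LA Times in a semi-structured format. Tags in the collection mark the start and end of each article (<doc> and </doc>), the article ID, the article headline, and the main text (<text> and </text>).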

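I am trying to write a class that can preprocess and store the LA Times articles.

The class's method should take the LA Times article collection as input, extract each article in the collection, and build a hash table in which each key is a word (from the collection) and each value is a linked list of all the documents containing that word, together with the word's count in each document.

For example, the word "the" appears in all three articles: 20 times in the first, 34 times in the second, and 12 times in the third.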

Desired output: -> [1,20] -> [2,34] -> [3,12]

Current output: -> [1,16] -> [2,16] -> [3,16]

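Problem: I can't get a correct word count for the text between the <text> and </text> tags while ignoring the tags themselves. How can I improve my current code to get accurate word counts?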
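To make the "ignore the tags" requirement concrete, here is a minimal sketch using only the standard library. It assumes the collection is well-formed XML with a single root element; the `<collection>`, `<docid>`, and `<p>` tags in `SAMPLE` and the helper name `article_texts` are made up for illustration and are not taken from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only <doc> and <text> are named in the question,
# the other tags here are assumptions for the sake of the example.
SAMPLE = """<collection>
<doc><docid>1</docid><text><p>The cat saw the dog.</p><p>The dog ran.</p></text></doc>
<doc><docid>2</docid><text><p>A dog barked.</p></text></doc>
</collection>"""

def article_texts(xml_string):
    """Return the plain text of every <text> element, in document order."""
    root = ET.fromstring(xml_string)
    texts = []
    for text_el in root.iter("text"):
        # itertext() yields only the text nodes, so nested tags never
        # reach the word counter.
        chunks = (chunk.strip() for chunk in text_el.itertext())
        texts.append(" ".join(chunk for chunk in chunks if chunk))
    return texts

texts = article_texts(SAMPLE)
for doc_id, text in enumerate(texts, start=1):
    print(doc_id, len(text.split()))
```

Counting is then done per article on these strings, so the figure for one article cannot leak into another.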


Answer 1

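With some cleanup, here is my take on the problem:

Changed the XPath parser and expression
Created one variable per article
The counts were incorrect, so the tokenization needed debugging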

# lxml's etree is the only import this script uses (parse + full XPath support)
from lxml import etree

doc = etree.parse("test.xml")

# string() returns the concatenated text of the selected <text> element,
# so any tags nested inside it are dropped
art1 = doc.xpath('string((//text)[1])')
art2 = doc.xpath('string((//text)[2])')
art3 = doc.xpath('string((//text)[3])')

dict1 = {}
dict2 = {}
dict3 = {}
words = []
words1 = []
words2 = []
words3 = []
words1.extend(art1.split())
words2.extend(art2.split())
words3.extend(art3.split())
words.extend(words1)
words.extend(words2)
words.extend(words3)

# Count the lower-cased words of each article in its own dictionary
for word in words1:
    if word.lower() in dict1:
        dict1[word.lower()] += 1
    else:
        dict1[word.lower()] = 1

for word2 in words2:
    if word2.lower() in dict2:
        dict2[word2.lower()] += 1
    else:
        dict2[word2.lower()] = 1

for word3 in words3:
    if word3.lower() in dict3:
        dict3[word3.lower()] += 1
    else:
        dict3[word3.lower()] = 1

# Get words present in all the articles
print("Words present in all articles\n")
dict4 = {}
for word in words:
    w = word.lower()
    if w in dict1 and w in dict2 and w in dict3:
        if w not in dict4:
            dict4[w] = "\t-> [1,%d] -> [2,%d] -> [3,%d]" % (dict1[w], dict2[w], dict3[w])

for k, v in sorted(dict4.items()):
    print(k, v)

print("\n\nWords present in articles 1,2\n")
dict5 = {}
# Get words present in only the first two articles
for word in words:
    w = word.lower()
    if w in dict1 and w in dict2 and w not in dict3:
        # Check the lower-cased form, since that is the key being stored
        if w not in dict5:
            dict5[w] = "\t-> [1,%d] -> [2,%d]" % (dict1[w], dict2[w])

for k, v in sorted(dict5.items()):
    print(k, v)

Result:


Words present in all articles

a       -> [1,27] -> [2,4] -> [3,23]
all     -> [1,1] -> [2,2] -> [3,3]
an      -> [1,6] -> [2,1] -> [3,3]
and     -> [1,34] -> [2,3] -> [3,51]
as      -> [1,6] -> [2,1] -> [3,5]
at      -> [1,4] -> [2,3] -> [3,5]
be      -> [1,4] -> [2,1] -> [3,7]
by      -> [1,6] -> [2,2] -> [3,8]
for     -> [1,7] -> [2,5] -> [3,9]
in      -> [1,26] -> [2,3] -> [3,31]
is      -> [1,16] -> [2,1] -> [3,12]
of      -> [1,56] -> [2,6] -> [3,54]
one     -> [1,4] -> [2,1] -> [3,1]
so      -> [1,4] -> [2,1] -> [3,1]
that    -> [1,11] -> [2,1] -> [3,16]
the     -> [1,94] -> [2,12] -> [3,65]
their   -> [1,1] -> [2,2] -> [3,6]
then    -> [1,1] -> [2,1] -> [3,1]
these   -> [1,1] -> [2,2] -> [3,4]
to      -> [1,22] -> [2,3] -> [3,35]
with    -> [1,7] -> [2,1] -> [3,4]


Words present in articles 1,2

accident.       -> [1,1] -> [2,1]
entire  -> [1,1] -> [2,1]
from    -> [1,1] -> [2,1]
story   -> [1,3] -> [2,1]
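Two things stand out in the output above: "accident." is counted with its trailing period, and the script is hard-wired to three articles. The following is one possible sketch of a more general version, not the original author's code. It assumes the article texts are already plain strings; `WORD_RE`, `tokenize`, `build_index`, and `format_postings` are illustrative names, and a Python list of (doc_id, count) pairs stands in for the linked list the question asks for.

```python
import re
from collections import Counter, defaultdict

# Assumed definition of a "word": runs of letters/digits, with inner apostrophes allowed
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def tokenize(text):
    """Lower-case the text and keep only word-like runs, dropping punctuation."""
    return WORD_RE.findall(text.lower())

def build_index(articles):
    """Map each word to a list of (doc_id, count) pairs, one pair per article."""
    index = defaultdict(list)
    for doc_id, text in enumerate(articles, start=1):
        # Counter is rebuilt for every article, so counts stay per document
        for word, n in Counter(tokenize(text)).items():
            index[word].append((doc_id, n))
    return dict(index)

def format_postings(postings):
    """Render the pairs in the '[doc,count] -> [doc,count]' style used above."""
    return " -> ".join("[%d,%d]" % pair for pair in postings)

# Made-up sample texts, only to show the shape of the result
articles = ["The accident. The story.", "An accident, the story goes.", "The end"]
index = build_index(articles)
for word in sorted(index):
    print("%s\t-> %s" % (word, format_postings(index[word])))
```

Words present in every article are then simply the keys whose list has `len(articles)` entries, which replaces the separate dict4 and dict5 passes.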

Word count between XML tags
