"data_science_assignment.txt" contains three articles from the Los Angeles Times in a semi-structured format. Tags in the collection mark the beginning and end of each article (<doc> and </doc>), the article ID, the article title, and the main text (<text> and </text>).
I am trying to write a class that can preprocess and store the LA Times articles. A method of this class should take the LA Times article collection as input, extract each article in the collection, and build a hash table whose keys are the words (in the collection) and whose values are linked lists of all the documents that contain the word, together with the word's count in each document.
For example, the word "the" appears in all three articles: 20 times in the first, 34 times in the second, and 12 times in the third.
Expected output: -> [1,20] -> [2,34] -> [3,12]
Current output: -> [1,16] -> [2,16] -> [3,16]
Problem: I cannot correctly count the words between the <text> and </text> tags while ignoring the tags themselves. How can I improve the current code to get accurate word counts?
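For reference, the word-to-postings structure described above can be sketched with a plain dict mapping each word to a list of [doc_id, count] pairs (a stand-in for the linked list). The sample documents below are hypothetical, just to show the shape of the result:

```python
from collections import Counter

def build_index(docs):
    """Map each lower-cased word to a list of [doc_id, count] pairs."""
    index = {}
    for doc_id, text in docs:
        # Counter gives per-document word frequencies in one pass
        for word, count in Counter(text.lower().split()).items():
            index.setdefault(word, []).append([doc_id, count])
    return index

# Hypothetical stand-ins for the three parsed <text> bodies
docs = [(1, "the cat saw the dog"), (2, "the dog ran"), (3, "a cat slept")]
index = build_index(docs)
print(index["the"])  # [[1, 2], [2, 1]]
```

With real input, the text passed in would be whatever is extracted from between the <text> and </text> tags.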
Answer 0 (score: 1)
With some cleanup, here is my take on the problem:
Changed the XPath parser and expressions
Created one variable per article
The counts were wrong, so the tokenization needed debugging
from lxml import etree

doc = etree.parse("test.xml")

# string() on each <text> node returns its text content with the tags stripped
art1 = doc.xpath('string((//text)[1])')
art2 = doc.xpath('string((//text)[2])')
art3 = doc.xpath('string((//text)[3])')

# Per-article word-frequency dictionaries, keyed by lower-cased token
dict1 = {}
dict2 = {}
dict3 = {}

words1 = art1.split()
words2 = art2.split()
words3 = art3.split()
words = words1 + words2 + words3

for word in words1:
    if word.lower() in dict1:
        dict1[word.lower()] += 1
    else:
        dict1[word.lower()] = 1

for word in words2:
    if word.lower() in dict2:
        dict2[word.lower()] += 1
    else:
        dict2[word.lower()] = 1

for word in words3:
    if word.lower() in dict3:
        dict3[word.lower()] += 1
    else:
        dict3[word.lower()] = 1

# Words present in all three articles
print("Words present in all articles\n")
dict4 = {}
for word in words:
    w = word.lower()
    if w in dict1 and w in dict2 and w in dict3:
        if w not in dict4:
            dict4[w] = "\t-> [1,%d] -> [2,%d] -> [3,%d]" % (dict1[w], dict2[w], dict3[w])
for k, v in sorted(dict4.items()):
    print(k, v)

print("\n\nWords present in articles 1,2\n")
# Words present in only the first two articles
dict5 = {}
for word in words:
    w = word.lower()
    if w in dict1 and w in dict2 and w not in dict3:
        if w not in dict5:
            dict5[w] = "\t-> [1,%d] -> [2,%d]" % (dict1[w], dict2[w])
for k, v in sorted(dict5.items()):
    print(k, v)
Result:
Words present in all articles
a -> [1,27] -> [2,4] -> [3,23]
all -> [1,1] -> [2,2] -> [3,3]
an -> [1,6] -> [2,1] -> [3,3]
and -> [1,34] -> [2,3] -> [3,51]
as -> [1,6] -> [2,1] -> [3,5]
at -> [1,4] -> [2,3] -> [3,5]
be -> [1,4] -> [2,1] -> [3,7]
by -> [1,6] -> [2,2] -> [3,8]
for -> [1,7] -> [2,5] -> [3,9]
in -> [1,26] -> [2,3] -> [3,31]
is -> [1,16] -> [2,1] -> [3,12]
of -> [1,56] -> [2,6] -> [3,54]
one -> [1,4] -> [2,1] -> [3,1]
so -> [1,4] -> [2,1] -> [3,1]
that -> [1,11] -> [2,1] -> [3,16]
the -> [1,94] -> [2,12] -> [3,65]
their -> [1,1] -> [2,2] -> [3,6]
then -> [1,1] -> [2,1] -> [3,1]
these -> [1,1] -> [2,2] -> [3,4]
to -> [1,22] -> [2,3] -> [3,35]
with -> [1,7] -> [2,1] -> [3,4]
Words present in articles 1,2
accident. -> [1,1] -> [2,1]
entire -> [1,1] -> [2,1]
from -> [1,1] -> [2,1]
story -> [1,3] -> [2,1]
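As a side note, the three per-article counting loops above can be condensed with collections.Counter, which produces the same lower-cased frequency table. A minimal sketch, using a hypothetical snippet of article text in place of the xpath result:

```python
from collections import Counter

def word_counts(text):
    """Lower-cased word frequencies, equivalent to the dict1/dict2/dict3 loops."""
    return Counter(text.lower().split())

# Hypothetical text standing in for doc.xpath('string((//text)[1])')
counts = word_counts("The story of the entire story")
print(counts["the"], counts["story"])  # 2 2
```

The "words present in all articles" step then becomes a key intersection: `counts1.keys() & counts2.keys() & counts3.keys()`.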