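I have a set of document IDs (keys.csv) that I use to fetch a set of text documents from a document source. I want to collect all of these text documents into a corpus for further analysis (such as cosine similarity).
I am using the code below to append each text document to the corpus, but I am not sure whether it works. Is there a better way to build a corpus from these text documents?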
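import pandas

keys = pandas.read_csv('keys.csv')
corpus = []
# Iterate over the ID values in the first column; iterating over the
# DataFrame itself would yield column names instead.
for ID in keys.iloc[:, 0]:
    doc = function_to_get_document(ID)
    corpus.append(doc)  # append works in place and returns None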
Answer 0 (score: 1)
If the csv has a column IDcol of unique IDs, use a list comprehension; the output is a list:

corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]

Sample:

import pandas as pd

print (pd.read_csv('keys.csv'))
   IDcol
0      1
1      2
2      3

def function_to_get_document(x):
    return x + 1

corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]
print (corpus)
[2, 3, 4]
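Once the corpus is a plain list of strings, the cosine similarity mentioned in the question can be computed on it directly. The following is only a minimal sketch using the standard library: the in-memory CSV, the `FAKE_SOURCE` dictionary, the body of `function_to_get_document`, and the bag-of-words tokenization (lowercased whitespace split) are all assumptions made for illustration, not part of the original question or answer.

```python
import csv
import io
import math
from collections import Counter

# Stand-in for keys.csv; the column name IDcol follows the answer above.
KEYS_CSV = "IDcol\n1\n2\n3\n"

# Hypothetical document source mapping an ID to its text (assumption).
FAKE_SOURCE = {
    1: "the cat sat on the mat",
    2: "the cat sat on the sofa",
    3: "stock prices fell sharply today",
}

def function_to_get_document(doc_id):
    """Placeholder fetcher; replace with the real lookup."""
    return FAKE_SOURCE[doc_id]

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a simple bag-of-words model."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Read the IDs, then build the corpus with a list comprehension.
ids = [int(row["IDcol"]) for row in csv.DictReader(io.StringIO(KEYS_CSV))]
corpus = [function_to_get_document(i) for i in ids]

sim_close = cosine_similarity(corpus[0], corpus[1])
sim_far = cosine_similarity(corpus[0], corpus[2])
print(round(sim_close, 3), round(sim_far, 3))
```

For a larger corpus, a TF-IDF representation (for example scikit-learn's `TfidfVectorizer` together with `sklearn.metrics.pairwise.cosine_similarity`) is probably a better fit than raw word counts, but the list-of-strings corpus built above is the input either approach expects.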