How to create a corpus from a set of text files - python?

Time: 2017-02-21 09:08:27

Tags: python pandas scikit-learn nlp corpus

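I have a set of document IDs (keys.csv) that I use to fetch a set of text documents from a document source. I want to collect all of these text documents into a corpus for further analysis (such as cosine similarity).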

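I am using the code below to append each text document to the corpus, but I am not sure it works. Is there a better way to create a corpus from these text documents?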

keys = pandas.read_csv(keys.csv)
for i in keys:
    ID = i
    doc = function_to_get_document(ID)
    corpus = corpus.append(doc)

1 answer:

Answer 0: (score: 1)

If the csv has a column IDcol with unique IDs, use a list comprehension; the output is a list:

corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]

import pandas as pd

# Contents of keys.csv
print(pd.read_csv('keys.csv'))
#    IDcol
# 0      1
# 1      2
# 2      3

# Stand-in for the real document fetcher
def function_to_get_document(x):
    return x + 1

corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]
print(corpus)
# [2, 3, 4]
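
To connect this to the cosine-similarity goal in the question, here is a minimal sketch of the same list comprehension applied to text documents. It assumes the IDs live in a column named IDcol, as above. The in-memory CSV, the DOCUMENTS dictionary, and the body of function_to_get_document are hypothetical stand-ins for the real keys.csv and document source. TF-IDF is one possible vectorization, not necessarily what the asker intends to use.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for keys.csv, kept in memory so the sketch runs as-is.
KEYS_CSV = io.StringIO("IDcol\n1\n2\n3\n")

# Hypothetical document store; the real function_to_get_document would
# query the actual document source instead.
DOCUMENTS = {
    1: "the cat sat on the mat",
    2: "the cat sat on the sofa",
    3: "stock prices fell sharply today",
}


def function_to_get_document(doc_id):
    """Return the text for one document ID (illustrative lookup only)."""
    return DOCUMENTS[doc_id]


ids = pd.read_csv(KEYS_CSV)["IDcol"]

# Iterate over the column's values, not over the DataFrame itself
# (iterating a DataFrame yields its column labels).
corpus = [function_to_get_document(doc_id) for doc_id in ids]

# Turn the corpus into TF-IDF vectors and compare every pair of documents.
tfidf = TfidfVectorizer().fit_transform(corpus)
similarity = cosine_similarity(tfidf)

print(similarity.shape)
```

If an explicit loop is preferred, note that list.append modifies the list in place and returns None, so corpus = corpus.append(doc) would replace the list with None; initialize corpus = [] before the loop and call corpus.append(doc) on its own.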
