
Time: 2017-04-17 12:52:48

Tags: python pandas numpy nltk spacy



doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')

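However, I have a large amount of text stored in tabular form in SQL Server or Excel. The table basically has two columns: the first holds a unique identifier, and the second holds a short text.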



3 answers:

Answer 0 (score: 4)


$ cat test.tsv
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot

$ cut -f2 test.tsv
Foo bar bar
bar bar black sheep


$ python
>>> import pandas as pd
>>> pd.read_csv('test.tsv', delimiter='\t')
   DocID                 Text WhateverAnnotations
0      1          Foo bar bar         dot dot dot
1      2  bar bar black sheep     dot dot dot dot
>>> df = pd.read_csv('test.tsv', delimiter='\t')
>>> df['Text']
0            Foo bar bar
1    bar bar black sheep
Name: Text, dtype: object


>>> import spacy
>>> nlp = spacy.load('en')
>>> for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1, n_threads=4):
...     print (parsed_doc[0].text, parsed_doc[0].tag_)
bar NN
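
Because the table in the question has an identifier column, it can help to keep each parsed document paired with its DocID. Below is a minimal sketch, assuming the pipeline yields results in the same order as its input (which `nlp.pipe` does). The names `pair_ids_with_docs` and `fake_pipe` are hypothetical, and `fake_pipe` is only a whitespace-splitting stand-in so that the sketch runs without a spaCy model installed:

```python
def pair_ids_with_docs(ids, texts, pipe):
    """Return an iterator of (doc_id, parsed_doc) pairs.

    Assumes `pipe` yields one result per input text, in input order.
    """
    # zip() is lazy, so documents are streamed rather than collected in a list.
    return zip(ids, pipe(iter(texts)))


def fake_pipe(texts):
    """Stand-in for nlp.pipe: splits on whitespace instead of running a parser."""
    for text in texts:
        yield text.split()


ids = [1, 2]
texts = ["Foo bar bar", "bar bar black sheep"]

for doc_id, doc in pair_ids_with_docs(ids, texts, fake_pipe):
    # Print the identifier together with the first "token" of its document.
    print(doc_id, doc[0])
```

With the real pipeline, the call would presumably look like `pair_ids_with_docs(df['DocID'], df['Text'], nlp.pipe)`.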


>>> df['Parsed'] = df['Text'].apply(nlp)

>>> df['Parsed'].iloc[0]
Foo bar bar
>>> type(df['Parsed'].iloc[0])
<class 'spacy.tokens.doc.Doc'>
>>> df['Parsed'].iloc[0][0].tag_
>>> df['Parsed'].iloc[0][0].text



$ cat test.tsv 
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot

$ tail -n 2 test.tsv > rows2

$ perl -ne 'print "$_" x1000000' rows2 > rows2000000

$ cat test.tsv rows2000000 > test-2M.tsv

$ wc -l test-2M.tsv 
 2000003 test-2M.tsv

$ head test-2M.tsv 
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot


import time

import pandas as pd
import spacy

df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1000, n_threads=4):
    x = parsed_doc[0].tag_
print(time.time() - start)


import time

import pandas as pd
import spacy

df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
df['Parsed'] = df['Text'].apply(nlp)

for doc in df['Parsed']:
    x = doc[0].tag_
print(time.time() - start)
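
The two scripts above differ only in the work done between `start` and the final `print`, so a small helper can make such comparisons less repetitive. This is a sketch, not part of the original benchmark: `timed` and `first_words` are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals. The workload here is a stand-in that needs no spaCy model:

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def first_words(texts):
    """Stand-in workload: take the first word of each text."""
    return [text.split()[0] for text in texts]


result, elapsed = timed(first_words, ["Foo bar bar", "bar bar black sheep"])
print(result, "took", elapsed, "seconds")
```

To compare the two approaches, each one would be wrapped in its own function and passed to `timed`. Since `nlp.pipe` returns a generator, the wrapped function has to consume it (for example with a `for` loop), otherwise only the creation of the generator is measured.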

Answer 1 (score: 1)

I think alexis's comment about using pandas .apply() is the best answer; it worked very well for me:

import pandas as pd
import spacy

nlp = spacy.load('en')  # load the English model, as in the answer above
df = pd.read_csv('doc filename.txt')
df['text_as_spacy_objects'] = df['text column name'].apply(nlp)

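Answer 2 (score: 0)

This should be quite simple: you can use whatever method you like to read the texts from the database (a Pandas dataframe, a CSV reader, etc.) and then iterate over them.

Ultimately it depends on what you want to do and how you want to process the text. If you want to process each text individually, just iterate over the data row by row: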

# rows: any iterable of (identifier, text) pairs
for doc_id, line in rows:
    doc = nlp(line)
    # do something with each text
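
For the database case, the loop above might look like the following. This is a sketch under several assumptions: an in-memory SQLite database stands in for SQL Server, the table and column names (`docs`, `doc_id`, `text`) are made up for illustration, and `parse` is a placeholder for the spaCy `nlp` object so that the example runs without a model installed:

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO docs (doc_id, text) VALUES (?, ?)",
    [(1, "Foo bar bar"), (2, "bar bar black sheep")],
)


def parse(text):
    """Placeholder for nlp(text): splits on whitespace."""
    return text.split()


def parse_rows(connection, parser):
    """Return {doc_id: parsed_text} for every row of the hypothetical docs table."""
    cursor = connection.execute("SELECT doc_id, text FROM docs ORDER BY doc_id")
    # Iterating over the cursor yields one (doc_id, text) row at a time.
    return {doc_id: parser(text) for doc_id, text in cursor}


parsed = parse_rows(conn, parse)
print(parsed[2])
```

Keying the results by identifier keeps each parsed text traceable to its source row; passing the real `nlp` object as `parser` would store `Doc` objects instead of word lists.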


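If you would rather treat all of the text as one document, read the whole file and parse it in a single call:

with open('some_large_text_file.txt') as f:
    text = f.read()
doc = nlp(text)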
