Documentation for spaCy [orth, pos, tag, lemma and text]

Posted: 2017-05-16 00:15:48

Tags: python nlp cython spacy

I am new to spaCy. I am adding this post as documentation, to make things simple for newcomers like me.


I would like to understand what orth, lemma, tag and pos mean, and what the difference is between print(word) and print(word.orth_). The following code prints out these values:

import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)
    print(word)

2 Answers:

Answer 0 (score: 13):

  

What do orth, lemma, tag and pos mean?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes
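In brief, summarizing the linked page (a minimal sketch; the exact tag values you see depend on the model/version you have installed, and newer spaCy versions use the model name 'en_core_web_sm' instead of the 'en' shortcut):

import spacy

nlp = spacy.load('en')  # on newer spaCy versions use 'en_core_web_sm'
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')

for word in doc:
    # word.text   -> the verbatim token text
    # word.orth_  -> the same verbatim text, looked up in the vocabulary's string store
    # word.lemma_ -> the base form of the word (inflected forms are reduced)
    # word.pos_   -> the coarse-grained, universal part-of-speech tag (VERB, NOUN, ...)
    # word.tag_   -> the fine-grained, treebank-style tag (VB, NNP, ...)
    # the variants without the underscore (orth, lemma, pos, tag) are integer IDs, not strings
    print(word.text, word.lemma_, word.pos_, word.tag_)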

  

What is the difference between print(word) and print(word.orth_)?

Very short:

word.orth_ and word.text are the same. By convention, the cython properties that end with an underscore return the human-readable string, while the variants without the underscore (orth, lemma, pos, tag) return the integer IDs that spaCy keeps internally and that the developers don't really expect users to work with directly.
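A quick way to see this for yourself (a small sketch; the integer ID you get will differ between spaCy versions, since newer versions use hashes):

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer versions
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')

token = doc[0]
print(token.orth)                  # an integer ID into the vocabulary's string store
print(token.orth_)                 # 'KEEP'
print(token.text)                  # 'KEEP' as well -- .text simply returns .orth_
print(token.orth_ == token.text)   # True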

In short:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks up the index into the vocabulary that holds all the lexical strings:

property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]

(For more details on self.c.lex.orth, see the In long section below.)

And word.text returns the string representation of the word, which is simply the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128:

property text:
    def __get__(self):
        return self.orth_

And when you call print(word), it invokes the __repr__ dunder, which goes through word.__unicode__ or word.__bytes__, both of which return the word.text variable; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
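So, as a quick sanity check (a small sketch), all three of the following print the same string, since __repr__ ultimately returns word.text:

import spacy

nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
word = doc[1]

print(word)        # goes through __repr__ -> __str__ -> __unicode__ -> self.text
print(word.text)   # the token's string
print(word.orth_)  # the same string, looked up from the vocab's string store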

In long:

Let's walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After you pass a sentence into the nlp() function, it produces a spacy.tokens.doc.Doc object:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So a spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Inside the Token class we see a series of cython properties enumerated, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:

property orth:
    def __get__(self):
        return self.c.lex.orth

Tracing back, we see that self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

Without the full documentation we don't really know what self.c means, but from the looks of it, it accesses one of the tokens through the &self.doc reference pointing to the Doc doc that was passed into the __cinit__ function. So most likely it is a shortcut for accessing the tokens.

Looking at how Doc.c is set up in the Doc constructor:

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

Now we see that Doc.c refers to a cython pointer array data_start, which allocates the memory that stores the TokenC structs for the spacy.tokens.doc.Doc (correct me if I got the explanation of <TokenC*> wrong).

So going back to self.c = &self.doc.c[offset], it is basically accessing the memory location where that array is stored and, more specifically, the "offset-th" item in the array.

That is what a spacy.tokens.token.Token is.

Going back to the

property orth:
    def __get__(self):
        return self.c.lex.orth

we see that self.c.lex is accessing data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer indicating the index under which the word's occurrence is kept in the internal vocabulary.

And thus we see that the property orth_ tries to access self.vocab.strings with the index from self.c.lex.orth, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
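The same round trip is visible from the Python side without touching the cython internals (a small sketch; the integer value itself depends on the spaCy version):

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a foo bar sentence.')
token = doc[3]

idx = token.orth                               # the integer kept in the TokenC struct
print(idx)
print(nlp.vocab.strings[idx])                  # 'foo' -- the lookup that orth_ performs
print(nlp.vocab.strings[idx] == token.orth_)   # True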

Answer 1 (score: 1):

1) When you print word, you are basically printing spacy's Token class, which is set up to print the string from the class; you can see more here. So it differs from printing word.orth_ or word.text only in that those print the string directly.

2) I am not sure about word.orth_; in most cases it seems to be the same as word.text. As for word.lemma_, it is the lemmatized form of the given word, e.g. is, am and are will all map to be in word.lemma_.
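For example (a small sketch; lemmatization uses the POS tagger, so the output can vary slightly between model versions):

import spacy

nlp = spacy.load('en')  # 'en_core_web_sm' on newer versions
doc = nlp(u'I am here , you are there , she is away .')

for word in doc:
    print(word.text, word.lemma_)   # "am", "are" and "is" should all lemmatize to "be"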