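Does gensim.corpora.Dictionary have term frequency saved?

Time: 2017-10-11 09:37:56

Tags: python dictionary frequency gensim tf-idf

Does gensim.corpora.Dictionary save term frequencies?

From gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. the number of documents in which a particular word occurs):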

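from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[OUT]:

The word "these" appears in 1213 documents

There is also the filter_n_most_frequent(remove_n) function, which can remove the n most frequent tokens:

Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!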

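Does filter_n_most_frequent(remove_n) remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?

6 answers:

Answer 0: (score: 4)

No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables: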

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

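This means that everything in the class defines frequency as document frequency, never as term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.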
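To make the distinction concrete, here is a minimal plain-Python sketch (it does not use gensim and is not gensim's implementation) of dropping the most frequent tokens by document frequency and then closing the gaps in the ids. The toy documents, the id scheme, and the helper name drop_most_frequent are all assumptions made up for illustration.

```python
from collections import Counter

# Toy corpus: "spam" is the most frequent term, "eggs" appears in the most documents.
documents = [
    ["spam", "spam", "spam", "spam", "eggs"],
    ["eggs", "ham"],
    ["eggs", "ham"],
]

token2id = {}          # token -> id, assigned in order of first appearance (toy scheme)
doc_freq = Counter()   # token -> number of documents containing it
term_freq = Counter()  # token -> total number of occurrences
for doc in documents:
    term_freq.update(doc)
    doc_freq.update(set(doc))
    for token in doc:
        token2id.setdefault(token, len(token2id))


def drop_most_frequent(token2id, doc_freq, remove_n):
    """Hypothetical helper: drop the remove_n tokens with the highest
    document frequency, then renumber the survivors without gaps."""
    doomed = {tok for tok, _ in doc_freq.most_common(remove_n)}
    kept = sorted((tok for tok in token2id if tok not in doomed), key=token2id.get)
    return {tok: new_id for new_id, tok in enumerate(kept)}


filtered = drop_most_frequent(token2id, doc_freq, remove_n=1)
print(token2id)  # {'spam': 0, 'eggs': 1, 'ham': 2}
print(filtered)  # {'spam': 0, 'ham': 1}
```

Ranking by document frequency removes "eggs", while ranking by term frequency would have removed "spam". The id of "ham" also changes from 2 to 1, which is the effect the note about gap shrinking warns about.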

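Answer 1: (score: 2)

I had the same simple question. It appears that the frequency of a word is hidden and not accessible in the object. I am not sure why, since it makes testing and validation a pain. What I did was export the dictionary as text: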

dictionary.save_as_text('c:\\research\\gensimDictionary.txt')

That text file has three columns. For example, here are the words "summit", "summon" and "sumo":

Key     Word     Frequency
10      summit   1227
3658    summon   118
8477    sumo     40

I found a solution: .cfs is the word frequency. See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary

print(str(dictionary[10]), str(dictionary.cfs[10])) 

summit 1227

Simple.
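If you want to check the exported numbers programmatically, the three-column file described above can be read back with a few lines of plain Python. This is only a sketch: it assumes the columns are tab-separated, it uses an in-memory string with the three sample rows instead of the real file, and read_export is a hypothetical helper name.

```python
import io

# Stand-in for the exported text file; tab-separated columns are an assumption.
exported = "10\tsummit\t1227\n3658\tsummon\t118\n8477\tsumo\t40\n"


def read_export(handle):
    """Hypothetical helper: map each word to its (id, frequency) pair."""
    rows = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip header lines or anything that is not id/word/frequency
        token_id, token, freq = fields
        rows[token] = (int(token_id), int(freq))
    return rows


table = read_export(io.StringIO(exported))
print(table["summit"])  # (10, 1227)
```

With a real export you would pass an open file handle instead of the io.StringIO object.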

Answer 2: (score: 1)

Could you do something like this?

import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab = list(dictionary.values())  # list of terms in the dictionary
vocab_tf = [dict(i) for i in corpus]  # one {token_id: count} dict per document
vocab_tf = list(pd.DataFrame(vocab_tf).sum(axis=0))  # list of term frequencies

Answer 3: (score: 0)

The dictionary does not have it, but the corpus does.

from gensim import corpora

# Term frequency
# load dictionary
dictionary = corpora.Dictionary.load('YourDict.dict')
# load corpus
corpus = corpora.MmCorpus('YourCorpus.mm')
# one list of (term, frequency) pairs per document
CorpusTermFrequency = [[(dictionary[id], freq) for id, freq in cp] for cp in corpus]

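Answer 4: (score: 0)

An efficient approach that computes the term frequencies directly from the bag-of-words representation instead of creating dense vectors.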

corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab_tf={}
for i in corpus:
    for item,count in dict(i).items():
        if item in vocab_tf:
            vocab_tf[item]+=count
        else:
            vocab_tf[item] = count
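The same accumulation can be written a little more compactly with collections.Counter. This is a sketch, not a replacement for the loop above: the corpus below is a hand-written stand-in for the output of dictionary.doc2bow, so the ids and counts are made up.

```python
from collections import Counter

# Hand-written stand-in for [dictionary.doc2bow(sent) for sent in documents]:
# each document is a list of (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 4)],
    [(0, 1), (2, 1)],
    [(0, 1), (2, 1)],
]

vocab_tf = Counter()
for bow in corpus:
    for token_id, count in bow:
        vocab_tf[token_id] += count  # missing keys start at 0

print(dict(vocab_tf))  # {0: 3, 1: 4, 2: 2}
```

Counter removes the need for the if/else branch, and vocab_tf.most_common(n) then gives the n most frequent token ids directly.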

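Answer 5: (score: 0)

gensim.corpora.Dictionary now stores term frequencies in its cfs attribute. You can see the documentation here:

cfs
Collection frequencies: token_id -> how many instances of this token are contained in the documents.
Type: dict of (int, int)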