TfidfVectorizer on a large corpus with a generator

Time: 2018-06-07 20:57:36

Tags: python scikit-learn generator corpus tfidfvectorizer

I have a large corpus split into 5K files, and I am trying to use the TF-IDF transform to build an IDF-based vocabulary.

Here is the code. Basically, I have an iterator that walks a directory of .tsv files, reads each file, and yields its content.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def make_corpus():
    inputFeatureFiles = [x for x in os.listdir('C:\Folder') if x.endswith("*.tsv")]
    for file in inputFeatureFiles:
        filePath= 'C:\\' + os.path.splitext(file)[0] + ".tsv"
        with open(filePath, 'rb') as infile:
            content = infile.read()
            yield content 

corpus = make_corpus()
vectorizer = TfidfVectorizer(stop_words='english',use_idf=True, max_df=0.7, smooth_idf=True)

vectorizer.fit_transform(corpus)

This produces the following error:

c:\python27\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    809             vocabulary = dict(vocabulary)
    810             if not vocabulary:
--> 811                 raise ValueError("empty vocabulary; perhaps the documents only"
    812                                  " contain stop words")
    813 

ValueError: empty vocabulary; perhaps the documents only contain stop words
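
One plausible cause, judging only from the code above: `str.endswith("*.tsv")` compares against a literal asterisk, so no ordinary filename passes the filter, the generator yields nothing, and the vectorizer sees zero documents. The path is also rebuilt from `'C:\\'` rather than from the folder that was listed. The sketch below shows the suffix check and a generator that reads each file as text and closes it before moving on. The function name `iter_tsv_documents` and the UTF-8 encoding are assumptions for illustration, not part of the original post.

```python
import os


def iter_tsv_documents(folder, encoding="utf-8"):
    """Yield the text of each .tsv file in `folder`, one file at a time."""
    for name in sorted(os.listdir(folder)):
        # endswith() does plain suffix matching, not glob matching,
        # so the suffix must be ".tsv" rather than "*.tsv".
        if not name.endswith(".tsv"):
            continue
        # Join against the folder that was listed, not the drive root.
        path = os.path.join(folder, name)
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        with open(path, encoding=encoding, errors="replace") as infile:
            yield infile.read()


# The filter used in the question never matches an ordinary filename:
print("file1.tsv".endswith("*.tsv"))  # False
print("file1.tsv".endswith(".tsv"))   # True
```

A generator like this can be passed directly to `fit_transform`, since the vectorizer only needs an iterable of strings.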

I also tried this:

corpusGenerator = [open(os.path.join(r'C:\CorpusFiles', f)) for f in os.listdir(r'C:\CorpusFiles')]
vectorizer = TfidfVectorizer(stop_words='english',use_idf=True,smooth_idf=True, sublinear_tf=True, input="file", min_df=1)
feat = vectorizer.fit_transform(corpusGenerator)

and got the following error:

[Errno 24] Too many open files: 'C:\CorpusFiles\file1.tsv'
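
The list comprehension above opens every file before fitting starts, which is probably what exhausts the process's limit on open files. A minimal sketch of an alternative, assuming scikit-learn's documented `input="filename"` mode, in which the vectorizer receives paths and reads each file itself during fitting. The helper names `list_tsv_paths` and `fit_tfidf_on_files` are hypothetical, and `decode_error="replace"` is an added assumption; the other vectorizer options mirror the ones in the question.

```python
import os


def list_tsv_paths(folder):
    """Return full paths of the .tsv files in `folder` without opening them."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.endswith(".tsv")
    ]


def fit_tfidf_on_files(paths):
    """Fit a TfidfVectorizer on files given by path; returns (vectorizer, matrix)."""
    # Imported here so list_tsv_paths() stays usable without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        input="filename",        # documents are paths, read during fitting
        stop_words="english",
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True,
        decode_error="replace",  # assumption: tolerate stray non-UTF-8 bytes
    )
    matrix = vectorizer.fit_transform(paths)
    return vectorizer, matrix
```

Usage would look like `vectorizer, feat = fit_tfidf_on_files(list_tsv_paths(r'C:\CorpusFiles'))`. Only a list of path strings is held in memory up front, so no file handles stay open between documents.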

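What is the best way to use TfidfVectorizer on a large corpus? I also tried appending a constant string to each yielded string to avoid the first error, but that did not fix it either. Any help is appreciated!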

1 Answer:

Answer 0: (score: 0)

Hey, I looked into the same problem recently. From my experience, maybe you could try the following demo code:

public void onMapReady(GoogleMap googleMap) {
    mMap = googleMap;
    mMap.moveCamera(CameraUpdateFactory.newLatLngZoom(CILACAP, 10f));

    mapWrapperLayout = (MapWrapperLayout)findViewById(R.id.map_relative_layout);

    // MapWrapperLayout initialization
    // 39 - default marker height
    // 20 - offset between the default InfoWindow bottom edge and its content bottom edge
    mapWrapperLayout.init(mMap, getPixelsFromDp(this, 39 + 20));

    // We want to reuse the info window for all the markers,
    // so let's create only one class member instance
    this.infoWindow = (ViewGroup)getLayoutInflater().inflate(R.layout.custom_info_window, null);
    this.infoTitle = (TextView)infoWindow.findViewById(R.id.title);
    this.infoSnippet = (TextView)infoWindow.findViewById(R.id.snippet);
    this.infoButton = (Button)infoWindow.findViewById(R.id.button);

    // Setting custom OnTouchListener which deals with the pressed state
    // so it shows up
    this.infoButtonListener = new OnInfoWindowElemTouchListener(infoButton,
            getResources().getDrawable(R.drawable.button_normal),
            getResources().getDrawable(R.drawable.button_pressed)) {
        @Override
        protected void onClickConfirmed(View v, Marker marker) {
            // Here we can perform some action triggered after clicking the button
            //Toast.makeText(MainActivity.this, "Tombol " + marker.getTitle() + " di click!", Toast.LENGTH_SHORT).show();
            Intent pdf = new Intent(MainActivity.this, showPdf.class);
            pdf.putExtra("title", marker.getTitle());
            startActivity(pdf);
        }
    };
    this.infoButton.setOnTouchListener(infoButtonListener);


    mMap.setInfoWindowAdapter(new GoogleMap.InfoWindowAdapter() {
        @Override
        public View getInfoWindow(Marker marker) {
            return null;
        }

        @Override
        public View getInfoContents(Marker marker) {
            // Setting up the infoWindow with current's marker info
            infoTitle.setText(marker.getTitle());
            infoSnippet.setText(marker.getSnippet());
            //infoImage.setImage(marker.getMarker());
            infoButtonListener.setMarker(marker);

            // We must call this to set the current marker and infoWindow references
            // to the MapWrapperLayout
            mapWrapperLayout.setMarkerWithInfoWindow(marker, infoWindow);
            return infoWindow;
        }
    });

    // Let's add a couple of markers
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Adipala")
            .snippet("Kecamatan Adipala")
            .position(ADIPALA)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );

    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Bantarsari")
            .snippet("Kecamatan Bantarsari")
            .position(BANTARSARI).icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );

    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Binangun")
            .snippet("Kecamatan Binangun")
            .position(BINANGUN)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Cilacap Utara")
            .snippet("Kecamatan Cilacap Utara")
            .position(CILACAPUTARA).icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
}

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);

    mapWrapperLayout = (MapWrapperLayout)findViewById(R.id.map_relative_layout);

    // Obtain the SupportMapFragment and get notified when the map is ready to be used.
    SupportMapFragment mapFragment = (SupportMapFragment) getSupportFragmentManager().findFragmentById(R.id.map);
    mapFragment.getMapAsync(this);

}

Good luck!