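I have a large corpus split across 5K files, and I am trying to build an IDF-based vocabulary using the TF-IDF transform.
Here is the code. Basically, I have an iterator that walks a directory of .tsv files, reads each file, and yields its contents.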
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def make_corpus():
    inputFeatureFiles = [x for x in os.listdir('C:\Folder') if x.endswith("*.tsv")]
    for file in inputFeatureFiles:
        filePath = 'C:\\' + os.path.splitext(file)[0] + ".tsv"
        with open(filePath, 'rb') as infile:
            content = infile.read()
            yield content

corpus = make_corpus()
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, max_df=0.7, smooth_idf=True)
vectorizer.fit_transform(corpus)
This produces the following error:
c:\python27\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
809 vocabulary = dict(vocabulary)
810 if not vocabulary:
--> 811 raise ValueError("empty vocabulary; perhaps the documents only"
812 " contain stop words")
813
ValueError: empty vocabulary; perhaps the documents only contain stop words
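A note on one possible cause, offered as a sketch rather than a confirmed diagnosis: `str.endswith` compares literal text and does not expand `*`, so the filter `x.endswith("*.tsv")` in `make_corpus` would match no ordinary file name. The generator would then yield nothing, which is consistent with an empty vocabulary. The file names below are made up for illustration.

```python
import fnmatch

# Hypothetical file names, for illustration only.
names = ["file1.tsv", "file2.tsv", "notes.txt"]

# str.endswith is a literal suffix test; "*" is not treated as a wildcard.
literal_star = [n for n in names if n.endswith("*.tsv")]

# Option 1: test the plain suffix.
by_suffix = [n for n in names if n.endswith(".tsv")]

# Option 2: use fnmatch when a shell-style pattern is wanted.
by_pattern = fnmatch.filter(names, "*.tsv")

print(literal_star)
print(by_suffix)
print(by_pattern)
```

Separately, `make_corpus` lists `C:\Folder` but builds each path under `C:\`, so it may be worth joining the file name onto the same directory that was listed.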
I also tried this:
corpusGenerator = [open(os.path.join('C:\\CorpusFiles', f)) for f in os.listdir('C:\\CorpusFiles')]
vectorizer = TfidfVectorizer(stop_words='english',use_idf=True,smooth_idf=True, sublinear_tf=True, input="file", min_df=1)
feat = vectorizer.fit_transform(corpusGenerator)
and got the following error:
[Errno 24] Too many open files: 'C:\CorpusFiles\file1.tsv'
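What is the best way to use TfidfVectorizer on a large corpus? I also tried appending a constant string to each yielded string to avoid the first error, but that did not fix it either. Any help is appreciated!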
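One way to keep only a single file open at a time, sketched under assumptions: list the `.tsv` names first, then open, read, and close each file inside a generator, so no handle outlives its own read. `iter_tsv_documents` and its `directory` argument are hypothetical names, and UTF-8 is assumed as the file encoding.

```python
import io
import os


def iter_tsv_documents(directory):
    """Yield the text of each .tsv file in `directory`, one file at a time."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".tsv"):
            continue
        # Build the path from the same directory that was listed.
        path = os.path.join(directory, name)
        # Assumption: files are UTF-8 encoded; adjust if they are not.
        with io.open(path, encoding="utf-8") as infile:
            content = infile.read()
        # The file is already closed when the document is handed out.
        yield content
```

A generator like this should be usable wherever an iterable of documents is expected, for example as the argument to `TfidfVectorizer.fit_transform`, although that call is not exercised here. As far as I know, the vectorizer's `input='filename'` option is an alternative: it takes a list of paths and reads each file itself.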
Answer 0 (score: 0)
Hey, I looked into the same problem recently. From my experience, you could try the following demo code:
public void onMapReady(GoogleMap googleMap) {
    mMap = googleMap;
    mMap.moveCamera(CameraUpdateFactory.newLatLngZoom(CILACAP, 10f));
    mapWrapperLayout = (MapWrapperLayout)findViewById(R.id.map_relative_layout);
    // MapWrapperLayout initialization
    // 39 - default marker height
    // 20 - offset between the default InfoWindow bottom edge and its content bottom edge
    mapWrapperLayout.init(mMap, getPixelsFromDp(this, 39 + 20));
    // We want to reuse the info window for all the markers,
    // so let's create only one class member instance
    this.infoWindow = (ViewGroup)getLayoutInflater().inflate(R.layout.custom_info_window, null);
    this.infoTitle = (TextView)infoWindow.findViewById(R.id.title);
    this.infoSnippet = (TextView)infoWindow.findViewById(R.id.snippet);
    this.infoButton = (Button)infoWindow.findViewById(R.id.button);
    // Setting custom OnTouchListener which deals with the pressed state
    // so it shows up
    this.infoButtonListener = new OnInfoWindowElemTouchListener(infoButton,
            getResources().getDrawable(R.drawable.button_normal),
            getResources().getDrawable(R.drawable.button_pressed)) {
        @Override
        protected void onClickConfirmed(View v, Marker marker) {
            // Here we can perform some action triggered after clicking the button
            //Toast.makeText(MainActivity.this, "Tombol " + marker.getTitle() + " di click!", Toast.LENGTH_SHORT).show();
            Intent pdf = new Intent(MainActivity.this, showPdf.class);
            pdf.putExtra("title", marker.getTitle());
            startActivity(pdf);
        }
    };
    this.infoButton.setOnTouchListener(infoButtonListener);
    mMap.setInfoWindowAdapter(new GoogleMap.InfoWindowAdapter() {
        @Override
        public View getInfoWindow(Marker marker) {
            return null;
        }
        @Override
        public View getInfoContents(Marker marker) {
            // Setting up the infoWindow with current's marker info
            infoTitle.setText(marker.getTitle());
            infoSnippet.setText(marker.getSnippet());
            //infoImage.setImage(marker.getMarker());
            infoButtonListener.setMarker(marker);
            // We must call this to set the current marker and infoWindow references
            // to the MapWrapperLayout
            mapWrapperLayout.setMarkerWithInfoWindow(marker, infoWindow);
            return infoWindow;
        }
    });
    // Let's add a couple of markers
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Adipala")
            .snippet("Kecamatan Adipala")
            .position(ADIPALA)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Bantarsari")
            .snippet("Kecamatan Bantarsari")
            .position(BANTARSARI)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Binangun")
            .snippet("Kecamatan Binangun")
            .position(BINANGUN)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
    mMap.addMarker(new MarkerOptions()
            .title("Kecamatan Cilacap Utara")
            .snippet("Kecamatan Cilacap Utara")
            .position(CILACAPUTARA)
            .icon(bitmapDescriptorFromVector(getApplicationContext(), R.drawable.iconbuilding))
    );
}
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    mapWrapperLayout = (MapWrapperLayout)findViewById(R.id.map_relative_layout);
    // Obtain the SupportMapFragment and get notified when the map is ready to be used.
    SupportMapFragment mapFragment = (SupportMapFragment) getSupportFragmentManager().findFragmentById(R.id.map);
    mapFragment.getMapAsync(this);
}
Good luck!