I am trying to build feature vectors from a CSV file that contains about 1000 comments. One of my feature vectors is tf-idf, computed with scikit-learn's TfidfVectorizer. Does it make sense to also use raw counts as features, or is there a better feature that I should use?
And if I do end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the tf-idf feature vectors into the model.
Here is my code:
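from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# tf-idf features (currently the only ones passed to KMeans)
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
# raw count features (not used yet)
# count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
# count_vectorized = count_vectorizer.fit_transform(sentence_list)
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)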
Essentially, what you are doing is finding a numeric representation of your text documents (feature engineering). On some problems raw counts work better, and on others the tf-idf representation is the better choice, so you should really try both. The two representations are very similar and carry approximately the same information, but you may still get better clusters by using the full set of features (tf-idf + counts). It is possible that you get closer to the true model by searching in this larger feature space.
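One rough way to "try them both" is to cluster with each representation and compare an internal quality measure. The sketch below uses the silhouette score for that. The toy `sentence_list`, the helper `score_representation`, and the choice of two clusters are assumptions for illustration, not part of your code. Silhouette scores computed in different feature spaces are only loosely comparable, so treat the numbers as a hint rather than a verdict.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the ~1000 comments.
sentence_list = [
    "the battery life of this phone is great",
    "great phone with long battery life",
    "battery drains fast on this phone",
    "the pizza was cold and the service slow",
    "slow service but tasty pizza",
    "cold pizza and rude service",
]


def score_representation(vectorizer, docs, n_clusters=2):
    """Vectorize docs, cluster them with KMeans, and return the silhouette score."""
    X = vectorizer.fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)


scores = {
    "tfidf": score_representation(TfidfVectorizer(stop_words="english"), sentence_list),
    "counts": score_representation(CountVectorizer(stop_words="english"), sentence_list),
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

If you have labels for even a small sample of the comments, an external measure against those labels is usually a more trustworthy comparison than the silhouette score.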
This is how you can horizontally stack your features:
import scipy.sparse
X = scipy.sparse.hstack([vectorized, count_vectorized])
Then you can just do:
model.fit(X, y) # y is optional in some models
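Putting the pieces together for your KMeans case, a minimal end-to-end sketch might look like the following. The toy `sentence_list`, `num_clusters = 2`, and the rescaling step are assumptions of mine. The rescaling is there because TfidfVectorizer L2-normalizes each row by default while raw counts are unbounded, so unscaled counts could dominate the Euclidean distances that KMeans uses. Whether that rescaling helps on your data is something you would have to check.

```python
import scipy.sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical toy corpus standing in for the ~1000 comments.
sentence_list = [
    "the battery life of this phone is great",
    "great phone with long battery life",
    "battery drains fast on this phone",
    "the pizza was cold and the service slow",
    "slow service but tasty pizza",
    "cold pizza and rude service",
]
num_clusters = 2  # assumed value for this sketch

# Build both representations from the same documents.
vectorized = TfidfVectorizer(stop_words="english").fit_transform(sentence_list)
count_vectorized = CountVectorizer(stop_words="english").fit_transform(sentence_list)

# Assumption: rescale each count row to unit L2 length so both blocks
# contribute on a similar scale to the distances KMeans computes.
count_scaled = normalize(count_vectorized, norm="l2")

# Stack the two sparse matrices side by side; CSR is a convenient format for KMeans.
X = scipy.sparse.hstack([vectorized, count_scaled], format="csr")

km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10, random_state=0)
km.fit(X)  # no y needed: KMeans is unsupervised

print(X.shape)
print(km.labels_)
```

Each row of `X` is still one comment; only the number of columns grows, to the sum of the two vocabularies.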