Does it make sense to use both CountVectorizer and TfidfVectorizer as feature vectors for text clust

I am trying to build feature vectors from a CSV file that contains about 1000 comments. One of my feature vectors is TF-IDF, computed with scikit-learn's TfidfVectorizer. Does it make sense to also use raw counts as a feature vector, or is there a better feature vector that I should use?

And if I do end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the TF-IDF feature vectors into the model.

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)

Answer1:

Essentially, what you are doing is finding a numeric representation of your text documents (feature engineering). In some problems raw counts work better, and in others the TF-IDF representation is the better choice, so you should really try both. The two representations are very similar (TF-IDF is a reweighting of the counts) and therefore carry approximately the same information, but you may still get better precision by using the full set of features (tfidf + counts). Searching in this larger feature space may get you closer to the true model.
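To make the "very similar" point concrete, here is a minimal sketch on a toy corpus. The `sentence_list` below is a hypothetical stand-in for the asker's data, and both vectorizers are given the same settings so that they build the same vocabulary. Under those assumptions the two matrices have the same shape and the same nonzero positions; only the values differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the asker's sentence_list.
sentence_list = [
    "the battery life is great",
    "battery drains too fast",
    "great screen and great sound",
    "the screen cracked after a week",
]

# Same settings for both vectorizers, so both learn the same vocabulary.
counts = CountVectorizer(stop_words="english").fit_transform(sentence_list)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentence_list)

# Same vocabulary means the same matrix shape.
same_shape = counts.shape == tfidf.shape

# TF-IDF rescales the counts, so the nonzero positions coincide.
# Converting to dense is fine here only because the corpus is tiny.
same_pattern = np.array_equal(counts.toarray() != 0, tfidf.toarray() != 0)

print(same_shape, same_pattern)
```

Because the two matrices differ only in how each entry is weighted, stacking them adds less new information than two unrelated feature sets would.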

This is how you can horizontally stack your features:

import scipy.sparse
X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can just do:

model.fit(X, y) # y is optional in some models
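For the KMeans case in the question, the pieces above can be put together roughly as follows. This is a sketch, not a tuned pipeline: `sentence_list` and `num_clusters` are hypothetical placeholder values, and `random_state=0` is added only to make the run repeatable.

```python
import scipy.sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical placeholders for the asker's data and cluster count.
sentence_list = [
    "the battery life is great",
    "battery drains too fast",
    "great screen and great sound",
    "the screen cracked after a week",
    "sound quality is poor",
    "fast shipping and great price",
]
num_clusters = 2

# Build both representations from the same documents.
vectorized = TfidfVectorizer(stop_words="english").fit_transform(sentence_list)
count_vectorized = CountVectorizer(stop_words="english").fit_transform(sentence_list)

# Stack column-wise: one row per document, TF-IDF columns followed by count columns.
# hstack returns COO format by default, so convert to CSR for efficient row access.
X = scipy.sparse.hstack([vectorized, count_vectorized]).tocsr()

# KMeans is unsupervised, so fit() takes only X.
km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10, random_state=0)
km.fit(X)

labels = km.labels_  # one cluster index per document
print(X.shape, labels)
```

One thing to watch: TfidfVectorizer L2-normalizes each row by default, while the counts are unscaled, so the count columns can dominate the Euclidean distances that KMeans uses. Rescaling the counts before stacking may be worth trying.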
