Text classification performance

Question:

I am using the TextBlob Python library, but its performance is lacking.

I already serialize the trained classifier and load it before the loop (using pickle).

It currently takes ~0.1 (with the small training data) and ~0.3 on the 33,000-item data set. I need to make it faster. Is that even possible?

<strong>Some code:</strong>

    # Pass trainings before loop, so we can make performance a lot better
    trained_text_classifiers = load_serialized_classifier_trainings(config["ALL_CLASSIFICATORS"])

    # Specify which classifiers are used by which classes
    filter_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["FILTER_CLASSIFICATORS"])
    signal_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["SIGNAL_CLASSIFICATORS"])

    for (url, headers, body) in iter_warc_records(warc_file, **warc_filters):
        start_time = time.time()
        body_text = strip_html(body)

        # Check if url body passes filters, if yes, index, if no, ignore
        if Filter.is_valid(body_text, filter_classifiers):
            print "Indexing", url.url
            resp = indexer.index_document(body, body_text, signal_classifiers, url=url, headers=headers, links=bool(args.save_linkgraph_domains))
        else:
            print "\n"
            print "Filtered out", url.url
            print "\n"
            resp = 0

This is the loop which performs the check on the body and metadata of each record in the WARC file.

There are two text classification checks here.

1) In Filter (very small training data):

    return trained_text_classifiers.classify(body_text) == "True"

2) In index_document (33,000 training items):

    prob_dist = trained_text_classifier.prob_classify(body)
    prob_dist.max()
    # Return the probability of spam
    return round(prob_dist.prob("spam"), 2)

The classify and prob_classify methods are the ones that take the toll on performance.

Answer1:

You can apply feature selection to your data. Good feature selection can cut the number of features by up to 90% while preserving classification performance. In feature selection you pick the top features (in a <strong>bag-of-words</strong> model, the most influential words) and train the model on those words only. This reduces the dimensionality of your data, which also helps against the curse of dimensionality. Here is a good survey: <a href="https://arxiv.org/pdf/1602.02850.pdf" rel="nofollow">Survey on feature selection</a>
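As a rough sketch of what "train the model on those words only" could look like in code: the factory below builds a feature extractor that emits one boolean feature per selected word. The names `make_extractor` and `TOKEN_RE`, and the word list, are made up for illustration; in practice the list would come from a feature-selection step.

```python
import re

# Simple lowercase tokenizer; an assumption, swap in whatever tokenization you already use.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def make_extractor(selected_words):
    """Build an extractor that only looks at a fixed, pre-selected vocabulary."""
    selected = frozenset(selected_words)

    def extract(document):
        # One boolean "contains(word)" feature per selected word.
        tokens = set(TOKEN_RE.findall(document.lower()))
        return {"contains({0})".format(word): (word in tokens) for word in selected}

    return extract


# Hypothetical vocabulary, hand-written here only for the example.
extractor = make_extractor(["free", "winner", "meeting", "invoice"])
features = extractor("You are a WINNER, claim your free prize")
```

TextBlob classifiers accept a `feature_extractor` argument, so a single-argument function of this shape could presumably be passed when training. The number of features per document then equals the size of the selected vocabulary rather than the size of the full training vocabulary.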

In Brief:

Two feature selection approaches are available: filtering and wrapping.

The filtering approach is mostly based on information theory. Search for "mutual information", "chi2" and similar measures for this type of feature selection.
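To make the filtering idea concrete, here is a minimal pure-Python sketch that scores each word with the classic chi-square statistic on a 2x2 table (word present or absent, document in the target class or not) and keeps the top k. The function names and the four toy documents are invented for the example; a library implementation may compute the statistic differently.

```python
from collections import Counter


def chi2_scores(docs, labels, target):
    """Chi-square score of each word against the question 'label == target'.

    docs is a list of token lists, labels the matching list of class labels.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()  # document frequencies per class
    for tokens, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(tokens))

    scores = {}
    for word in set(df_pos) | set(df_neg):
        a = df_pos[word]  # target docs containing the word
        b = df_neg[word]  # other docs containing the word
        c = n_pos - a     # target docs without the word
        d = n_neg - b     # other docs without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = float(n) * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_words(scores, k):
    """Highest-scoring words first; ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data, invented for illustration.
docs = [
    ["free", "prize", "now"],
    ["free", "offer", "now"],
    ["meeting", "notes", "now"],
    ["project", "meeting", "agenda"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi2_scores(docs, labels, target="spam")
best_words = top_k_words(scores, 2)  # ['free', 'meeting']
```

Words that appear in only one class ("free", "meeting") get the highest score, while a word spread across both classes ("now") scores lower and would be dropped first.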

The wrapping approach uses the classification algorithm itself to estimate which features are most important. For example, you select a subset of words and evaluate the classification performance (recall, precision) obtained with it.
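One common way to organize a wrapper is greedy forward selection: start with no features and repeatedly add the word that improves the evaluation score the most. The sketch below assumes an `evaluate` callback; the keyword-rule accuracy used here is only a stand-in for retraining and scoring a real classifier on held-out data, and all names and data are hypothetical.

```python
def greedy_forward_selection(candidates, evaluate, max_features):
    """Add one feature at a time, keeping the one that most improves evaluate()."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining candidate improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy validation set, invented for illustration.
validation = [
    ("free prize now", "spam"),
    ("cheap offer today", "spam"),
    ("meeting notes now", "ham"),
    ("project agenda today", "ham"),
]


def keyword_rule_accuracy(words):
    """Stand-in for a real model: predict 'spam' if any selected word occurs."""
    correct = 0
    for text, label in validation:
        tokens = set(text.split())
        predicted = "spam" if tokens & set(words) else "ham"
        correct += predicted == label
    return correct / float(len(validation))


selected, score = greedy_forward_selection(
    ["now", "free", "cheap", "meeting"], keyword_rule_accuracy, max_features=3
)
```

Because every candidate is evaluated by running the classifier, wrappers are much more expensive than filters, so they are usually applied to a candidate list that a filter has already shortened.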

Some other approaches can also be useful. LSA/LSI can improve both classification performance and running time: <a href="https://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="nofollow">https://en.wikipedia.org/wiki/Latent_semantic_analysis</a>

You can use scikit-learn for feature selection and LSA:

<a href="http://scikit-learn.org/stable/modules/feature_selection.html" rel="nofollow">http://scikit-learn.org/stable/modules/feature_selection.html</a>

<a href="http://scikit-learn.org/stable/modules/decomposition.html" rel="nofollow">http://scikit-learn.org/stable/modules/decomposition.html</a>
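A minimal sketch of both options with scikit-learn, assuming it is installed. The toy corpus, the choice of `k=4` and `n_components=2`, and the downstream classifiers are arbitrary illustration values, not tuned recommendations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only.
texts = [
    "free prize claim now",
    "cheap offer free money",
    "win a free prize today",
    "project meeting agenda",
    "notes from the team meeting",
    "agenda for the project review",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Filter-style selection: keep the k terms with the highest chi2 score.
chi2_model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=4),
    MultinomialNB(),
)
chi2_model.fit(texts, labels)

# LSA: project the TF-IDF vectors onto a few latent dimensions.
# The projected values can be negative, so a linear model is used instead of MultinomialNB.
lsa_model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
lsa_model.fit(texts, labels)

new_docs = ["free prize offer", "agenda for the project meeting"]
chi2_predictions = chi2_model.predict(new_docs)
lsa_predictions = lsa_model.predict(new_docs)
```

In both pipelines the classifier sees far fewer input dimensions than the raw vocabulary, which is where the prediction-time saving would come from.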
