So I am using the TextBlob Python library, but its performance is lacking.
I already serialize the trained classifiers and load them before the loop (using pickle).
Classification currently takes ~0.1 s per document with the small training set and ~0.3 s with the 33,000-example training data. I need to make it faster. Is it even possible?

<strong>Some code:</strong>
# Load the trained classifiers before the loop, so we avoid re-training on every record
trained_text_classifiers = load_serialized_classifier_trainings(config["ALL_CLASSIFICATORS"])

# Specify which classifiers are used by which classes
filter_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["FILTER_CLASSIFICATORS"])
signal_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["SIGNAL_CLASSIFICATORS"])

for (url, headers, body) in iter_warc_records(warc_file, **warc_filters):
    start_time = time.time()
    body_text = strip_html(body)

    # Check if the url body passes the filters; if yes, index it, otherwise ignore it
    if Filter.is_valid(body_text, filter_classifiers):
        print "Indexing", url.url
        resp = indexer.index_document(body, body_text, signal_classifiers, url=url, headers=headers, links=bool(args.save_linkgraph_domains))
    else:
        print "Filtered out", url.url
        resp = 0
This is the loop which performs the check on each of the WARC file's bodies and metadata.
There are two text classification checks here.
1) In Filter (very small training data):

    if trained_text_classifiers.classify(body_text) == "True":
        return True
    else:
        return False
2) In index_document (33,000-example training data):

    prob_dist = trained_text_classifier.prob_classify(body)
    prob_dist.max()
    # Return the probability of spam
    return round(prob_dist.prob("spam"), 2)
The classify and prob_classify calls are what take the toll on performance.

Answer 1:
You can use feature selection on your data. Good feature selection can reduce the number of features by up to 90% while preserving classification performance. In feature selection you keep only the top features (in a <strong>Bag of Words</strong> model, the most influential words) and train the model on those words (features). This reduces the dimensionality of your data (and also helps avoid the Curse of Dimensionality). Here is a good survey: <a href="https://arxiv.org/pdf/1602.02850.pdf" rel="nofollow">Survey on feature selection</a>
Two feature selection approaches are available: filtering and wrapping.

The filtering approach is mostly based on information theory; search for "mutual information", "chi2", etc. for this type of feature selection.

The wrapping approach uses the classification algorithm itself to estimate the most important features: for example, you select some subset of words and evaluate the classification performance (recall, precision).
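A minimal sketch of the filtering approach with scikit-learn's chi2 scorer; the documents and labels here are hypothetical toy data, and k is an arbitrary choice you would tune:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus with spam/ham labels
documents = ["cheap pills buy now", "meeting at noon tomorrow",
             "win a free prize now", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

# Build a Bag of Words matrix, then keep only the k highest-scoring words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)

print(X.shape[1], "->", X_reduced.shape[1])  # far fewer columns after selection
```

You would then train the classifier on `X_reduced` instead of the full matrix, which shrinks both training and per-document classification cost.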
Some other approaches can also be useful. LSA and LSI can improve both classification performance and running time: <a href="https://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="nofollow">https://en.wikipedia.org/wiki/Latent_semantic_analysis</a>
You can use scikit-learn for feature selection and LSA:
<a href="http://scikit-learn.org/stable/modules/feature_selection.html" rel="nofollow">http://scikit-learn.org/stable/modules/feature_selection.html</a>
<a href="http://scikit-learn.org/stable/modules/decomposition.html" rel="nofollow">http://scikit-learn.org/stable/modules/decomposition.html</a>
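A minimal sketch of LSA via scikit-learn's TruncatedSVD over TF-IDF features; the documents and the number of components are hypothetical and would be tuned on real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus
documents = ["cheap pills buy now", "meeting at noon tomorrow",
             "win a free prize now", "project status update"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)

# Project the sparse TF-IDF matrix into a small dense latent space
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)

print(X_lsa.shape)  # (4, 2): one low-dimensional vector per document
```

Classifying on these low-dimensional vectors is typically much cheaper than on the raw Bag of Words features.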