Journal-Archives-Advances in computing & Management


TITLE :	APPLICATION OF FEATURE EXTRACTION TECHNIQUE TO UNSTRUCTURED TEXTS
AUTHORS :	Isabella J, Suresh R.M
ABSTRACT :	Inverse document frequency (IDF), usually combined with term frequency as TF-IDF, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. IDF can also be used for stop-word filtering in various subject fields, including text summarization and classification. Text classification is a semi-supervised machine learning task that automatically assigns a given document to one or more of a set of pre-defined categories based on its textual content and extracted features. Text classification has important applications in content management, contextual search, opinion mining, product review analysis, spam filtering and text sentiment mining. This paper explains the use of inverse document frequency for dealing with unstructured text and handling a large number of attributes, and the application of the K-nearest neighbour classifier to classify the documents. Keywords: Inverse document frequency, IMDb, Mining, KNN classifier and Naive Bayes classifier
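The weighting and classification steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy reviews, the labels, the function names, the choice of log(N / df) as the IDF formula, and the use of cosine similarity for the nearest-neighbour search are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the paper's actual data set is not reproduced here.
TRAIN_DOCS = [
    "this film has a great plot and great acting",
    "this film tells a great and moving story",
    "this film has a dull plot and poor acting",
    "this film tells a dull and poor story",
]
TRAIN_LABELS = ["pos", "pos", "neg", "neg"]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def idf_table(docs):
    """IDF per term, assuming the plain form log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector: raw term count times IDF; unseen terms are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, docs, labels, idf, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = tfidf_vector(text, idf)
    scored = sorted(
        ((cosine(query, tfidf_vector(d, idf)), lab) for d, lab in zip(docs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(lab for _, lab in scored[:k])
    return votes.most_common(1)[0][0]


IDF = idf_table(TRAIN_DOCS)
# Words that occur in every document get IDF 0, so they behave like stop words.
STOP_LIKE = sorted(t for t, w in IDF.items() if w == 0.0)
PREDICTION = knn_classify("a great film", TRAIN_DOCS, TRAIN_LABELS, IDF)
```

In this sketch, terms that appear in every training document receive a weight of zero and so contribute nothing to the similarity score, which is one simple way to read the stop-word filtering mentioned in the abstract.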