text categorization

GC: n

CT: Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set.
This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need of manually organizing document bases, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology.

S: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.1540&rep=rep1&type=pdf (last access: 25 February 2015)

N: 1. Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. For example, news stories are typically organized by subject categories (topics) or geographical codes; academic papers are often classified by technical domains and sub-domains; patient reports in health-care organizations are often indexed from multiple aspects, using taxonomies of disease categories, types of surgical procedures, insurance reimbursement codes and so on. Another widespread application of text categorization is spam filtering, where email messages are classified into the two categories of spam and non-spam, respectively.
2. Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.

S: 1. http://www.scholarpedia.org/article/Text_categorization (last access: 25 February 2015). 2. http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf (last access: 25 February 2015).
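
The spam filtering application described in note 1, where messages are assigned to one of two predefined categories, can be illustrated with a small sketch. The sources cited above do not prescribe a particular method; the multinomial Naive Bayes approach, the class name NaiveBayesCategorizer, the tokenizer, and the four toy training messages below are all assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # all words seen in training

    def train(self, labeled_docs):
        """Learn counts from (text, category) pairs."""
        for text, category in labeled_docs:
            self.doc_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def categorize(self, text):
        """Return the predefined category with the highest log-probability."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training messages, for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("cheap pills free offer", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("please review the project report", "non-spam"),
]

categorizer = NaiveBayesCategorizer()
categorizer.train(TRAINING)
print(categorizer.categorize("free prize offer"))
print(categorizer.categorize("project meeting on monday"))
```

Words that never occurred in training (such as a misspelled or OCR-damaged token) are simply skipped here, which is one crude way of tolerating the textual errors mentioned in note 2; a production system would need far more training data and a more careful treatment of noise.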

SYN: text classification, topic spotting.

S: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.1540&rep=rep1&type=pdf (last access: 25 February 2015)

CR: automatic natural language processing