Unveiling Text Mining Potential: A Comparative Analysis of Document Classification Algorithms

13 pages. Published: March 21, 2024

Abstract

Document classification has grown significantly in importance in recent years, mostly because of the rise in digital data volumes. Because textual documents often contain more than 80% of all information, text mining is perceived to have tremendous commercial potential. Extracting knowledge from these texts is essential for future uses, but the vast volume of files makes this information difficult to obtain. As a result, classifying documents by text analysis has grown in significance since text classification was introduced. To assess the performance of various models, we primarily employed three algorithms and compared their metrics. The dataset was created by extracting condensed information from a variety of textbook genres, including business, social science, and computer science textbooks. To classify textbooks within the same subject group, we used three supervised machine learning techniques in this study: decision trees, random forests, and neural networks. Among these three models, the multilayer perceptron neural network produced the best outcomes.

Keyphrases: document categorization, machine learning, neural networks, text classification, text mining

In: Ajay Bandi, Mohammad Hossain and Ying Jin (editors). Proceedings of 39th International Conference on Computers and Their Applications, vol 98, pages 103--115

BibTeX entry
@inproceedings{CATA2024:Unveiling_Text_Mining_Potential,
  author    = {Sindhuja Penchala and Saydul Akbar Murad and Indranil Roy and Bidyut Gupta and Nick Rahimi},
  title     = {Unveiling Text Mining Potential: A Comparative Analysis of Document Classification Algorithms},
  booktitle = {Proceedings of 39th International Conference on Computers and Their Applications},
  editor    = {Ajay Bandi and Mohammad Hossain and Ying Jin},
  series    = {EPiC Series in Computing},
  volume    = {98},
  pages     = {103--115},
  year      = {2024},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2398-7340},
  url       = {https://easychair.org/publications/paper/RpLs},
  doi       = {10.29007/lsgw}}