Classification of Turkish Documents Using Paragraph Vector
Abstract
Text processing and mining have gained considerable traction recently, driven by rising interest in integrating Natural Language Processing with data analytics algorithms, in particular deep learning models. In this study, newspaper columnists are classified according to vector models built from their columns. This makes it possible not only to determine the author of an unclassified column, but also to form author profiles by grouping similar styles together. The DeepLearning4J Java library and the Doc2Vec class serve as the main deep learning tools for text mining in this work. Vector models of 5, 10, 15, and 20 authors were created from 20,000 newspaper columns. Two Paragraph Vector variants, Distributed Memory (PV-DM) and Distributed Bag of Words (PV-DBOW), were adapted and their performance was compared. The results show that some authors are clearly distinguishable from the others. Such a model can be used for author profile extraction, plagiarism detection, and identifying which authors have similar styles. © 2018 IEEE.
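The attribution step described above can be pictured with a minimal sketch. It assumes that each author is represented by a single vector and that an unclassified column is assigned to the author whose vector is closest in cosine similarity; the abstract does not state the actual decision rule, so this is an illustration only. The function names, author labels, and three-dimensional vectors are hypothetical, and real paragraph vectors would be produced by a trained PV-DM or PV-DBOW model rather than written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_author(column_vector, author_vectors):
    """Return the author label whose vector is most similar to column_vector."""
    return max(
        author_vectors,
        key=lambda name: cosine_similarity(column_vector, author_vectors[name]),
    )


# Hypothetical, hand-written vectors standing in for learned paragraph vectors.
author_vectors = {
    "author_a": [0.9, 0.1, 0.0],
    "author_b": [0.0, 0.8, 0.6],
}
column_vector = [0.7, 0.2, 0.1]  # vector of an unclassified column (made up)

predicted = nearest_author(column_vector, author_vectors)
print(predicted)
```

The same similarity measure could also be applied between pairs of author vectors to group authors with similar styles, which is the profiling use the abstract mentions.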