Can machine learning help criminal investigations by identifying the authors of online texts?


Learning Stylometric Representations for Authorship Analysis

The anonymous nature of the Internet makes it difficult to conclusively identify people online. Network-based identifications, such as those based on an IP address, can be easily disputed. This presents a problem for the investigation of criminal activity. Stylometry is the study of linguistic style in order to differentiate between authors. This type of analysis has been presented in courts in the form of expert testimony to help identify the authors of texts. While authorship analysis is nothing new, previous techniques have relied on the manual selection of linguistic features to be analyzed.

Researchers Ding, Fung, Iqbal, and Cheung have developed models that use machine learning techniques to automate the selection of the linguistic features to be studied. Their proposed models for automated authorship analysis incorporate different sets of linguistic attributes for extraction and processing. These attributes include the content of the writing, contextual word choices, and grammatical and syntactical choices. By using an automated feature learning scheme that is guided by these multi-pronged subsets of linguistic information, the researchers attempted to mitigate the issues related to the manual feature engineering process in current authorship analysis. Their proposed models are designed to capture the differences in writing style between authors across these different modalities.
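To make the three attribute families concrete, the following is a minimal Python sketch that builds one simple count-based profile per family and compares two texts. It is not the researchers' method: their models learn representations automatically, whereas this sketch uses hand-picked counts purely for illustration. The names FUNCTION_WORDS, style_profile, and similarity are hypothetical, the short function-word list is an assumption, and function-word frequency is only a crude stand-in for grammatical and syntactical choices.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative function-word list, not taken from the paper.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that",
                  "it", "is", "was", "but", "with", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_profile(text):
    """Return three count-based views of a text, one per attribute family."""
    tokens = tokenize(text)
    return {
        # Content: which topical (non-function) words appear.
        "content": Counter(t for t in tokens if t not in FUNCTION_WORDS),
        # Context: which words appear next to each other (word bigrams).
        "context": Counter(zip(tokens, tokens[1:])),
        # Syntax (crude proxy): how often each function word is used.
        "syntax": Counter(t for t in tokens if t in FUNCTION_WORDS),
    }


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Average the per-family cosine similarities of two texts."""
    profile_a, profile_b = style_profile(text_a), style_profile(text_b)
    scores = [cosine(profile_a[name], profile_b[name]) for name in profile_a]
    return sum(scores) / len(scores)
```

A verification decision could then be sketched as comparing similarity(known_text, disputed_text) against a threshold chosen on held-out data. Hand-built counts like these are the kind of manual feature engineering that the learned models are intended to replace.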

The researchers used their system to analyze a publicly available database of English texts that contains hundreds of novels and essays and is generally used for training and testing in digital text forensics. They compared their model’s ability to verify an author against that of other existing models. They then evaluated the performance of their model on a publicly available database of Twitter posts.

The researchers’ experiment suggests that their proposed multi-pronged models of analysis outperform preexisting authorship analysis models and are effective and robust across numerous datasets and authorship analysis problems.

Advancements in authorship analysis can assist cybercrime investigations as well as provide analysis techniques for market, social network, and social sciences research. These learning-based authorship analysis models represent an advancement in automated stylometry that may prove valuable in cyber forensics analysis.

Advancements in automatically recognizing authors from the texts they write online may assist cybercrime investigations.