Cybercrime and Authorship Detection in Very Short Texts: A Quantitative Morpho-lexical Approach

Document Type : Original Article

Abstract


The present study proposes an integrated framework that combines letter-pair frequencies and combinations with the lexical features of documents. Drawing on a quantitative morpho-lexical approach, the study tests the hypothesis that letter information, or letter mapping, carries unique stylistic features, and that detecting stable word combinations and morphological patterns can therefore improve authorship detection in very short texts. The data analyzed are a corpus of 12,240 tweets drawn from 87 Twitter accounts. A self-organizing map (SOM) model is used to group input patterns that share common features, on the assumption that tweets assigned to the same class were written by the same author. Results indicate that the classification accuracy of the proposed system is around 76%. Up to 22% of this accuracy was lost, however, when only distinctive words were used, and 26% was lost when classification was based on letter combinations and morphological patterns only. Integrating letter pairs and morphological patterns into the system improved the accuracy of determining the author of a given tweet. This indicates that combining different linguistic variables in a single system leads to better classification of very short texts. It is also clear that the use of the self-organizing map (SOM) led to better clustering performance because of its capacity to integrate two different linguistic levels of each author profile.
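As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds one feature vector per tweet from relative letter-pair frequencies and word frequencies, then groups the vectors with a small self-organizing map. It is not the study's implementation: the function names, the whitespace tokenization, the feature counts (`top_pairs`, `top_words`), the 3x3 grid, and the learning-rate and radius schedules are all assumptions chosen for brevity, and morphological patterns beyond letter pairs are not modelled.

```python
import math
import random
from collections import Counter


def letter_pairs(text):
    """Yield adjacent letter pairs inside each word of a lowercased text."""
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        for a, b in zip(letters, letters[1:]):
            yield a + b


def build_vocab(texts, top_pairs=50, top_words=50):
    """Pick the most frequent letter pairs and words as feature dimensions."""
    pair_counts, word_counts = Counter(), Counter()
    for text in texts:
        pair_counts.update(letter_pairs(text))
        word_counts.update(text.lower().split())
    pairs = [p for p, _ in pair_counts.most_common(top_pairs)]
    words = [w for w, _ in word_counts.most_common(top_words)]
    return pairs, words


def vectorize(text, pairs, words):
    """Concatenate relative letter-pair and word frequencies into one vector."""
    pc = Counter(letter_pairs(text))
    wc = Counter(text.lower().split())
    n_pairs = sum(pc.values()) or 1  # avoid division by zero on empty input
    n_words = sum(wc.values()) or 1
    return [pc[p] / n_pairs for p in pairs] + [wc[w] / n_words for w in words]


def best_unit(grid, vec):
    """Return the grid coordinate whose weights are closest to vec."""
    return min(grid, key=lambda n: sum((w - v) ** 2 for w, v in zip(grid[n], vec)))


def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() * 0.1 for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # assumed linear decay
        radius = max(radius0 * (1 - frac), 0.5)  # assumed shrinking neighbourhood
        for vec in vectors:
            bmu = best_unit(grid, vec)
            for node, weights in grid.items():
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    weights[i] += lr * influence * (vec[i] - weights[i])
    return grid


# Toy input; the study's corpus of tweets is not reproduced here.
tweets = ["the cat sat on the mat", "the cat sat on a mat today", "zzz qqq xxy zzq"]
pairs, words = build_vocab(tweets)
vectors = [vectorize(t, pairs, words) for t in tweets]
som = train_som(vectors)
units = [best_unit(som, v) for v in vectors]  # map unit assigned to each tweet
```

Tweets that land on the same or neighbouring map units have similar letter-pair and lexical profiles, which is the sense in which a cluster is read as a candidate author. Dropping either half of the vector in `vectorize` gives a simple way to compare single-level features against the combined representation.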



Journal of Scientific Research in Arts, Issue 20, 2019, Part 1