Abstract
The rotated delta, although argued to be a theoretically better-grounded distance measure, has so far received no empirical support for its superiority. This study revisits the rotated delta, more commonly known in other fields as the Mahalanobis distance, with two different covariance matrices estimated from training data. The first covariance matrix represents between-author variability and the second within-author variability. A series of likelihood ratio-based authorship verification experiments was carried out with several distance measures. The experiments used documents drawn from a large database of text messages, allowing for a total of 2,160 same-author and 4,663,440 different-author comparisons. The Mahalanobis distance with the between-author covariance matrix performed far worse than the other distance measures, whereas the Mahalanobis distance with the within-author covariance matrix performed better than the other measures. However, its superior performance relative to the cosine distance is conditional on word length and/or the order of the feature vector. Follow-up experiments further showed that the covariance matrix representing within-author variability needs to be trained on a sufficiently large amount of data to outperform the cosine distance: the higher the order of the vector, the more data are required for training. The quantitative results also indicate that the two sources of variability, namely within- and between-author variability, are independent of each other to the extent that the latter cannot accurately approximate the former.
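The distance measures compared in the abstract can be illustrated with a minimal sketch. The snippet below is a hypothetical example, not the authors' implementation: it shows one way a within-author covariance matrix might be estimated from training authors and then used in a Mahalanobis distance, alongside the cosine distance for comparison. The feature vectors, the `docs_by_author` structure, and the ridge regularization are assumptions made for illustration only.

```python
import numpy as np


def within_author_covariance(docs_by_author, ridge=1e-6):
    """Estimate a within-author covariance matrix from training data.

    docs_by_author: list of (n_docs_i, n_features) arrays, one per author.
    Deviations are taken from each author's own mean, so the estimate
    reflects within-author (not between-author) variability.
    """
    deviations = [docs - docs.mean(axis=0, keepdims=True) for docs in docs_by_author]
    cov = np.cov(np.vstack(deviations), rowvar=False)
    # A small ridge keeps the matrix invertible when training data are limited.
    return cov + ridge * np.eye(cov.shape[0])


def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between two feature vectors."""
    diff = x - y
    return float(np.sqrt(diff @ cov_inv @ diff))


def cosine_distance(x, y):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Toy usage with random data standing in for word-frequency features:
# 30 training authors with 10 documents each, 50-dimensional vectors.
rng = np.random.default_rng(0)
training = [rng.normal(size=(10, 50)) for _ in range(30)]
w_cov_inv = np.linalg.inv(within_author_covariance(training))

doc_a, doc_b = rng.normal(size=50), rng.normal(size=50)
print(mahalanobis(doc_a, doc_b, w_cov_inv))
print(cosine_distance(doc_a, doc_b))
```

Presumably, the between-author variant examined in the study would instead estimate the covariance from variation across authors (e.g. author means around a corpus-wide mean); only the covariance-estimation step would change, not the distance computation itself.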
| Original language | English |
| --- | --- |
| Pages (from-to) | 1051–1072 |
| Journal | Digital Scholarship in the Humanities |
| Volume | 37 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 2022 |