Abstract
This study investigates the robustness and stability of a likelihood ratio–based (LR-based) forensic text comparison (FTC) system against the size of background population data. The focus is on a score-based approach for estimating authorship LRs. Each document is represented with a bag-of-words model, and the cosine distance is used as the score-generating function. A set of population data that differed in the number of scores was synthesised 20 times using the Monte Carlo simulation technique. The FTC system's performance with different population sizes was evaluated by a gradient metric of the log-LR cost (Cllr). The experimental results revealed two outcomes: 1) that the score-based approach is rather robust against a small population size, in that, with the scores obtained from 40 to 60 authors in the database, the stability and the performance of the system become fairly comparable to those of the system with the maximum number of authors (720); and 2) that poor performance in terms of Cllr, which occurred because of limited background population data, is largely due to poor calibration. The results also indicated that the score-based approach is more robust against data scarcity than the feature-based approach; however, this finding requires further study.
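
As a rough illustration of two quantities named in the abstract, the sketch below computes a cosine distance between bag-of-words count vectors and the log-LR cost (Cllr) in its standard form, half the sum of the mean of log2(1 + 1/LR) over same-author comparisons and the mean of log2(1 + LR) over different-author comparisons. It is not the study's implementation: the function names, the pre-tokenised list input, and the use of raw word counts are assumptions made for illustration, and the step that converts scores to LRs is omitted.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between the bag-of-words count vectors of two token lists.

    Assumption: documents arrive already tokenised; raw counts are used as weights.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words shared by both documents contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Treat an empty document as maximally distant (an illustrative choice).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cllr(same_author_lrs, diff_author_lrs):
    """Log-LR cost: lower is better; a system that always outputs LR = 1 scores 1.0."""
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (same_term + diff_term)
```

In a score-based system, distances such as these would first be mapped to LRs by modelling the same-author and different-author score distributions from the background population data; Cllr is then computed on the resulting LRs.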
Original language | English |
---|---|
Pages | 1-11 |
Publication status | Published - 2020 |
Event | The Australasian Language Technology Association Workshop 2020 (Virtual); Duration: 1 Jan 2020 → … |