This study investigates the effect of within-speaker sample size (token number) on a likelihood-ratio (LR) based forensic voice comparison (FVC) system. In particular, it examines how, and to what extent, a mismatch in token number between the test/development databases and the background database affects the performance of the FVC system, using spectral feature vectors extracted from spontaneous Japanese speech. A series of experiments was carried out using Monte Carlo simulation. LRs were estimated using the multivariate kernel density formula, and the resulting LRs were calibrated using logistic-regression calibration. System performance was assessed in terms of the log-likelihood-ratio cost (Cllr). The results demonstrate that (i) regardless of the token number in the test/development databases, the system generally performs better with more tokens in the background database, although with six or more tokens in the background database the improvement in performance is marginal, if any, and that (ii) having only two tokens in the background database degrades system performance considerably when there are four or more tokens in the test/development databases. Some implications of the results are discussed.
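
For readers unfamiliar with the metric, the log-likelihood-ratio cost can be sketched as follows. This is a minimal illustration of the commonly used Cllr definition, not the code used in this study; the function name `cllr` and the example LR values are assumptions introduced purely for illustration.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost for two lists of (non-log) likelihood ratios.

    same_speaker_lrs: LRs from comparisons where both samples share a speaker.
    different_speaker_lrs: LRs from comparisons of different speakers.
    Both names are hypothetical; all LRs are assumed to be strictly positive.
    """
    # Same-speaker comparisons are penalised when the LR is small.
    ss_penalty = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs)
    ss_penalty /= len(same_speaker_lrs)

    # Different-speaker comparisons are penalised when the LR is large.
    ds_penalty = sum(math.log2(1 + lr) for lr in different_speaker_lrs)
    ds_penalty /= len(different_speaker_lrs)

    return 0.5 * (ss_penalty + ds_penalty)


# Made-up LR values, for illustration only.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
informative = cllr([100.0, 50.0], [0.01, 0.02])
```

A system that always outputs LR = 1 carries no information and yields a Cllr of exactly 1; values closer to 0 indicate better performance, and values above 1 indicate LRs that are misleading on average.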