Published

2016-07-01

Bayesian Analysis of the Heterogeneity of Literary Style

Análisis bayesiano de la heterogeneidad del estilo literario

DOI:

https://doi.org/10.15446/rce.v39n2.50151

Keywords:

Authorship, Cluster analysis, Multinomial distribution (en)
Análisis de conglomerados, Atribución, Distribución multinomial (es)

Authors

  • Xavier Puig, Universitat Politècnica de Catalunya
  • Marti Font, Statistics and O. R. Department, Technical University of Catalonia, Barcelona, Spain
  • Josep Ginebra, Statistics and O. R. Department, Technical University of Catalonia, Barcelona, Spain

We propose a statistical analysis of the heterogeneity of literary style in a set of texts that simultaneously uses several stylometric characteristics, such as word length and the frequency of function words. The data set consists of several tables with the same number of rows, where the i-th row of every table corresponds to the i-th text. The proposed analysis clusters the rows of all these tables simultaneously into groups of homogeneous style, based on a finite mixture of sets of multinomial models, one set for each table.
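One way to read "a finite mixture of sets of multinomial models, one set for each table" is the likelihood sketched below. The notation is an assumption and is not taken from this page: T is the number of tables, K the number of style groups, ω the group weights, θ the category probabilities, and the tables are assumed conditionally independent within a group.

```latex
% Assumed notation: y_i^{(t)} is row i of table t (counts for text i),
% N_i^{(t)} is its total, and theta_k^{(t)} holds the category
% probabilities of style group k in table t.
\[
p\bigl(y_i^{(1)},\dots,y_i^{(T)} \mid \omega,\theta\bigr)
  = \sum_{k=1}^{K} \omega_k \prod_{t=1}^{T}
    \mathrm{Mult}\bigl(y_i^{(t)} \mid N_i^{(t)},\, \theta_k^{(t)}\bigr),
\qquad \omega_k \ge 0,\quad \sum_{k=1}^{K}\omega_k = 1,
\]
\[
\mathrm{Mult}(y \mid N,\theta)
  = \frac{N!}{\prod_{j} y_j!}\,\prod_{j}\theta_j^{\,y_j}.
\]
```

Under this reading, the text size enters through the totals N_i^{(t)}, and the counts are modeled directly as discrete data rather than being converted to proportions first.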

 

Unlike the usual heuristic cluster analysis approaches, our method naturally incorporates the text size, the discrete nature of the data, and the dependence between categories. The model is checked and chosen with the help of posterior predictive checks, together with closed-form expressions for the posterior probability that each of the models considered is appropriate. This is illustrated through an analysis of the heterogeneity in Shakespeare’s plays and by revisiting the authorship-attribution problem of Tirant lo Blanc.
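The idea of a posterior predictive check can be illustrated with a minimal sketch for a single table. It is not the authors' implementation: the symmetric Dirichlet prior, the Pearson-type discrepancy, the function names, and the two small count tables are all assumptions made for illustration. The check asks whether the rows of a table are compatible with one shared multinomial, which is the one-group case of the mixture.

```python
import numpy as np


def chi_square_discrepancy(counts, theta):
    """Pearson-type discrepancy between counts and expected counts N_i * theta."""
    totals = counts.sum(axis=1, keepdims=True)
    expected = totals * theta
    return float(((counts - expected) ** 2 / expected).sum())


def posterior_predictive_pvalue(counts, alpha=1.0, n_draws=2000, seed=0):
    """Posterior predictive p-value for 'all rows share one multinomial'.

    Assumes a symmetric Dirichlet(alpha) prior on the category
    probabilities, so the posterior is Dirichlet(alpha + column totals).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    totals = counts.sum(axis=1)
    posterior_alpha = alpha + counts.sum(axis=0)

    exceed = 0
    for _ in range(n_draws):
        # Draw category probabilities from the posterior.
        theta = rng.dirichlet(posterior_alpha)
        # Simulate a replicated table with the same row totals (text sizes).
        replicated = np.vstack([rng.multinomial(n, theta) for n in totals])
        # Compare replicated and observed discrepancies under the same draw.
        if chi_square_discrepancy(replicated, theta) >= chi_square_discrepancy(counts, theta):
            exceed += 1
    return exceed / n_draws


# Hypothetical count tables (rows = texts, columns = categories); not real data.
homogeneous = np.array([[30, 50, 20], [62, 98, 40], [14, 26, 10]])
heterogeneous = np.array([[80, 15, 5], [10, 20, 70], [30, 50, 20]])

p_hom = posterior_predictive_pvalue(homogeneous)
p_het = posterior_predictive_pvalue(heterogeneous)
print("homogeneous table p-value:", p_hom)
print("heterogeneous table p-value:", p_het)
```

A p-value close to zero suggests that a single multinomial does not reproduce the observed variation between rows, which is the kind of evidence that would motivate a model with more than one style group.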


MARTI FONT (1), XAVIER PUIG (2), JOSEP GINEBRA (3)

(1) Technical University of Catalonia, Statistics and O. R. Department, Barcelona, Spain. Professor. Email: marti.font@upc.edu
(2) Technical University of Catalonia, Statistics and O. R. Department, Barcelona, Spain. Professor. Email: xavier.puig@upc.edu
(3) Technical University of Catalonia, Statistics and O. R. Department, Barcelona, Spain. Professor. Email: josep.ginebra@upc.edu




Full text available in PDF



[Received April 2015. Accepted January 2016]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv39n2a04,
    AUTHOR  = {Font, Marti and Puig, Xavier and Ginebra, Josep},
    TITLE   = {{Bayesian Analysis of the Heterogeneity of Literary Style}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2016},
    volume  = {39},
    number  = {2},
    pages   = {205-227}
}

How to Cite

APA

Puig, X., Font, M. and Ginebra, J. (2016). Bayesian Analysis of the Heterogeneity of Literary Style. Revista Colombiana de Estadística, 39(2), 205–227. https://doi.org/10.15446/rce.v39n2.50151


CrossRef Cited-by

CrossRef citations: 1

1. Karina Gibert, Yaroslav Hernandez-Potiomkin. (2023). A Unified Formal Framework for Factorial and Probabilistic Topic Modelling. Mathematics, 11(20), p.4375. https://doi.org/10.3390/math11204375.
