
Machine Learning for Textual Analysis

Stephan Schulz

Historical texts often have an interesting and sometimes controversial history and origin. Examples include the contested authorship of the works of William Shakespeare, the authorship of the Federalist Papers during the founding of the United States, and the history of forged documents such as the Donatio Constantini. Similar problems arise when discussing the provenance of source code, e.g. when attributing authorship of malware or determining original authorship and copyright status, as in the SCO/IBM Linux dispute.

In this series of projects, we will try to use modern machine learning techniques and libraries to extract important properties of texts, to identify similarities and differences, and possibly even to ascribe authorship.
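As a rough illustration of what "extracting properties" and "identifying similarities" could mean in practice, the following minimal sketch computes relative frequencies of a few function words and compares two texts by cosine similarity. The word list and the helper names `profile` and `cosine` are assumptions made for this example only; they are not part of the project plan, and a real study would use a much larger, language-specific feature set.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of English function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "he", "unto"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Two texts could then be compared with `cosine(profile(text_a), profile(text_b))`; values closer to 1 indicate more similar function-word usage under this simple measure.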

In particular, we will explore how far standard machine learning approaches (clustering, decision trees, multi-layer perceptrons, deep learning) from standard libraries such as TensorFlow and scikit-learn can be used to recover known relationships between texts, and possibly even to test and support different hypotheses.
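To give an idea of how one of the approaches named above (clustering) might look with scikit-learn, here is a small sketch. It is only one possible setup, not the project's chosen method: the feature choice (TF-IDF over character n-grams), the use of agglomerative clustering, the function name `cluster_texts`, and the four made-up example sentences are all assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up placeholder snippets, not real corpus data.
texts = [
    "the ship sailed across the sea to the harbour",
    "a ship left the harbour and sailed the sea",
    "the compiler parses the source code into a tree",
    "source code is parsed by the compiler into a tree",
]


def cluster_texts(docs, n_clusters=2):
    """Group documents by TF-IDF weighted character n-grams."""
    # Character n-grams within word boundaries; an assumed feature choice.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    features = vectorizer.fit_transform(docs).toarray()  # dense input for Ward linkage
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(features)


labels = cluster_texts(texts)
```

Documents that receive the same label were grouped together. On a real corpus, the interesting question is whether such groups line up with relationships already known from existing scholarship.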

As an example text, we will use the Bible in the German Luther translation and the English King James Version. The Bible is a collection of historical texts with very different backgrounds, and a large body of existing analysis provides a certain degree of ground truth. For instance, most of the Old Testament was originally written in Classical Hebrew, while most of the New Testament was written in Koine Greek. In the Old Testament, the documentary hypothesis claims different authorship for different parts of the Pentateuch. Within the New Testament, the three synoptic gospels (Matthew, Mark and Luke) seem to have a shared history with common borrowings, while the Gospel of John is a largely independent creation. Similarly, some of the Letters of Paul are believed to have been written by the apostle Paul, while others are pseudepigraphic, and still others are of unclear provenance.

Example questions to analyze include:

From the computer science side of the project, relevant questions include:

Subtasks include the following:

Depending on the outcome, we will try to publish some of the results at a suitable AI conference. The experience gained may also be used to design a new course on machine learning and automated text analysis.

Literature

Abadi, Martin, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” arXiv Preprint arXiv:1603.04467. https://arxiv.org/abs/1603.04467.

Conneau, Alexis, Holger Schwenk, Loïc Barrault, and Yann LeCun. 2017. “Very Deep Convolutional Networks for Text Classification.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 1:1107–16. http://www.aclweb.org/anthology/E17-1104.

Ehrman, Bart D. 2015. The New Testament: A Historical Introduction to the Early Christian Writings. 6th ed. Oxford University Press.

Forman, George. 2003. “An Extensive Empirical Study of Feature Selection Metrics for Text Classification.” Journal of Machine Learning Research 3 (Mar): 1289–1305. http://www.jmlr.org/papers/volume3/forman03a/forman03a.pdf.

Hayes, Christine. 2012. Introduction to the Bible. Yale University Press.

Iqbal, Farkhund, Hamad Binsalleeh, Benjamin C.M. Fung, and Mourad Debbabi. 2013. “A Unified Data Mining Solution for Authorship Analysis in Anonymous Textual Communications.” Information Sciences 231: 98–112. https://spectrum.library.concordia.ca/976945/1/fung2011b.pdf.

Martin, Dale B. 2012. New Testament History and Literature. Yale University Press.

Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (Oct): 2825–30. http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.

Stamatatos, Efstathios. 2009. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60 (3): 538–56. https://www.clips.uantwerpen.be/stylometry/Lit/Stamatatos_survey2009.pdf.


DHBW Stuttgart, Prof. Dr. Stephan Schulz