Text analysis methods based on word co-occurrence have yielded useful results in humanities and social sciences research. For instance, Venturini et al. (2012) describe the use of concept co-occurrence networks in the social sciences, and Grimmer and Stewart (2013) survey clustering and topic modeling applied to political science corpora. While these methods provide a useful overview of a corpus, they cannot determine the predicates relating co-occurring elements to each other. For instance, if France and the phrase binding commitments co-occur within a sentence, how are the two elements related? Is France in favour of, or against, binding commitments?
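The limitation described above can be illustrated with a minimal sketch of sentence-level co-occurrence counting (the sentences and term list below are invented for illustration and are not from the paper's corpus):

```python
# Minimal sketch: count how often pairs of terms co-occur in a sentence.
# The counts capture association, but not the predicate linking the terms.
from itertools import combinations
from collections import Counter

sentences = [
    "France rejects binding commitments on emissions",
    "Canada supports binding commitments",
    "France and Canada discuss emissions",
]
terms = {"france", "canada", "binding commitments", "emissions"}

counts = Counter()
for sent in sentences:
    s = sent.lower()
    present = sorted(t for t in terms if t in s)
    for a, b in combinations(present, 2):
        counts[(a, b)] += 1

# "france" and "binding commitments" co-occur once, but the count alone
# cannot tell us that France *rejects* the commitments.
print(counts[("binding commitments", "france")])
```

The pair count is symmetric and predicate-blind: it treats "France rejects binding commitments" and "France supports binding commitments" identically, which is exactly the gap the system described below addresses.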
Our system identifies points supported and opposed by negotiating actors and extracts keyphrases and DBpedia concepts from those points. The results are displayed on an interface, allowing for a comparison of different actors' positions. The system helps address a current need in digital humanities: tools for the quantitative analysis of textual structures beyond word co-occurrence.
Title: Mapping the Bentham Corpus
Authors: Estelle Tieberghien, Pablo Ruiz Fabo, Frédérique Mélanie-Bécquet, Thierry Poibeau, Tim Causer, Melissa Terras
Category: Long Paper
Keywords: Bentham, corpus visualization, knowledge discovery
http://apps.lattice.cnrs.fr/benthamdev/about.html (development site, to be moved to a suitable server before Innovative Big Data)
The exploration of large corpora in the Humanities is a well-known problem for today's scholars. For example, the recent PoliInformatics challenge addressed the issue by promoting a framework for developing new and original research in text-rich domains (the project focused on political science, but the approach can be extended to any sub-field within the Humanities). Specific experiments have recently been carried out in the field of philosophy, but they mainly concern the analysis of metadata, such as indexes or references (Lamarra and Tardella, 2014; Sula and Dean, 2014). Other experiments have nevertheless involved the exploration of large amounts of textual data (see e.g. Diesner and Carley, 2005 on the Enron corpus) with relevant visualization interfaces (Yanhua et al., 2009). In this demonstration, we propose to explore more advanced natural language processing techniques to extract keywords and filter them according to an external ontology, so as to obtain a more relevant indexing of the documents before visualization. We also explore dynamic representations, which were not addressed in the above-mentioned studies.
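The keyword extraction and ontology filtering pipeline can be sketched as follows. This is a naive frequency-based illustration, not the paper's actual method: the real system uses more elaborate extraction, and the ontology is modeled here simply as a set of accepted concept labels (all texts and labels below are invented examples):

```python
# Illustrative sketch: extract candidate keywords by frequency, then keep
# only those that map to a concept in an external ontology. Both steps are
# deliberately simplified stand-ins for the real pipeline.
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "on"}

def extract_keywords(text, top_n=5):
    """Rank candidate keywords by raw frequency (a naive baseline)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS]
    return [w for w, _ in Counter(candidates).most_common(top_n)]

def filter_by_ontology(keywords, ontology_labels):
    """Keep only keywords matching a concept label in the ontology."""
    return [k for k in keywords if k in ontology_labels]

text = ("Utility and punishment are central notions; punishment must be "
        "justified by utility, not by custom.")
ontology = {"utility", "punishment", "legislation"}
print(filter_by_ontology(extract_keywords(text), ontology))
```

Filtering against the ontology discards frequent but uninformative candidates (here, function words like "by" that survive a short stopword list), which is what makes the resulting document index more relevant for visualization.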
LATTICE can also provide basic analysis tools, such as a state-of-the-art morphosyntactic tagger for French (the SEM tagger).