Reflected Text Analysis beyond Linguistics
From September 9 to 13, I will be giving a class on Reflected Text Analysis beyond Linguistics, as part of the DGfS-CL fall school 2019 at the IMS at Stuttgart University. The class is also part of the CRETA Coaching.
This post serves as the course page, containing the material, agenda, etc.
Agenda
Day | 14:00-15:30 | | 16:00-17:30 |
---|---|---|---|
Monday | Introduction, overview, annotation | ☕ | Annotation exercise, inter-annotator agreement |
Tuesday | Machine learning overview and evaluation, algorithms | ☕ | Algorithms |
Wednesday | Introduction to the shared task, hands-on session | ☕ | Hands-on session |
Thursday | Excursion to the German Literature Archive, Marbach (starting at 1pm!) | | |
Friday | Hands-on session, shared task evaluation | ☕ | What to do next, closing discussion |
Material
Participants are asked to install the following on their computers (this can be done during the first day of the class):
Python
- Python: If your computer already has Python 2, there is no need to update. If you’re installing Python from scratch, please use Python 3.
- pip: The Python package manager
- The Python libraries `nltk` and `requests`.
Detailed instructions for Windows, Mac OS X and Linux can be found here (PDF file). The file test_install.py can be used to test the installation.
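For a quick manual check, here is a minimal sketch that tests whether the two libraries can be imported. It is not the test_install.py file mentioned above; the function name `check_installation` is an assumption made for illustration. The sketch is written to run under both Python 2 and Python 3.

```python
import importlib
import sys

# Libraries the class asks participants to install.
REQUIRED = ["nltk", "requests"]


def check_installation(packages=REQUIRED):
    """Return a dict mapping each package name to True if it can be imported."""
    status = {}
    for name in packages:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    print("Python version: " + sys.version.split()[0])
    for name, ok in sorted(check_installation().items()):
        print(name + ": " + ("found" if ok else "MISSING"))
```

If a library is reported as missing, installing it with pip (for example `pip install nltk requests`) and re-running the script should resolve it.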
Text Editor
For editing Python files, participants will need a plain text editor. We recommend the following:
Slides
Monday
- Slides
- Example annotation guidelines: STTS tag set (German parts of speech), Penn Treebank tag set (English parts of speech)
- Texts for annotation exercise: Lewis Carroll: Alice in Wonderland, chapter 11, Jules Verne: Around the World in 80 Days, chapter 13, Mary Rowlandson: Narrative of the Captivity and Restoration of Mrs. Mary Rowlandson
Tuesday
Wednesday
- Slides on shared tasks and hackatorial
- Hackatorial package: Please download the zip file and extract it into a directory on your drive. The zip file contains:
  - Data with annotated entity references (subdirectory `data`)
  - Code for training, testing and uploading (subdirectory `code`)
  - Resources used for feature extraction (subdirectory `static`)
- List of implemented features
Friday
- Slides on Hackatorial evaluation
- Slides on what to do next
- Hackatorial results
Projects (for ECTS credit points)
If you’re interested in getting ECTS credit points for taking part in this class, you’ll need to conduct a small project, according to the following recipe (unless we agreed on a different plan):
- Pick a task (e.g., part of speech tagging)
- Pick a non-standard text that is not too long (e.g., a poem)
- Create a gold standard by applying the annotation guidelines for the task
- Apply an existing tool for the task
- Evaluate the tool against your annotations
- Either
  - Develop hypotheses for improving/adapting the tool, or
  - Retrain the tool on existing training data and your own corpus
    - Re-evaluate it after adding your own data
- Write a brief report on this and send it to me
Your project should be finished (and the report sent to me) before October 14.
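The evaluation step of the recipe can be sketched in a few lines of Python. This is one possible way to do it, not a required method: it assumes that your gold standard and the tool output are two lists of tags aligned token by token, and it reports overall accuracy plus precision, recall and F1 per tag. The function name `evaluate` and the tags in the example are hypothetical.

```python
from __future__ import division  # true division under Python 2 as well

from collections import Counter


def evaluate(gold, predicted):
    """Compare predicted tags against gold tags, token by token.

    Returns (accuracy, per_tag), where per_tag maps each tag to a
    (precision, recall, f1) tuple.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the tool predicted p where it should not have
            fn[g] += 1  # the tool missed an instance of g

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0

    per_tag = {}
    for tag in sorted(set(gold) | set(predicted)):
        predicted_count = tp[tag] + fp[tag]
        gold_count = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_count if predicted_count else 0.0
        recall = tp[tag] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        per_tag[tag] = (precision, recall, f1)
    return accuracy, per_tag


if __name__ == "__main__":
    # Hypothetical example: five tokens, the tool gets three of them right.
    gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    predicted = ["DET", "NOUN", "NOUN", "DET", "VERB"]
    accuracy, per_tag = evaluate(gold, predicted)
    print("Accuracy: %.2f" % accuracy)
    for tag, (p, r, f) in per_tag.items():
        print("%s: P=%.2f R=%.2f F1=%.2f" % (tag, p, r, f))
```

Running the same function before and after retraining gives directly comparable numbers for the re-evaluation step, provided the tokenization of the tool output matches that of your annotations.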