Workshop Outline and Notes for Text Analysis with NLTK
Many of the exercises are from the NLTK Book, which is a great introduction to text analysis.
Major steps in doing language analysis
This session focuses on turning the data into numbers. There are lots of ways to do this, both programmatic and non-programmatic.
Q--> What parts of language (spoken, written, texted) can you count? What numbers can you come up with beyond counting?
Before we cover this, we have to talk about data types
my_string = "I am a Digital Researcher!"
my_list = [] (makes an empty list)
love_list = ['love', 'hope', 'joy', 'amor']
my_dict = {} (makes an empty dictionary)
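A quick sketch of how these three data types behave, since everything later in the workshop is built from them. The extra word 'peace' and the counts stored in my_dict are made up for illustration.

```python
# A string is a sequence of characters.
my_string = "I am a Digital Researcher!"
print(len(my_string))        # number of characters, spaces included

# split() turns a string into a list of whitespace-separated pieces.
words = my_string.split()
print(words)

# A list is an ordered collection; you can index it and add to it.
love_list = ['love', 'hope', 'joy', 'amor']
print(love_list[0])          # first item in the list
love_list.append('peace')    # hypothetical extra word

# A dictionary maps keys to values, for example a word to a count.
my_dict = {}
my_dict['love'] = 3          # made-up count
my_dict['hope'] = 1          # made-up count
print(my_dict['love'])
```

Note that split() leaves the "!" attached to "Researcher!", which is one reason we use a real tokenizer later on.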
NLTK Functions
Very helpful cheatsheet from Princeton
import nltk
from nltk.book import *
i. Concordance - shows the context that the word occurs in
ii. Similar - shows other words that appear in the same kinds of contexts as the word you give it. **Very illuminating if looking at 2 different texts - how does one person use "love" versus another?
iii. Common contexts - used to compare two words - in what environments do they both occur?
Q--> The syntax changed here - we had to use brackets when we gave it the words to look for - WHY?!?
text1.common_contexts([WORD, WORD])
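A minimal sketch of the three functions above on a tiny stand-in text, so it runs without downloading the book corpus. The token list and the name small_text are invented for this example; with the book loaded you would call the same methods on text1.

```python
import nltk

# Stand-in tokens (made up) so the example runs without the book corpus.
tokens = ("I love the sea and you love the sea but "
          "I hope the sea will stay calm").split()
small_text = nltk.Text(tokens)

# concordance() prints each hit together with its surrounding context.
small_text.concordance("love")

# concordance_list() returns the same hits as a list instead of printing them.
hits = small_text.concordance_list("love")
print(len(hits))

# similar() takes ONE word (a string) and prints words used in similar contexts.
small_text.similar("love")

# common_contexts() takes a LIST of words, which is why it needs square brackets.
small_text.common_contexts(["love", "hope"])
```

On a text this small the results are not meaningful; the point is only the shape of each call.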
iv. lexical dispersion plot - good for plotting word use over time or throughout the course of a book
** A pop-up will appear with the dispersion plot. You can save this if you want. This MUST BE CLOSED TO MOVE ON
v. Count a specific word - how many times does this sequence of characters occur in my document?
vi. Count tokens - tokens are sequences of characters (i.e., words and punctuation, so "love", "bowie", "Bowie", "!" and ":)" are tokens)
vii. Count unique words - first have to make a set that groups all the "words" together (numbers, punctuation sequences, etc.) - this groups together types. Token = instance, Type = more general ("bowie" and "Bowie" are different types)
(you might want to sort this set if you want to organize it)
viii. Lexical Density! We have a number!! The number of unique tokens divided by the total number of tokens. This is a descriptive measure of language register. (The NLTK Book calls this same ratio lexical diversity.)
ix. Frequency Distribution! We have another number!! 1. First make an object that Python can look at:
my_dist = FreqDist(text1)
I like to check if it's there
my_dist.most_common(100) - gives the top 100 words and their numbers
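Here is a small sketch of the frequency distribution steps on a made-up token list, so you can see what the object holds before trying it on a whole book. FreqDist behaves much like Python's collections.Counter.

```python
from nltk import FreqDist

# Made-up tokens standing in for text1.
tokens = "the sea and the sky and the shore".split()
my_dist = FreqDist(tokens)

print(my_dist)                  # short summary of the distribution
print(my_dist.most_common(2))   # the two most frequent tokens with their counts
print(my_dist['the'])           # the count for a single token
print(my_dist.N())              # total number of tokens that were counted
```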
We could go on and on with things you can do.
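The counting steps from the list above (one word, all tokens, unique types, and the ratio between them) can be sketched on a plain Python list. The token list is invented for illustration; an NLTK Text such as text1 accepts the same count(), len() and set() calls.

```python
# Stand-in token list (made up); text1 would work the same way.
tokens = ['Bowie', 'sang', 'and', 'bowie', 'fans', 'sang', 'along', '!']

word_count = tokens.count('sang')   # occurrences of one exact token
n_tokens = len(tokens)              # every token, repeats included
types = sorted(set(tokens))         # unique tokens, sorted for easier reading
n_types = len(types)

# Unique tokens divided by total tokens.
lexical_density = n_types / n_tokens

print(word_count, n_tokens, n_types, lexical_density)
```

'Bowie' and 'bowie' are counted as two types here. If you want them merged, lowercase the tokens before making the set.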
c. Pythonic non-NLTK Functions
ii. Intro sentiment analysis - the love text example
Let's say I am interested in words that I think are related to love and I want to check if they occur in the text.
love_list = []
love_words = ['love', 'joy', 'hope', 'amor']
for word in love_words:
    if word in text1:
        love_list.append(word)  # keep the love words that occur in text1
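A self-contained variant of the loop above that also records how often each word occurs. The token list is a stand-in and love_counts is a hypothetical name; swap in text1 or your own tokens.

```python
# Stand-in tokens (made up); replace with text1 or your own token list.
tokens = "all you need is love , love is all you need and hope".split()
love_words = ['love', 'joy', 'hope', 'amor']

# Map each love word that occurs to the number of times it occurs.
love_counts = {}
for word in love_words:
    if word in tokens:
        love_counts[word] = tokens.count(word)

print(love_counts)
```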
d. Regular Expressions - used for pattern matching and data cleaning. They are worth mentioning, but we are not going to discuss them here.
b. Making an NLTK Text
i. From the internet - We're going to use Don Quixote from Project Gutenberg
from urllib.request import urlopen
url = u""
raw = urlopen(url).read()
raw = raw.decode("utf-8", "ignore")
Reading the URL is all that you SHOULD have to do, but read() gives you raw bytes rather than text, and if you stop there, non-ASCII characters (Don Quixote has plenty) can cause encoding errors later on. The decode line is a little piece of code to fix it: it turns the bytes into a string and ignores any bytes that are not valid UTF-8. For more on that and how it relates to Python, visit the Python docs. For the purpose of this workshop, put in this last piece of code. Check to be sure it's right.
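To see what the decode step does without going to the network, here is a sketch on a hand-made bytes object. The sample text and the stray 0xff byte are invented to stand in for what urlopen(url).read() might return.

```python
# Bytes as they might arrive from a download: UTF-8 text plus one stray
# byte (0xff) that is not valid UTF-8. Both are made up for this example.
raw_bytes = "Don Quijote, año 1605".encode("utf-8") + b"\xff"

# Strict decoding stops at the first byte it cannot interpret.
try:
    raw_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

# "ignore" drops the undecodable bytes and keeps everything else.
raw = raw_bytes.decode("utf-8", "ignore")
print(raw)
```

The trade-off with "ignore" is that bad bytes disappear silently, which is fine for a workshop but worth knowing about for real projects.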
ii. Now need to break this giant string into a list of things we recognize - tokens (words, punctuation, etc.)
tokens = nltk.word_tokenize(raw)
iii. This makes a list of tokens - let's check to make sure it's correct
iv. This is enough to read the text in and work with it yourself, but to use the NLTK features, you need to make it an NLTK Text
dq_text = nltk.Text(tokens)
v. Check to be sure it worked
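The string-to-tokens-to-Text steps can be sketched on a short made-up string. One swap to be aware of: this uses wordpunct_tokenize, a simple regular-expression tokenizer that needs no extra data, instead of nltk.word_tokenize, which requires a one-time download of its tokenizer models. The two can split some strings differently.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Made-up stand-in for the big raw string downloaded above.
raw = "Sancho laughed. Don Quixote did not!"

# Break the string into tokens (words and punctuation).
tokens = wordpunct_tokenize(raw)
print(tokens[:5])        # check the first few tokens

# Wrap the token list so the NLTK Text methods become available.
dq_text = nltk.Text(tokens)
print(type(dq_text))     # check that it really is an NLTK Text
print(len(dq_text))      # number of tokens
```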
vi. Now let's use our favorite new techniques.
vii. From your own files - exactly the same, you just have to read the file in.
infile = open("PATH", "r")
my_file = infile.read()
infile.close()
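A sketch of reading your own file, assuming it is saved as UTF-8. It writes a small temporary file first so the example runs anywhere; with real data you would skip that part and open your own path. The "with" form closes the file for you.

```python
import os
import tempfile

# Write a small stand-in file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Call me Ishmael.")
    path = tmp.name

# Read the whole file into one string; the file closes when the block ends.
with open(path, "r", encoding="utf-8") as infile:
    my_file = infile.read()

os.remove(path)          # clean up the stand-in file
print(my_file)
```

From here my_file plays the same role as raw did for the downloaded text: tokenize it, then make an NLTK Text.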
From here, the whole world is open to you... For more information and ways to work with text, refer to the NLTK Book! If you start working in Python to do text analysis, come to the PUG or Office Hours - there is so much more to learn!!