NLTK - Natural Language Toolkit
NLTK / text processing use cases:
- Document similarity detection (e.g. scientific document search: "I want to know who in my company worked on a concept similar to the one I'm working on right now")
- Sentiment analysis (e.g. are most tweets about the president of the USA positive or negative? See the sketch after this list.)
- Document clustering/categorizing
- Phrase/meaning extraction (http://www.wolframalpha.com/)
- Summarization (http://summly.com/index.html)
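As a quick taste of the sentiment-analysis use case, here is a minimal sketch using the VADER analyzer bundled with NLTK (installation instructions follow below; the sample tweet is made up for illustration):
>>> import nltk
>>> nltk.download('vader_lexicon')      # one-time download of the VADER lexicon
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> sia = SentimentIntensityAnalyzer()
>>> sia.polarity_scores("What a great speech!")['compound']   # compound ranges from -1 (negative) to +1 (positive)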
install nltk:
python3 -m pip install nltk
http://www.nltk.org/
Getting started with nltk:
>>> import nltk
>>> nltk.download()
showing info http://www.nltk.org/nltk_data/
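Calling nltk.download() with no arguments opens an interactive downloader. You can also fetch individual resources non-interactively; the examples below rely on roughly these packages (package names may vary slightly between NLTK versions):
>>> nltk.download('book')                        # corpora behind nltk.book
>>> nltk.download('punkt')                       # sentence/word tokenizer models
>>> nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger model
>>> nltk.download('maxent_ne_chunker')           # named-entity chunker
>>> nltk.download('words')                       # word list used by the chunker
>>> nltk.download('stopwords')                   # stopword lists
>>> nltk.download('names')                       # male/female name lists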
Import book data:
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
You can now reference it:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1.concordance("monstrous")
TRYIT: Try your own searches, e.g. "flower", "elephant"
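Beyond concordance, the Text object offers other quick exploration helpers; a small sketch (output omitted):
>>> text1.similar("monstrous")    # words that appear in similar contexts
>>> text1.count("whale")          # raw count of a token
>>> len(text1)                    # total number of tokens in the text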
Sentence Tokenizing:
>>> import nltk
>>> my_str='''
... Keep looking up! I learn from the past, dream about the future and look up. There's nothing like a beautiful sunset to end a healthy day.
... '''
>>> nltk.sent_tokenize(my_str)
['\nKeep looking up!', 'I learn from the past, dream about the future and look up.', "There's nothing like a beautiful sunset to end a healthy day.\n"]
Tokenizing words:
>>> nltk.word_tokenize(my_str)
['Keep', 'looking', 'up', '!', 'I', 'learn', 'from', 'the', 'past', ',', 'dream', 'about', 'the', 'future', 'and', 'look', 'up', '.', 'There', "'s", 'nothing', 'like', 'a', 'beautiful', 'sunset', 'to', 'end', 'a', 'healthy', 'day', '.']
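With tokens in hand, a frequency distribution is a common next step; a minimal sketch using nltk.FreqDist (output omitted):
>>> fdist = nltk.FreqDist(nltk.word_tokenize(my_str))
>>> fdist.most_common(3)          # the three most frequent tokens, with counts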
Part of Speech Tagging
Tagging each word with its lexical category (part of speech):
>>> text = nltk.word_tokenize(my_str)
>>> nltk.pos_tag(text)
[('Keep', 'NNP'), ('looking', 'VBG'), ('up', 'RP'), ('!', '.'), ('I', 'PRP'), ('learn', 'VBP'), ('from', 'IN'), ('the', 'DT'), ('past', 'JJ'), (',', ','), ('dream', 'NN'), ('about', 'IN'), ('the', 'DT'), ('future', 'NN'), ('and', 'CC'), ('look', 'VB'), ('up', 'RP'), ('.', '.'), ('There', 'EX'), ("'s", 'VBZ'), ('nothing', 'NN'), ('like', 'IN'), ('a', 'DT'), ('beautiful', 'JJ'), ('sunset', 'NN'), ('to', 'TO'), ('end', 'VB'), ('a', 'DT'), ('healthy', 'JJ'), ('day', 'NN'), ('.', '.')]
You can look up the meaning of a tag using nltk.help.upenn_tagset:
>>> nltk.help.upenn_tagset('NNP')
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...
>>> nltk.help.upenn_tagset('VBG')
VBG: verb, present participle or gerund
telegraphing stirring focusing angering judging stalling lactating
hankerin' alleging veering capping approaching traveling besieging
encrypting interrupting erasing wincing ...
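The tags make it easy to filter tokens by lexical category; for example, keeping only the noun-like tokens (tags starting with NN) from the tagged sentence above:
>>> [word for word, tag in nltk.pos_tag(text) if tag.startswith('NN')]
['Keep', 'dream', 'future', 'nothing', 'sunset', 'day']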
Chunking
Chunking builds a tree over the tagged tokens that you can later process to extract the information you need. nltk.ne_chunk chunks named entities:
>>> tag = nltk.pos_tag(text)
>>> nltk.ne_chunk(tag)
Tree('S', [('Keep', 'NNP'), ('looking', 'VBG'), ('up', 'RP'), ('!', '.'), ('I', 'PRP'), ('learn', 'VBP'), ('from', 'IN'), ('the', 'DT'), ('past', 'JJ'), (',', ','), ('dream', 'NN'), ('about', 'IN'), ('the', 'DT'), ('future', 'NN'), ('and', 'CC'), ('look', 'VB'), ('up', 'RP'), ('.', '.'), ('There', 'EX'), ("'s", 'VBZ'), ('nothing', 'NN'), ('like', 'IN'), ('a', 'DT'), ('beautiful', 'JJ'), ('sunset', 'NN'), ('to', 'TO'), ('end', 'VB'), ('a', 'DT'), ('healthy', 'JJ'), ('day', 'NN'), ('.', '.')])
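The result is an nltk.tree.Tree, which you can pretty-print or, if tkinter is available, display graphically:
>>> tree = nltk.ne_chunk(tag)
>>> print(tree)                   # indented s-expression form
>>> tree.draw()                   # opens a graphical tree viewer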
>>> import nltk
>>> for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("His Name was Charlie Chaplin!"))):
... print(chunk)
...
('His', 'PRP$')
('Name', 'NNP')
('was', 'VBD')
(PERSON Charlie/NNP Chaplin/NNP)
('!', '.')
>>> for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("I work in GE located in Fairfield, Connecticut"))):
... print(chunk)
...
('I', 'PRP')
('work', 'VBP')
('in', 'IN')
(ORGANIZATION GE/NNP)
('located', 'VBD')
('in', 'IN')
(GPE Fairfield/NNP)
(',', ',')
(GSP Connecticut/NNP)
Extracting Named Entities with Python 3 (note: many NLTK examples on the web are written for Python 2 and will not run under Python 3; the examples here target Python 3)
import nltk

def extract_entities(text):
    # Tokenize, POS-tag, then chunk named entities
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if isinstance(chunk, nltk.tree.Tree):
            # Keep only PERSON chunks and print each word in them
            for subtree in chunk.subtrees(filter=lambda t: t.label() == 'PERSON'):
                for leaf in subtree.leaves():
                    print(leaf[0])

s = "His name was Charlie Chaplin!"
extract_entities(s)

Output:
$ python3 nltk_name_extract.py
Charlie
Chaplin

Another example, for:
s = "Who is Gula Nurmatova?"
extract_entities(s)

$ python3 nltk_name_extract.py
Gula
Nurmatova
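The same pattern generalizes to other entity types. A sketch (the helper name is mine, not part of NLTK) that collects (label, entity) pairs for whatever labels ne_chunk emits, e.g. PERSON, ORGANIZATION, GPE:

import nltk

def extract_labeled_entities(text):
    # Return (label, entity string) pairs for every named-entity chunk
    entities = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if isinstance(chunk, nltk.tree.Tree):
            entity = " ".join(word for word, tag in chunk.leaves())
            entities.append((chunk.label(), entity))
    return entities

print(extract_labeled_entities("I work in GE located in Fairfield, Connecticut"))
# e.g. [('ORGANIZATION', 'GE'), ('GPE', 'Fairfield'), ('GSP', 'Connecticut')]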
Stopwords
Stopwords are high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing.
Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.
>>> from nltk.corpus import stopwords
>>> stop_words = stopwords.words('english')
>>> stop_words
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
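A minimal sketch filtering the stopwords out of the tokenized sentence from the examples above (my_str):
>>> tokens = nltk.word_tokenize(my_str)
>>> [t for t in tokens if t.lower() not in stop_words]
['Keep', 'looking', '!', 'learn', 'past', ',', 'dream', 'future', 'look', '.', "'s", 'nothing', 'like', 'beautiful', 'sunset', 'end', 'healthy', 'day', '.']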
Names
>>> from nltk.corpus import names
>>> male_names = names.words('male.txt')
>>> male_names
['Aamir', 'Aaron', 'Abbey', 'Abbie', 'Abbot', 'Abbott', 'Abby', 'Abdel', 'Abdul', 'Abdulkarim', 'Abdullah', 'Abe', 'Abel', 'Abelard', 'Abner', 'Abraham', 'Abram', 'Ace', 'Adair', 'Adam', 'Adams', 'Addie', 'Adger', 'Aditya', 'Adlai', 'Adnan', 'Adolf', 'Adolfo', 'Adolph', 'Adolphe', 'Adolpho', 'Adolphus', 'Adrian', 'Adrick', 'Adrien', 'Agamemnon', 'Aguinaldo', 'Aguste', 'Agustin', 'Aharon', 'Ahmad', 'Ahmed', 'Ahmet', 'Ajai', 'Ajay', 'Al', 'Alaa', 'Alain', 'Alan', 'Alasdair', 'Alastair', 'Albatros', 'Albert', 'Alberto', 'Albrecht', 'Alden', 'Aldis', 'Aldo', 'Aldric', 'Aldrich', 'Aldus', 'Aldwin', 'Alec',...
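The corpus ships a matching female list as well; a small sketch (a simple membership check, not a real classifier):
>>> female_names = names.words('female.txt')
>>> 'Adam' in male_names
True
>>> len(set(male_names) & set(female_names))   # some names appear in both lists (output omitted)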
Additional resources:
crossroads.pdf