
COSC - Project 02

Kyle Anderson

Introduction

The goal of this project is to develop tools for performing basic text analysis tasks, such as processing text read from a text file, tokenizing the text, and performing word counts.

%run -i word_count.py

Testing the process_word() Function

We will test the function using a few test strings.

print(process_word('Test!'))
print(process_word('WHAT?:?:?'))
print(process_word(':_(:;hEl\'lo,)123'))
test
what
hel lo
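
The source of process_word() is not shown in this notebook, so the following is only a minimal sketch of a function that would reproduce the three outputs above. The name process_word_sketch and the regex-based approach are assumptions, not the contents of word_count.py.

```python
import re

def process_word_sketch(word):
    # Hypothetical stand-in for process_word(): replace every non-letter
    # character with a space, trim the ends, and lowercase the result.
    cleaned = re.sub(r'[^A-Za-z]', ' ', word)
    return cleaned.strip().lower()

print(process_word_sketch('Test!'))
print(process_word_sketch('WHAT?:?:?'))
print(process_word_sketch(":_(:;hEl'lo,)123"))
```

Interior punctuation becomes a space rather than being dropped, which is why the apostrophe in the third string leaves a gap between "hel" and "lo".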

Testing the process_line() Function

We will use a test string to test the process_line() function.

print(process_line('This is a test string". It\'s fifty-eight characters long!'))
['this', 'is', 'a', 'test', 'string ', 'it s', 'fifty', 'eight', 'characters', 'long ']
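
As an illustration of the tokenizing step, here is a hedged sketch of how a line could be split into cleaned words. It is not the notebook's process_line(): the name process_line_sketch is hypothetical, and unlike the recorded output above it strips trailing spaces from each token.

```python
import re

def process_line_sketch(line):
    # Hypothetical stand-in for process_line(). Split on whitespace and
    # hyphens, so 'fifty-eight' becomes two tokens.
    tokens = re.split(r'[\s\-]+', line)
    words = []
    for token in tokens:
        # Replace non-letters with spaces, then trim and lowercase.
        cleaned = re.sub(r'[^A-Za-z]', ' ', token).strip().lower()
        if cleaned:  # drop tokens that contained no letters
            words.append(cleaned)
    return words

print(process_line_sketch("This is a test string. It's fifty-eight characters long!"))
```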

Processing the File

We will now use the process_file() function to read and process the contents of the file tale_of_two_cities.txt.

words = process_file("tale_of_two_cities.txt")
print("There are {} words contained in the file.".format(len(words)))
There are 137235 words contained in the file.
print(words[:20])
['a', 'tale', 'of', 'two', 'cities', 'a', 'story', 'of', 'the', 'french', 'revolution', 'by', 'charles', 'dickens', 'book', 'the', 'first', 'recalled', 'to', 'life']
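
Reading a whole file amounts to applying the per-line tokenizing to every line and collecting the results in one list. The sketch below assumes that design; process_file_sketch is a hypothetical name, and the demo writes a small temporary file instead of relying on tale_of_two_cities.txt being present.

```python
import os
import re
import tempfile

def process_file_sketch(path):
    # Hypothetical stand-in for process_file(): tokenize each line and
    # accumulate the cleaned, lowercased words in reading order.
    words = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            for token in re.split(r'[\s\-]+', line):
                cleaned = re.sub(r'[^A-Za-z]', ' ', token).strip().lower()
                if cleaned:
                    words.append(cleaned)
    return words

# Demo on a throwaway file rather than the novel itself.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as handle:
    handle.write('A Tale of Two Cities\n\nBook the First--Recalled to Life\n')
    demo_path = handle.name

demo_words = process_file_sketch(demo_path)
os.remove(demo_path)
print(demo_words)
```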

Unique Words

We will now determine the number of unique words in the novel.

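unique_words = find_unique(words)
print('There are {} unique words contained in the file.'.format(len(unique_words)))
There are 137235 unique words contained in the file.
(This output was recorded from a run that discarded the result of find_unique() and printed len(words), so 137235 is the total word count reported earlier, not the number of unique words.)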
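
For reference, one common way to collect unique words while keeping their first-seen order is sketched here. The function name find_unique_sketch is hypothetical, and the notebook's own find_unique() may work differently.

```python
def find_unique_sketch(words):
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return list(dict.fromkeys(words))

sample = ['a', 'tale', 'of', 'two', 'cities', 'a', 'story', 'of']
unique = find_unique_sketch(sample)
print(len(sample), len(unique))
print(unique)
```

The important detail is that the returned list, not the original one, must be passed to len() to get the unique count.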

Word Frequency

We will create a dictionary containing word counts for the words in the novel.

words_list = ['random'] * 210
words_list = words_list + ['letters'] * 25 + ['love'] * 479 + ['meditation'] * 999 + ['extras'] * 1234 + ['travel'] * 679 + ['explore'] * 358
words_freq_dict = find_frequency(words_list)
words_100_1000 = []
for word, count in words_freq_dict.items():
    if count >= 100 and count < 1000:
        words_100_1000.append(word)
final_list_four_strings = words_100_1000[:4]
for word in final_list_four_strings:
    print(f'The word "{word}" appears {words_freq_dict[word]} times in the file.')
The word "random" appears 210 times in the file.
The word "love" appears 479 times in the file.
The word "meditation" appears 999 times in the file.
The word "travel" appears 679 times in the file.
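
The cell above relies on find_frequency() returning a dictionary that maps each word to its count. A minimal sketch of such a function, under the hypothetical name find_frequency_sketch, could look like this:

```python
def find_frequency_sketch(words):
    # Build a word -> count dictionary in a single pass over the list.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

demo = ['the', 'french', 'the', 'revolution', 'the']
freq = find_frequency_sketch(demo)
print(freq)
```

The standard library's collections.Counter does the same job; the explicit loop is shown only to make the counting step visible.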

Most Common Words

We will find and display a list of the 20 most common words found in A Tale of Two Cities.

most_common(freq_dict,20)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/COSC-130/Projects/Project 02/word_count.py in <cell line: 1>()
----> 1 most_common(freq_dict,20)

~/COSC-130/Projects/Project 02/word_count.py in most_common(freq_dict, n)
     52 def most_common(freq_dict,n):
     53     freq_list = []
---> 54     for item in list(freq_dict.items()):
     55         val = (item[1],item[0])
     56         freq_list.append(val)

AttributeError: 'list' object has no attribute 'items'

The call fails because freq_dict is bound to a list at this point. most_common() calls .items() on its first argument, so it needs a dictionary such as the one returned by find_frequency().
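
The traceback shows only the first lines of most_common(): it builds (count, word) tuples from a dictionary. The sketch below completes that idea under stated assumptions. The sorting, the table layout, and the return value are guesses, and most_common_sketch is a hypothetical name.

```python
def most_common_sketch(freq_dict, n):
    # Pair each count with its word so the tuples sort by count first.
    freq_list = [(count, word) for word, count in freq_dict.items()]
    freq_list.sort(reverse=True)
    top = freq_list[:n]
    print('Word            Count')
    print('-' * 21)
    for count, word in top:
        print(f'{word:<16}{count:>5}')
    return top  # returned for convenience; an assumption, not the original

demo_freq = {'the': 3, 'and': 2, 'of': 2, 'a': 1}
top_three = most_common_sketch(demo_freq, 3)
```

Passing a plain list of words instead of a dictionary raises the same AttributeError seen above, because lists have no items() method.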

Stop Words

We will create a list of commonly occurring "stop words" that will be removed from the words list.

stop = process_file('stopwords.txt')
print('There are {} words in our list of stop words.'.format(len(stop)))
There are 668 words in our list of stop words.
print(stop[:50])
['a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent']

Counting Non-Stop Words

We will determine the number of non-stop words and the number of unique non-stop words found in the novel.

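words_ns = remove_stop(words, stop)
unique_ns = find_unique(words_ns)
print('There are {} non-stop words contained in the file.'.format(len(words_ns)))
print('There are {} unique non-stop words contained in the file'.format(len(unique_ns)))
There are 60053 non-stop words contained in the file.
There are 60053 unique non-stop words contained in the file
(The second figure was recorded from a run in which unique_ns was computed with the same remove_stop(words, stop) call as words_ns, so it repeats the non-stop total instead of giving a unique count.)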
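
Filtering stop words and then counting unique survivors are two separate steps. The following sketch shows both on a small sample; remove_stop_sketch is a hypothetical name and the set-based lookup is an assumption about how remove_stop() might be written.

```python
def remove_stop_sketch(words, stop):
    # Convert the stop list to a set so each membership test is fast.
    stop_set = set(stop)
    return [word for word in words if word not in stop_set]

words_demo = ['a', 'tale', 'of', 'two', 'cities', 'a', 'story', 'of',
              'the', 'french', 'revolution', 'tale']
stop_demo = ['a', 'of', 'the', 'two']

words_ns_demo = remove_stop_sketch(words_demo, stop_demo)
# Deduplicate the filtered list to get the unique non-stop words.
unique_ns_demo = list(dict.fromkeys(words_ns_demo))

print(len(words_ns_demo), len(unique_ns_demo))
```

The two lengths differ whenever a non-stop word occurs more than once, as "tale" does in the sample.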

Most Common Non-Stop Words

We will display the 20 most commonly occurring non-stop words.

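freq_ns = find_frequency(words_ns)
most_common(freq_ns, 20)
(The table below was recorded from a run that passed words, not the stop-filtered list, to find_frequency(), so it still lists stop words such as "the" and "and".)
Word      Count
----------------
the        7943
and        4841
of         3969
to         3520
a          2909
in         2507
his        1985
he         1762
that       1709
was        1686
i          1471
it         1426
with       1281
had        1279
as         1128
at         1005
you         952
for         888
her         859
on          846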

Counting Words By Length

We will display information concerning the distribution of word lengths in the novel. The counts below cover every word occurrence (they sum to 137235, the total word count), not only unique words.

count_by_length(words)
Length    Count
----------------
17            4
16           18
15           29
14          154
13          380
12          740
11         1389
10         2587
9          4411
8          6736
7          9524
6         12253
5         16153
4         24366
3         31461
2         22559
1          4471
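
A length distribution like the one above can be produced by tallying len(word) for each entry. This is a sketch only: count_by_length_sketch is a hypothetical name, and the table layout is a guess at the original's formatting.

```python
def count_by_length_sketch(words):
    # Tally how many entries have each length.
    counts = {}
    for word in words:
        counts[len(word)] = counts.get(len(word), 0) + 1
    print('Length    Count')
    print('-' * 16)
    for length in sorted(counts, reverse=True):
        print(f'{length:<6}{counts[length]:>10}')
    return counts

demo_words = ['a', 'tale', 'of', 'two', 'cities', 'a', 'story', 'of']
all_counts = count_by_length_sketch(demo_words)
# Passing a deduplicated list restricts the tally to unique words.
unique_counts = count_by_length_sketch(list(dict.fromkeys(demo_words)))
```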

Longest Words

We will display the longest several words found in the novel.

words = process_file("tale_of_two_cities.txt")
sorted(words, key=len, reverse=True)
['incommodiousness ', 'acknowledgments ', 'undistinguishable', 'enthusiastically ', 'incomprehensible', 'disinterestedly ', 'accomplishments ', 'characteristics ', 'disinterestedly ', 'dissatisfaction ', ' extermination ', 'correspondence ', 'congratulations ', 'congratulations ', 'crystallisation ', 'representations ', 'retrospectively ', 'disproportionate', ' transparently ', 'superintendence ', 'transformations ', 'disappointment ', ' supernaturally', 'characteristic ', 'unconsciousness', 'contemporaries ', 'understanding ', ' nevertheless ', 'identification ', 'accomplishments', 'notwithstanding', 'convulsionists ', ' comparatively ', 'correspondingly', 'disrespectfully', 'corroborations ', 'notwithstanding', 'demonstrations ', 'circumstances ', 'affectionately ', 'professionally ', 'inconsistencies', 'counterweight ', 'contemptuously ', 'correspondingly', 'straightforward', 'uncompromising ', ' nevertheless ', 'extermination ', 'circumstances ', 'countersigned ', 'communications', 'circumstances ', 'contradicting ', 'distrustfully ', 'consideration ', 'individuality ', 'contradictory ', 'congratulated ', 'simultaneously', 'demonstrations', 'circumstances ', 'establishment ', 'transformation', 'uncontrollable', 'insupportable ', 'neighbourhood ', 'unquestionably', 'objectionable ', 'embellishment ', 'objectionable ', 'inconvenience ', 'communications', 'legislation s ', 'establishment ', 'neighbourhood ', 'aggerawayter ', 'inconsistency ', 'systematically', 'disappointment', 'correspondence', 'correspondence', 'unwillingness ', 'imprisonment ', 'strengthened ', 'acknowledgment', 'congratulatory', 'congratulating', 'conversations ', 'acknowledgment', 'satisfaction ', 'eccentricities', 'characteristic', 'circumstances ', 'unquestionably', 'understanding ', 'embarrassments', 'ecclesiastics ', 'circumference ', 'circumference ', 'circumstances ', 'inconvenience ', 'obsequiousness', 'indefinitely ', 'consideration ', 'inconvenience ', 'mismanagement ', 
'communication ', 'conventionally', 'undergraduates', 'deferentially ', 'apprehensions ', 'responsibility', 'circumstances ', 'incorrigible ', 'incorrigible ', 'uncomfortable ', 'circumstances ', 'confidentially', 'neighbourhood ', 'apostrophising', 'superstitious ', 'perseveringly ', 'commencement ', 'perpendicular ', ' magnificent ', ' magnificent ', 'superciliously', 'affectionately', 'pronunciation ', 'neighbourhood ', 'unimpeachable ', 'susceptibility', 'neighbourhood ', 'remembrances ', 'identification', 'disappearance ', 'distinctions ', ' comparatively', 'consideration ', 'accompaniment ', 'affectionately', 'unfortunately ', 'worthlessness ', 'unaccountable ', 'conflagration ', 'sardanapalus s', 'dissimulation ', 'inconveniences', 'impossibility ', 'youthfulness ', 'systematically', 'incompleteness', 'considerations', 'identification', 'unprecedented ', 'encouragement ', 'metempsychosis', 'precipitation ', 'responsibility', 'respectability', 'unintelligible', 'considerations', 'entertainments', 'unnecessarily ', 'determination ', 'individually ', 'predominating ', 'consternation ', 'tergiversation', 'ostentatiously', 'inscrutability', 'conciergerie ', 'conciergerie ', 'disappointment', 'inquisitively ', 'encouragement ', 'circumstantial', 'handkerchiefs ', 'circumstantial', 'communications', 'circumstances ', 'circumstances ', 'consciousness ', 'establishment ', ' magnificent ', ' extermination', 'contemptuously', 'extermination ', 'unquestionably', 'responsibility', 'forgetfulness ', 'unintelligible', 'inarticulately', 'gesticulation ', 'contemptuously', 'commencement ', 'determination ', 'circumstances ', 'consideration ', 'expectations ', 'apprehensions ', ' nevertheless ', 'englishwoman ', 'unintelligible', 'upholsterers ', 'blunderbusses', 'miscellaneous', 'expeditiously', 'supplementary', 'consolidation', 'stubbornness ', 'consciousness', 'destinations ', 'establishment', 'consequently ', 'accommodation', 'unaccountably', 'neighbourhood', 
'argumentative', 'confidential ', 'determination', 'compassionate', 'extraordinary', 'restoratives ', 'handkerchiefs', 'companionship', 'illustrations', 'significance ', 'expostulation', 'accidentally ', 'deliberately ', 'decomposition', 'methodically ', 'accompaniment', 'construction ', 'information ', 'intermission ', 'endeavouring ', 'interruption ', 'bewilderment ', 'instinctively', 'incommodious ', 'spectacularly', 'exasperation ', 'aggerawayter ', 'aggerawayter ', 'circumstance ', 'establishment', 'establishment', 'entertainment', 'traitorously ', 'adverbiously ', 'satisfaction ', 'understanding', 'indescribable', 'observations ', 'undiscovered ', 'consciousness', 'unimpeachable', 'determination', 'disparagement', 'unimpeachable', 'understanding', 'conversation ', 'unconsciously', 'imprisonment ', 'consideration', 'nevertheless ', 'commiseration', 'communication', 'uncomfortably', 'indifference ', 'circumstance ', 'disconcerted ', 'exaggeration ', 'propensities ', 'unscrupulous ', 'despondency ', 'botheration ', 'perseverance ', 'occasionally ', 'eccentricity ', 'arrangements ', 'restlessness ', 'handkerchief ', 'marvellously ', 'contrivances ', 'conversation ', 'communicated ', 'stipulations ', 'monseigneur s', 'monseigneur ', 'uncomfortable', 'transmutation', 'unfashionable', 'monseigneur s', 'demonstration', 'monseigneur s', 'circumference', 'compressions ', 'consideration', 'extraordinary', 'ecclesiastic ', 'steadfastness', 'appointments ', ' monseigneur ', ' monseigneur ', ' monseigneur ', ' monseigneur ', ' monseigneur ', ' monseigneur ', ' monseigneur ', 'inexperienced', ' monseigneur ', ' monseigneur ', ' monseigneur ', ' monseigneur ', 'monseigneur ', 'extinguished ', ' monseigneur ', 'illustrations', 'nevertheless ', ' monseigneur ', ' monseigneur ', 'instructions ', 'imperturbable', 'conversation ', 'circumstances', 'disadvantage ', 'regeneration ', 'concentration', 'indifference ', 'supposition ', 'indifferently', 'circumstances', 'compassionate', 
'assassination', 'unwillingness', 'consideration', 'respectfully ', 'circumstances', 'distractions ', 'occasionally ', 'immediately ', 'sensitiveness', 'understanding', 'preliminaries', 'unaccountably', 'perpendicular', 'remonstrance ', 'magnificently', 'appreciative ', 'disrespectful', 'characterised', 'disappointed ', 'unattainable ', 'embarrassment', 'profligates ', 'apprehension ', 'companionship', 'significance ', 'nevertheless ', 'handkerchief ', 'satisfaction ', 'entertainment', 'distinguished', 'apprehension ', 'disadvantage ', 'indispensable', 'entertainment', 'accomplished ', 'circumstances', 'indispensable', 'unwillingness', 'authoritative', 'destruction ', 'embarrassment', 'disconcerting', 'disconcerting', 'intoxication ', 'reflectively ', 'confirmation ', 'littlenesses ', 'intelligences', 'handkerchief ', 'acknowledged ', 'thoughtfully ', 'perquisitions', 'complimented ', 'revolutionary', 'satisfaction ', 'inhabitants ', 'indifference ', 'objectionable', 'unselfishness', 'perpendicular', 'imprisonment ', 'determination', 'reconcilement', 'consideration', 'demonstrative', 'astonishment ', 'corresponding', 'intelligence ', 'unenlightened', 'extraordinary', 'circumstances', 'extraordinary', 'circumstances', 'illustration ', 'consistently ', 'substituting ', 'illustration ', 'unornamental ', 'unsubstantial', 'recklessness ', 'consideration', 'sensibilities', 'unreasonable ', 'displacements', 'impracticable', 'bewilderment ', 'incoherences ', 'occasionally ', 'contemplating', 'complimentary', 'nevertheless ', 'nevertheless ', 'arrangements ', 'disappearance', 'intelligible ', 'mechanically ', 'illuminating ', 'functionary s', 'temperament ', 'functionaries', 'functionaries', 'successfully ', 'confiscation ', 'intelligence ', 'exterminating', 'accomplishing', 'extraordinary', 'contamination', 'sequestration', 'reproachfully', 'circumstances', 'instructions ', 'unsuspicious ', 'accomplished ', 'anticipation ', 'communication', 'inappropriate', 
'extraordinary', 'subordinates ', 'embroidering ', 'inappropriate', 'extravagantly', 'commiseration', 'compassionate', 'unwholesomely', 'monseigneur s', 'sequestrated ', 'monseigneur s', 'circumstances', 'indescribable', 'irrepressible', 'consciousness', 'consideration', 'acquiescence ', 'instinctively', 'inconsistency', 'contradiction', 'revolutionary', 'revolutionary', 'indispensable', 'uncertainties', 'recognition ', 'inappropriate', 'conciergerie ', 'conciergerie ', 'intoxication ', 'boastfulness ', 'circumstances', 'disapproving ', 'anticipating ', 'precipitating', 'instructions ', 'imprisonment ', 'circumstances', 'extraordinary', 'extraordinary', 'remonstrated ', 'compassionate', 'imprisonment ', 'remonstrated ', 'emphatically ', 'conciergerie ', 'indispensable', 'disappointing', 'interrupting ', 'condescension', 'contemplating', 'conversation ', 'satisfaction ', 'reassurances ', 'relationship ', 'communication', 'acknowledged ', 'conciergerie ', 'imprisonment ', 'circumstances', 'discreditable', 'contemplating', ' provincial ', 'extraordinary', 'denunciation ', 'communication', ' agicultooral', 'nevertheless ', 'prevaricate ', 'unexpectedly ', 'thoroughfares', 'thoroughfare ', 'consideration', 'proscription ', 'imprisonment ', 'commendations', ' gentlemen ', ' gentlemen ', 'contemplation', 'indifferently', 'encouragement', 'tranquillised', 'compassionate', 'indifference ', 'determination', 'extraordinary', 'consciousness', 'communication', 'consideration', 'consideration', 'conversation ', 'compassionate', 'anathematised', 'denunciation ', 'conciergerie ', 'demonstration', 'grandchildren', 'demonstration', 'consequences ', 'conversation ', 'inquisitively', 'conversation ', 'compassionate', 'conciergerie ', 'indifference ', 'condemnation ', 'nevertheless ', 'consideration', 'condemnation ', 'extinguished ', 'imprisonment ', 'relinquished ', 'unaccountably', 'supernatural ', 'accomplished ', 'contemplating', 'astonishment ', 'intelligibly ', 'revolutionary', 
'exterminated ', 'annihilation ', 'protestations', 'contemplation', 'demonstrative', 'consultation ', 'arrangements ', 'perturbation ', 'irrepressible', 'disfigurement', 'unchangeable ', 'uncomplaining', 'disfigurement', 'foolishness ', 'incredulity ', 'arrangements', 'westminster ', 'originality ', 'achievements', 'unceasingly ', 'ammunition ', 'requisition ', 'housebreaker', 'greatnesses ', 'combination ', 'confidential', 'blunderbuss ', 'communicated', 'expectation ', 'floundering ', 'blunderbuss ', 'completeness', 'occasionally', 'coincidence ', 'unfathomable', 'perpetuation', 'personality ', 'inheritance ', 'inscrutables', 'underground ', 'lamentation ', 'understand ', 'successfully', 'congratulate', 'disagreeable', 'neighbouring', 'confidential', 'flourishing ', 'destruction ', 'particularly', 'lamplighter ', 'satisfaction', 'immediately ', 'convenience ', 'desperation ', ' remembering', 'intelligence', ' naturally ', 'thoughtfully', 'acquirements', 'unnecessary ', 'supplicatory', 'disappeared ', 'collectedly ', 'encouraging ', 'communicated', 'particularly', 'credentials ', 'comprehended', 'disconcerted', 'earthenware ', 'embankments ', 'playfulness ', 'countenance ', 'confidential', 'lamplighter ', 'obliterating', 'temperament ', 'acknowledged', ' gentlemen ', 'accessories ', 'unaccustomed', 'uncorrupted ', 'aspirations ', 'reassurance ', 'underground ', 'transparent ', 'intelligence', 'attentively ', 'concentrated', 'undisturbed ', 'intelligence', 'occasionally', 'recollection', 'travellers ', 'particulars ', 'respectable ', 'disinherited', 'improvements', 'respectable ', 'necessitated', 'neighbouring', 'extemporised', 'professions ', 'accordingly ', 'whitefriars ', 'counterpane ', 'circumstance', 'countermined', 'circumwented', 'indignation ', 'considerable', 'reversionary', 'deliberately', 'superscribed', 'quartering ', ' barbarous ', 'destination ', 'continually ', 'institution ', 'institution ', 'transactions', 'illustration', 'consequence ', 
'concentrated', 'fascination ', 'illustrious ', 'illustrious ', 'illustrious ', 'illustrious ', 'circuitously', 'reflections ', 'immediately ', ' witnesses ', 'prisoner s ', 'benefactors ', 'countenances', 'communicated', 'preparation ', 'handwriting ', 'prosecution ', 'precautions ', 'asseveration', 'anticipation', 'insinuation ', 'information ', 'coincidence ', 'particularly', 'coincidence ', 'coincidences', 'passengers ', 'passengers ', ' happening ', 'conversation', 'conversation', 'particular ', 'accordingly ', 'conversation', 'circumstance', 'information ', 'sufficiently', 'illustration', 'politenesses', 'disreputable', 'earnestness ', 'imprisonment', 'refreshment ', ' acquitted ', 'intellectual', 'unacquainted', 'prosecution ', 'extinguished', 'interchanged', 'proceedings ', 'appearances ', 'impediments ', 'deliberating', 'disagreeable', 'particularly', 'affirmative ', 'disappointed', 'particularly', 'particularly', 'commiserated', 'consolation ', 'bacchanalian', 'continuously', 'conferences ', 'occasionally', 'administered', 'apostrophise', 'shrewsbury ', 'unfavourable', 'consequence ', 'experiments ', 'communicated', 'acquaintance', ' dissociated', 'arrangements', 'immeasurably', 'compunction ', 'remembrance ', 'confidence ', 'imagination ', 'monotonously', 'resoundingly', 'arrangements', 'impoverished', 'cinderella s', 'exceedingly ', 'replenished ', 'unfrequently', 'alterations ', 'inscriptions', 'inscription ', 'clerkenwell ', 'clerkenwell ', 'monseigneur ', 'sanctuaries ', 'monseigneur ', 'monseigneur ', 'represented ', 'monseigneur ', 'circumstance', 'consequently', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'philosophers', 'monseigneur ', 'indifference', 'monseigneur ', 'notabilities', 'monseigneur ', 'intelligible', 'accordingly ', 'artificially', 'scarecrows ', 'executioner ', 'humiliation ', 'occasionally', 'countenance ', 'recklessness', 'difficulties', 'desperation ', 'watchfulness', 
'philosopher ', 'accidentally', 'sufficiently', 'contemptuous', 'conspicuous ', 'circumstance', 'unswallowed ', 'superstition', 'monseigneur ', 'monseigneur ', 'suffocated ', 'monseigneur ', 'felicitously', 'accompanying', 'examination ', 'precipitated', 'monseigneur ', 'monseigneur ', 'unchangeable', 'monseigneur ', 'monseigneur ', 'caressingly ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'impartially ', 'balustrades ', 'sufficiently', 'remonstrance', 'extinguisher', 'preparation ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'monseigneur ', 'understand ', 'overshadowed', 'importunity ', ' detestation', 'compliment ', 'thoughtfully', ' meanwhile ', 'perpetuating', 'impenitently', 'particularly', 'authorities ', 'perseverance', 'expectation ', 'overshadowed', 'trustfulness', 'restoration ', 'possibility ', 'oppressions ', 'intermediate', 'atmospheric ', 'application ', 'intelligible', 'disagreeable', 'intentions ', 'announcement', 'ostentatious', 'friendliness', 'designation ', 'distinction ', 'astonished ', 'astonished ', 'accordingly ', 'perspective ', 'prosperous ', 'prosperous ', 'crestfallen ', 'forensically', 'everything ', 'overbearing ', 'deliberately', 'characterise', 'representing', 'dissatisfied', 'afterwards ', 'accordingly ', 'conversation', 'forbearance ', 'overshadowed', 'architecture', 'purposeless ', 'compassion ', 'confidence ', 'undeserving ', 'attributable', 'conversation', 'supplication', 'supplication', 'processions ', 'unprosperous', 'opportunity ', 'vociferating', 'acclamation ', 'caricaturing', 'accomplished', 'undertakers ', 'neighbouring', 'altogether ', 'unfrequently', 'conversation', 'conversation', 'reflections ', 'injunctions ', 'extinguished', 'inconsistent', 'circumstance', 'calculations', 'resurrection', 'resurrection', 'bacchanalian', 'distribution', 'appointment ', 'unreasonable', 'performance ', 'interrupted ', 'encountered ', 'consequently', 'countryman s', 'monseigneur 
', 'perspiration', 'registered ', ' judiciously', 'particularly', 'additionally', 'additionally', 'concentrated', 'composition ', 'commissioned', 'correctness ', 'complacently', 'interfering ', 'earthquake ', 'consolation ', 'opportunity ', 'handkerchief', 'assiduously ', 'promenading ', 'purposeless ', 'unfortunate ', 'embarrassing', 'occasionally', 'associations', 'intelligence', 'trustworthy ', 'handkerchief', 'incomplete ', 'cheerfulness', 'unbearable ', 'remembrances', 'sufficiently', 'collection ', 'unhandsomely', 'warwickshire', 'thereabouts ', 'neighbouring', 'preparations', 'inquiringly ', 'interrupted ', 'mechanically', 'practicable ', 'attentively ', 'arrangements', 'overstepping', 'attentively ', 'particularly', 'understand ', 'collectedly ', 'originally ', 'thoughtful ', 'affectionate', 'apprehension', 'impossible ', ...]
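
The listing above repeats words and includes entries padded with spaces, which inflates their lengths. As a hedged sketch of how to show only "the longest several" words, the function below trims whitespace, removes duplicates, and returns the top n. The name longest_words_sketch and the alphabetical tie-break are assumptions.

```python
def longest_words_sketch(words, n):
    # Trim padding and deduplicate before comparing lengths.
    unique = {word.strip() for word in words}
    # Sort longest first; break ties alphabetically for a stable result.
    return sorted(unique, key=lambda w: (-len(w), w))[:n]

demo = ['incommodiousness ', 'undistinguishable', ' monseigneur ',
        'monseigneur', 'disinterestedly ', 'disinterestedly ']
print(longest_words_sketch(demo, 3))
```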

Counting Words by First Letter

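We will display the number of words with each possible first letter. The counts below cover every word occurrence (they sum to 137235, the total word count), not only unique words.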

count_by_first(words)
Letter    Count
----------------
z             2
y          2134
x            21
w          9409
v           698
u          1551
t         20484
s          9444
r          2793
q           304
p          3657
o          8163
n          3029
m          6236
l          4068
k           858
j           467
i          8877
h         11605
g          2183
f          4854
e          2274
d          4683
c          4975
b          5798
a         15435
(blank)    3233
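
Counting by first letter follows the same tallying pattern as counting by length. The sketch below is an assumption about how count_by_first() might work, under the hypothetical name count_by_first_sketch. It also shows how entries with a leading space end up in a bucket of their own, which may explain the final unlabeled row above.

```python
def count_by_first_sketch(words):
    # Tally entries by their first character.
    counts = {}
    for word in words:
        if word:  # skip empty strings, which have no first character
            counts[word[0]] = counts.get(word[0], 0) + 1
    print('Letter    Count')
    print('-' * 16)
    for letter in sorted(counts, reverse=True):
        print(f'{letter!r:<6}{counts[letter]:>10}')
    return counts

demo = ['tale', 'two', 'the', 'cities', 'charles', ' monseigneur ']
first_counts = count_by_first_sketch(demo)
```

Stripping each word before tallying would fold the space bucket back into the proper letters.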