Building a Syllable Dictionary with Python

As most projects do, this one started small. I already had a way to preview Markdown files: Proofer. Beyond the preview, Proofer counts words, sentences, and paragraphs, averages those values, and estimates reading time. The script also flags complex or overused phrases and highlights repeated words, "be" verbs, and adverbs that make for weak writing. It calculates the Gunning Fog Index, too, along with scores for the Flesch-Kincaid reading ease and grade level tests. I built this script to give me a few more features than Marked, which I had used to proof my writing for years. It worked well, but runtime shot up when it had to process posts with more than a few hundred words. I set out to fix this, and ended up building an entire syllable dictionary in Python. Proofer has still not gotten any faster.

Proofer had no trouble with small files, but runtime climbed with word count, so I started hunting for performance wins in the 129-line method that counted syllables. Called for every word in the source file but optimized for neither speed nor memory use, it seemed like a good place to start.
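Those readability scores are also why syllables matter so much here. For reference, this is the arithmetic behind the two Flesch-Kincaid numbers Proofer reports; the formulas are the standard published ones, and the totals below are made up for illustration:

# Standard Flesch-Kincaid formulas; the totals here are hypothetical.
words = 1200      # total words in the post
sentences = 75    # total sentences
syllables = 1900  # total syllables: one syllable-counter call per word

# Flesch reading ease: higher scores mean easier reading
reading_ease = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Flesch-Kincaid grade level: an approximate US school grade
grade_level = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

print(round(reading_ease, 1), round(grade_level, 1))

That syllables total is the expensive part: the counter runs once for every word in the file.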

The Original Syllable Counter

Counting syllables seems easy. Spend some time looking for examples of others doing this, though, and you will soon realize that the English language makes this quite hard. I did not want to sink a ton of time into this, so I used this script from Emre Aydin.

import re # sylco relies on regular expressions for vowel matching

def sylco(word) :

    word = word.lower()

    # exception_add are words that need extra syllables
    # exception_del are words that need fewer syllables

    exception_add = ['serious','crucial']
    exception_del = ['fortunately','unfortunately']

    co_one = ['cool','coach','coat','coal','count','coin','coarse','coup','coif','cook','coign','coiffe','coof','court']
    co_two = ['coapt','coed','coinci']

    pre_one = ['preach']

    syls = 0 #added syllable number
    disc = 0 #discarded syllable number

    #1) if letters < 3 : return 1
    if len(word) <= 3 :
        syls = 1
        return syls

    #2) if doesn't end with "ted" or "tes" or "ses" or "ied" or "ies", discard "es" and "ed" at the end.
    # if it has only 1 vowel or 1 set of consecutive vowels, discard. (like "speed", "fled" etc.)
    if word[-2:] == "es" or word[-2:] == "ed" :
        doubleAndtripple_1 = len(re.findall(r'[eaoui][eaoui]',word))
        if doubleAndtripple_1 > 1 or len(re.findall(r'[eaoui][^eaoui]',word)) > 1 :
            if word[-3:] == "ted" or word[-3:] == "tes" or word[-3:] == "ses" or word[-3:] == "ied" or word[-3:] == "ies" :
                pass
            else :
                disc+=1

    #3) discard trailing "e", except where ending is "le"
    le_except = ['whole','mobile','pole','male','female','hale','pale','tale','sale','aisle','whale','while']

    if word[-1:] == "e" :
        if word[-2:] == "le" and word not in le_except :
            pass
        else :
            disc+=1

    #4) check if consecutive vowels exist, triplets or pairs, count them as one.
    doubleAndtripple = len(re.findall(r'[eaoui][eaoui]',word))
    tripple = len(re.findall(r'[eaoui][eaoui][eaoui]',word))
    disc+=doubleAndtripple + tripple

    #5) count remaining vowels in word.
    numVowels = len(re.findall(r'[eaoui]',word))

    #6) add one if starts with "mc"
    if word[:2] == "mc" :
        syls+=1

    #7) add one if ends with "y" but is not surrounded by vowel
    if word[-1:] == "y" and word[-2] not in "aeoui" :
        syls+=1

    #8) add one if "y" is surrounded by non-vowels and is not the last letter.
    for i,j in enumerate(word) :
        if j == "y" :
            if (i != 0) and (i != len(word)-1) :
                if word[i-1] not in "aeoui" and word[i+1] not in "aeoui" :
                    syls+=1

    #9) if starts with "tri-" or "bi-" and is followed by a vowel, add one.
    if word[:3] == "tri" and word[3] in "aeoui" :
        syls+=1

    if word[:2] == "bi" and word[2] in "aeoui" :
        syls+=1

    #10) if ends with "-ian", should be counted as two syllables, except for "-tian" and "-cian"
    if word[-3:] == "ian" :
        if word[-4:] == "cian" or word[-4:] == "tian" :
            pass
        else :
            syls+=1

    #11) if starts with "co-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.
    if word[:2] == "co" and word[2] in 'eaoui' :
        if word[:4] in co_two or word[:5] in co_two or word[:6] in co_two :
            syls+=1
        elif word[:4] in co_one or word[:5] in co_one or word[:6] in co_one :
            pass
        else :
            syls+=1

    #12) if starts with "pre-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.
    if word[:3] == "pre" and word[3] in 'eaoui' :
        if word[:6] in pre_one :
            pass
        else :
            syls+=1

    #13) check for "-n't" and cross match with dictionary to add syllable.
    negative = ["doesn't", "isn't", "shouldn't", "couldn't","wouldn't"]

    if word[-3:] == "n't" :
        if word in negative :
            syls+=1
        else :
            pass

    #14) Handling the exceptional words.
    if word in exception_del :
        disc+=1

    if word in exception_add :
        syls+=1

    # calculate the output
    return numVowels - disc + syls

This seemed like a lot just to count a few syllables. It made poor use of memory, too, by doing things like allocating variables it then used only once. "Python programmers shouldn't worry about things like efficiency and memory usage"--but I do. I knew I could do better, and so I did.

The Refactored Syllable Counter

101
from re import findall # the counter calls findall without the re. prefix

# Method: MySyllableCount
# Purpose: Accept a word and return the number of syllables. Modded for speed.
# Parameters:
# - word: Word to be parsed. (String)
# H/t: https://github.com/eaydin/sylco
def MySyllableCount(word):
    word = word.lower()

    syls = 0 # Number of added syllables
    disc = 0 # Number of discarded syllables

    #1) If three or fewer letters, one syllable
    if len(word) <= 3:
        return 1

    #2) If doesn't end with "ted" or "tes" or "ses" or "ied" or "ies", discard "es" and "ed" at the end.
    # if it has only 1 vowel or 1 set of consecutive vowels, discard. (like "speed", "fled" etc.)
    double_vowel = len(findall(r'[eaoui][eaoui]',word))

    if word[-2:] in ["es", "ed"]:
        if double_vowel > 1 or len(findall(r'[eaoui][^eaoui]',word)) > 1:
            if word[-3:] in ["ted", "tes", "ses", "ied", "ies"]:
                pass
            else:
                disc+=1

    #3) discard trailing "e", except where ending is "le"
    if word[-1:] == "e":
        if word[-2:] == "le" and word not in ['whole','mobile','pole','male','female','hale','pale','tale','sale','aisle','whale','while']:
            pass
        else:
            disc+=1

    #4) check if consecutive vowels exist, triplets or pairs, count them as one.
    disc += double_vowel + len(findall(r'[eaoui][eaoui][eaoui]',word))

    #5) count remaining vowels in word.
    numVowels = len(findall(r'[eaoui]',word))

    #6) add one if starts with "mc"
    if word[:2] == "mc":
        syls+=1

    #7) add one if ends with "y" but is not surrounded by vowel
    if word[-1:] == "y" and word[-2] not in "aeoui":
        syls+=1

    #8) add one if "y" is surrounded by non-vowels and is not the last letter.
    for i,j in enumerate(word):
        if j == "y":
            if (i != 0) and (i != len(word)-1):
                if word[i-1] not in "aeoui" and word[i+1] not in "aeoui":
                    syls+=1

    #9) if starts with "tri-" or "bi-" and is followed by a vowel, add one.
    if word[:3] == "tri" and word[3] in "aeoui":
        syls+=1

    if word[:2] == "bi" and word[2] in "aeoui":
        syls+=1

    #10) if ends with "-ian", should be counted as two syllables, except for "-tian" and "-cian"
    if word[-3:] == "ian":
        if word[-4:] in ["cian", "tian"]:
            pass
        else:
            syls+=1

    #11) if starts with "co-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.
    if word[:2] == "co" and word[2] in 'eaoui':
        if (set(['coapt','coed','coinci']).intersection([word[:4], word[:5], word[:6]])):
            syls+=1
        elif (set(['cool','coach','coat','coal','count','coin','coarse','coup','coif','cook','coign','coiffe','coof','court']).intersection([word[:4], word[:5], word[:6]])):
            pass
        else:
            syls+=1

    #12) if starts with "pre-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.
    if word[:3] == "pre" and word[3] in 'eaoui':
        if word[:6] in ['preach']:
            pass
        else:
            syls+=1

    #13) check for "-n't" and cross match with dictionary to add syllable.
    if word[-3:] == "n't":
        if word in ["doesn't", "isn't", "shouldn't", "couldn't","wouldn't"]:
            syls+=1

    #14) Handling the exceptional words.
    # exception_del are words that need fewer syllables
    if word in ['fortunately','unfortunately']:
        disc+=1
    # exception_add are words that need extra syllables
    if word in ['serious','crucial']:
        syls+=1

    # calculate the output
    return numVowels - disc + syls

Take comments and whitespace out, and Emre's counter comes in at 70 lines. My version sits at 66 lines but cuts runtime by 9%. It pays to worry about things like efficiency and memory usage after all. As I marveled at my own brilliance, though, I realized that this might not matter. Emre wrote this back in 2016 and made no guarantee of its accuracy. Cutting runtime by any amount meant nothing if the algorithm did not work. Before I spent any more time trimming an unproven script, I needed to test it against a known good source; I needed to test it against a dictionary.
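For what it's worth, the timing comparison is easy to reproduce; here is a minimal sketch using the standard library's timeit, with a made-up word sample rather than my actual posts:

import timeit

# Hypothetical sample; my real comparison fed both counters entire posts.
sample = ("photosynthesis university preposterous coincide tale cat " * 500).split()

old = timeit.timeit(lambda: [sylco(w) for w in sample], number=100)
new = timeit.timeit(lambda: [MySyllableCount(w) for w in sample], number=100)

print("sylco: %.3fs MySyllableCount: %.3fs saved: %.1f%%" % (old, new, (old - new) / old * 100))

Speed was never the real question, though; accuracy was, and that meant finding a dictionary to test against.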

This seemed like an easy task:

  1. Read a word from the dictionary
  2. Use the algorithm to count its syllables
  3. Compare the algorithm's count to the dictionary's

As long as the algorithm gave me the right syllable count most of the time, I could speed it up, make it more accurate, and then repeat that process as many times as I wanted. I ran into problems starting with step one, though: UNIX has no native dictionary. It does have a wordlist, with over 200,000 words, but I needed a dictionary to give me valid words *and* the number of syllables they contained. Without that number from an authoritative source, I had nothing to test the algorithm's accuracy.
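Steps two and three, at least, would be trivial once step one was solved. A minimal sketch of the comparison, assuming a file of word,count lines like the syllable dictionary this project eventually produced:

# Score MySyllableCount against a dictionary of "word,count" lines.
# "webS" is the master file I ended up with; -1 marks words the scraper missed.
hits = misses = 0
with open("./webS", "r") as fd:
    for line in fd:
        word, count = line.strip().split(",")
        if int(count) == -1:
            continue # no authoritative count for this word; skip it
        if MySyllableCount(word) == int(count):
            hits += 1
        else:
            misses += 1

print("accuracy: %.1f%%" % (hits / (hits + misses) * 100))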

Because I needed my script to pull words and syllable counts from the dictionary, I needed something with a command line interface. The built-in dictionary app on my Mac, Dictionary.app, seemed like an easy answer, but supports only [basic programmatic access](https://developer.apple.com/library/archive/documentation/UserExperience/Conceptual/DictionaryServicesProgGuide/access/access.html). dictd looked like it fit the bill, but I had trouble getting it to install. It "succeeded", but never ran. A third-party dictionary violated my self-reliance principle anyway, so I cleared all the install files and went back to my old standby: web scraping.
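For the curious, "basic" looks like this: with the PyObjC bindings that ship with the Mac's system Python, Dictionary Services hands back a flat definition string and nothing more, so there is no structured syllable breakdown to parse. A sketch of that limitation, not code from Proofer:

# Requires the PyObjC DictionaryServices binding bundled with macOS.
import DictionaryServices

word = "syllable"
# DCSCopyTextDefinition(dictionary, text, range) -> definition text or None
print(DictionaryServices.DCSCopyTextDefinition(None, word, (0, len(word))))

With nothing structured to pull syllables from, scraping won by default.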

My script started out simple. Read words from the UNIX wordlist at /usr/share/dict/words, get a syllable breakdown from Google or HowManySyllables.com, and record the result in a new wordlist. Check out my first try below.

Syllable Dictionary Builder, version 1

from os import remove        # delete the temp file during cleanup
from os.path import exists   # check for an existing syllable dictionary

# Method: BuildSyllableDictionary
# Purpose: Enrich wordlist with true syllables from a dictionary
# Parameters: none.
def BuildSyllableDictionary():
    # Create a new instance of the web scraper (my own helper class)
    s = Scraper()

    # Open the wordlist
    s_fd = open("/usr/share/dict/words", "r")

    # Create the syllable dictionary file if it does not exist.
    if (not exists("./syllable_dictionary.txt")):
        open("./syllable_dictionary.txt", "w").close()

    # Open the syllable dictionary
    c_fd = open("./syllable_dictionary.txt", "r")

    # Clear and open a temporary output syllable dictionary, to write to.
    open("./syllable_dictionary.txt.bak", "w").close()
    d_fd = open("./syllable_dictionary.txt.bak", "a")

    # Enumerate the wordlist
    for i, word in enumerate(s_fd):
        # Convert wordlist word to lowercase and strip newline
        word = word.lower().strip()

        # Read a line from the syllable dictionary
        comp = c_fd.readline().split(",")[0].strip()

        # If the word in the wordlist has already been added to the syllable
        # dictionary, skip it.
        if (word == comp):
            continue

        # Query Google for the definition of the wordlist's word. Depending on
        # the user agent string, the number of syllables may be in a span or
        # div. If neither element is in the response, try another source.
        resp = s.scrape("https://google.com/search?q=define%20"+word)
        if ('<span data-dobid="hdw">' in resp):
            resp = resp.split('<span data-dobid="hdw">',1)[1].split("</span>",1)[0]
            print(word,",",resp.count("·")+1)
            d_fd.write(word+","+str(resp.count("·")+1)+'\n')
            continue
        elif ('<div data-hveid="20">' in resp):
            resp = resp.split('<div data-hveid="20">',1)[1].split(">",1)[1].split("<",1)[0]
            print(word,",",resp.count("·")+1)
            d_fd.write(word+","+str(resp.count("·")+1)+'\n')
            continue
        else:
            resp = ""

        # If Google does not have a syllable breakdown for the target word,
        # try HowManySyllables.com.
        if (resp == ""):
            resp = s.scrape("https://www.howmanysyllables.com/words/"+word)

        # If the response contains a certain paragraph, parse the number of
        # syllables. If not, all has failed, so record -1 for the syllable count
        if ('<p id="SyllableContentContainer">' in resp and '<span class="Answer_Red">' in resp.split('<p id="SyllableContentContainer">',1)[1]):
            resp = resp.split('<p id="SyllableContentContainer">',1)[1].split('<span class="Answer_Red">',1)[1].split('</span>',1)[0]
            print(word,",",resp.count("-")+1)
            d_fd.write(word+","+str(resp.count("-")+1)+'\n')
        else:
            print(word,",",-1)
            d_fd.write(word+","+str(-1)+'\n')

    # Close all files
    s_fd.close()
    d_fd.close()
    c_fd.close()

    # Append the temporary syllable dictionary to the actual
    # syllable dictionary, then remove the temp file.
    s_fd = open("./syllable_dictionary.txt.bak", "r")
    d_fd = open("./syllable_dictionary.txt", "a")
    for i,line in enumerate(s_fd):
        d_fd.write(line)

    # Close files
    s_fd.close()
    d_fd.close()

    # Cleanup
    remove("./syllable_dictionary.txt.bak")
    del s

It took almost two days to get through A and B, at just over 25,000 words. I did not have the patience to spend a month working through the other 90% of the dataset, so I made some changes for the rest of the alphabet.

First, I split up the 235,886-line wordlist by letter. This gave me twenty-six sub-wordlists: a.txt, b.txt, and so on.

31
# Method: SplitUpWordlist
# Purpose: Break wordlist into sub-wordlists by letter of alphabet
# Parameters: none.
def SplitUpWordlist():
    # Open the wordlist
    wordlist_fd = open("/usr/share/dict/words", "r")

    # Keep track of which letters have their own output file
    alphabet = []

    # Enumerate the wordlist. For each letter, create a new wordlist file in
    # the form {letter}.txt, that contains all words starting with {letter}.
    for i,line in enumerate(wordlist_fd):
        # Transform the wordlist's word
        line = line.strip().lower()

        # If the script has not processed {letter} yet, close the previous
        # output file, create {letter}.txt, and start writing words to it.
        if (line[0] not in alphabet):
            try:
                d_fd.close()
            except NameError:
                pass # No output file open yet; this is the first letter
            alphabet.append(line[0])
            open("./"+line[0]+".txt", "w").close()
            d_fd = open("./"+line[0]+".txt", "a")
            d_fd.write(line+'\n')
        else:
            d_fd.write(line+'\n')

    # Close the files
    d_fd.close()
    wordlist_fd.close()

It would have taken me another twenty-six days to work through those twenty-six wordlists, so I decided to put my quad-core processor to work with multiprocessing. Python's handy multiprocessing module would farm each sub-wordlist out to its own worker process, letting me use all eight of my logical processors to make quick work of this task. BuildSyllableDictionaryWithMultiprocessing handed a sub-wordlist to each worker. BuildDictionary enumerated the sub-wordlist to build the syllable dictionary. DownloadSyllable did the web scraping to capture the syllable count.

from multiprocessing import Pool   # worker pool for parallel scraping
from os import listdir, remove     # enumerate wordlists, delete temp files
from os.path import exists         # check for existing syllable dictionaries

# Method: BuildSyllableDictionaryWithMultiprocessing
# Purpose: Enrich wordlist with syllables from dictionary, with multiprocessing
# Parameters: none.
def BuildSyllableDictionaryWithMultiprocessing():
    # Find all source files of the form {letter}.txt
    files = [x for x in listdir("./") if len(x) == 5]

    # Use multiprocessing to speed up capturing syllables for entire dictionary
    with Pool() as pool:
        pool.map(BuildDictionary, files)

# Method: BuildDictionary
# Purpose: Capture syllables for all words in a wordlist
# Parameters:
# - tgt: Source wordlist (String)
def BuildDictionary(tgt):
    # Clear an interim output file, then open for appending
    open("./interim_" + tgt, "w").close()
    d_fd = open("./interim_" + tgt, "a")

    # Open the source wordlist
    s_fd = open("./" + tgt, "r")

    # Create the syllable dictionary if it does not exist, then open it
    if (not exists("./syllable_" + tgt)):
        open("./syllable_" + tgt, "w").close()
    c_fd = open("./syllable_" + tgt, "r")

    # Enumerate the wordlist.
    for i,word in enumerate(s_fd):
        # Convert wordlist word to lowercase and strip newline
        word = word.lower().strip()

        # Read a line from the syllable dictionary
        comp = c_fd.readline().split(",")[0].lower().strip()

        # If the word in the wordlist has already been added to the syllable
        # dictionary, skip it.
        if (comp == word):
            continue

        # If the word has not been added to the syllable dictionary, capture
        # the syllable count and append the word and that number to the dict.
        d_fd.write(word + "," + str(DownloadSyllable(word)) + '\n')

    # Close all files
    d_fd.close()
    s_fd.close()
    c_fd.close()

    # Append the temporary syllable dictionary to the actual
    # syllable dictionary, then remove the temp file.
    s_fd = open("./interim_" + tgt, "r")
    d_fd = open("./syllable_" + tgt, "a")
    for i,line in enumerate(s_fd):
        d_fd.write(line)

    # Close files
    d_fd.close()
    s_fd.close()

    # Delete temp file
    remove("./interim_"+tgt)

# Method: DownloadSyllable
# Purpose: Capture syllables for single word
# Parameters:
# - word: Target word (String)
def DownloadSyllable(word):
    # Create a new instance of the Scraper class. Each worker process needs
    # its own instance, since the processes do not share memory.
    s = Scraper()

    # Query Google for the definition of the wordlist's word. Depending on
    # the user agent string, the number of syllables may be in a span or
    # div. If neither element is in the response, try another source.
    resp = s.scrape("https://google.com/search?q=define%20"+word)
    if ('<span data-dobid="hdw">' in resp):
        resp = resp.split('<span data-dobid="hdw">',1)[1].split("</span>",1)[0]
        print(word,",",resp.count("·")+1)
        return resp.count("·")+1
    elif ('<div data-hveid="20">' in resp):
        resp = resp.split('<div data-hveid="20">',1)[1].split(">",1)[1].split("<",1)[0]
        print(word,",",resp.count("·")+1)
        return resp.count("·")+1

    # If Google does not have a syllable breakdown for the target word,
    # try HowManySyllables.com.
    resp = s.scrape("https://www.howmanysyllables.com/words/"+word)

    # If the response contains a certain paragraph, parse the number of
    # syllables. If not, all has failed, so return -1 for the syllable count
    if ('<p id="SyllableContentContainer">' in resp and '<span class="Answer_Red">' in resp.split('<p id="SyllableContentContainer">',1)[1]):
        resp = resp.split('<p id="SyllableContentContainer">',1)[1].split('<span class="Answer_Red">',1)[1].split('</span>',1)[0]
        print(word,",",resp.count("-")+1)
        return resp.count("-")+1
    else:
        print(word,",",-1)
        return -1

Within a few minutes, Google blocked me. I gave cloud infrastructure a try, but Google just kept blocking my DoS scraper after about 300 connections. I had to write a new method to recover from these failures.

from os import listdir   # enumerate leftover temp files

# Method: RecoverFromError
# Purpose: Append contents of partial temp files to syllable dictionaries.
# Parameters: none.
def RecoverFromError():
    # Find all temp source files
    files = [x for x in listdir("./") if ".txt" in x and "interim" in x]

    # Append contents of partial temp files to syllable dictionary
    for tgt in files:
        # Open the files
        s_fd = open("./"+tgt, "r")
        d_fd = open("./"+tgt.replace("interim", "syllable"), "a+")

        # Do the copying
        for i,line in enumerate(s_fd):
            d_fd.write(line)

        # Close the files
        d_fd.close()
        s_fd.close()

How Many Syllables never did block me, so I just commented out the part that checked Google first and drove on. Within a day, I had syllable counts for all twenty-six sub-wordlists. FinalCheck made sure each sub-wordlist was not missing any words, CombineWordlists produced a master syllable dictionary I named webS, and CompareWordlists made sure the syllable dictionary had all the words from the original wordlist.

from os import listdir        # enumerate wordlists
from os.path import exists    # check for syllable dictionaries

# Method: FinalCheck
# Purpose: Compare wordlists to syllable dictionaries
# Parameters: none.
def FinalCheck():
    # Find all source wordlists of the form {letter}.txt
    files = sorted([x for x in listdir("./") if ".txt" in x and len(x) == 5])

    # Compare each wordlist to its syllable dictionary, to make sure syllable
    # dictionary has all words from the source wordlist.
    for tgt in files:
        # If the syllable dictionary does not yet exist, notify the user.
        if (not exists("./syllable_"+tgt)):
            print("File '%s' does not exist." % ("./syllable_"+tgt))
            continue

        # Open the source wordlist and the syllable dictionary
        source_fd = open("./"+tgt, "r")
        sylls_fd = open("./syllable_"+tgt, "r")

        # Instantiate boolean to track similarity
        notTheSame = False

        # Enumerate the wordlist.
        for i,source_line in enumerate(source_fd):
            # Read a line from the source wordlist and the syllable dictionary
            source_line = source_line.strip().lower()
            sylls_line = sylls_fd.readline().split(",")[0].strip().lower()

            # If the lines do not match, notify the user and set the boolean.
            if (source_line != sylls_line):
                print("Source line %d, '%s', and syllable dictionary line, '%s', do not match." % (i, source_line, sylls_line))
                notTheSame = True

        # Close all files
        source_fd.close()
        sylls_fd.close()

        # Tell the user whether the syllable dictionary has all the words from
        # the wordlist or if they diverge.
        if (notTheSame):
            print(tgt,"and","./syllable_"+tgt+" are not the same.")
        else:
            print(tgt,"and","./syllable_"+tgt+" ARE the same.")

# Method: CombineWordlists
# Purpose: Combine disparate sub-wordlists into a single wordlist.
# Parameters:
# - name: Filename for the monolithic wordlist (String)
def CombineWordlists(name):
    # Find all sub-wordlists like syllable_{letter}.txt. Sort alphabetically.
    files = sorted([x for x in listdir("./") if "syllable_" in x and ".txt" in x])

    # Clear the master wordlist, then open for appending sub-wordlists
    open("./"+name, "w").close()
    master_fd = open("./"+name, "a")

    # Open each wordlist and write its contents to the master wordlist.
    for sub_wordlist in files:
        sub_fd = open("./"+sub_wordlist, "r")
        for i,line in enumerate(sub_fd):
            master_fd.write(line)
        # Close file
        sub_fd.close()

    # Close file
    master_fd.close()

# Method: CompareWordlists
# Purpose: Compare master wordlist with syllable dictionary.
# Parameters:
# - master: File path for the master wordlist (String)
# - syl: File path for the syllable dictionary (String)
def CompareWordlists(master, syl):
    # Open the wordlists
    master_fd = open(master, "r")
    syl_fd = open(syl, "r")

    # Enumerate the master wordlist
    for i,master_line in enumerate(master_fd):
        # Read a line from the syllable dictionary
        syl_line = syl_fd.readline()

        # Transform the lines for comparison
        master_line = master_line.lower().strip()
        syl_line = syl_line.split(",")[0].strip()

        # Compare the two lines
        if (master_line != syl_line):
            # Print the line number and the different words
            print("Line %d: master '%s' vs syllable dictionary '%s'" % (i, master_line, syl_line))
            # Exit so we can fix this line, then re-run
            break

    # Close files
    master_fd.close()
    syl_fd.close()

So how did my web scraper do? FindNoSyllables--below--told me that out of 235,886 words in the wordlist, the scraper could not find syllables for 102,461. Not great.

27
# Method: FindNoSyllables
# Purpose: Find the words for which the dictionary does not count syllables
# Parameters: none.
def FindNoSyllables():
    # Open the syllable wordlist
    s_fd = open("./webS", "r")

    # Keep track of the number of words without syllable counts
    count = 0

    # Enumerate the wordlist.
    for i,line in enumerate(s_fd):
        # Get the word and its syllable count
        line = line.split(",")
        word = line[0].strip()
        sylls = int(line[1])

        # If the word has no syllable count, notify user and update total
        if (sylls == -1):
            print(word)
            count += 1

    # Close file
    s_fd.close()

    # Print count for user
    print(count,"records found.")

A cursory look through that list, though, revealed that many appeared to be proper nouns. I checked with a new method, creatively named FindNoSyllablesAndNotProperNouns.

# Method: FindNoSyllablesAndNotProperNouns
# Purpose: Find words without syllable counts, that aren't proper nouns
# Parameters: none.
def FindNoSyllablesAndNotProperNouns():
    # Open the syllable wordlist, and the master wordlist
    s_fd = open("./webS", "r")
    w_fd = open("/usr/share/dict/words", "r")

    # Keep track of the number of words without syllable counts that are
    # not proper nouns.
    count = 0

    # Enumerate the wordlist.
    for i,line in enumerate(s_fd):
        # Get the word and its syllable count
        line = line.split(",")
        word = line[0].strip()
        sylls = int(line[1])

        # Get the word from the master wordlist, which preserves capitalization
        m_word = w_fd.readline().strip()

        # If the word has no syllable count, see if it's a proper noun
        if (sylls == -1):
            if (not m_word[0].isupper()):
                print(word)
                count += 1

    # Close files
    s_fd.close()
    w_fd.close()

    # Print count for user
    print(count,"records found.")

85,524 left. By checking for proper nouns alone, I ruled out 16,937 words that, of course, will not be in a dictionary. What about the rest? Many are words with prefixes and suffixes whose roots exist in the dictionary, but that do not exist in their present form. Take "absenteeship", for example. Others are foreign words like "haori", which is a traditional Japanese jacket. "Pharyngotonsillitis" is a medical condition. Are these words? Yes. Am I concerned about testing my syllable counter against them? No. Whether it counts seven syllables in pharyngotonsillitis or not, I will sleep just fine at night.


As most projects do, this one started small--and then it got out of hand. Proofer has not gotten any faster, but now I have a syllable dictionary to test the syllable count method so that I can then write a new method optimized for speed and memory use. Maybe, at the end of all this, Proofer will be a bit faster. That's ... something. Right?

For those of you who stuck around, you can download the syllable dictionary here.
