The proper distribution of letters in spoken English???

vk6bgn

New Member
Hello All,

Here's the story... I have a Morse Code program on a 28X2 that turns out random letters in groups of 5 for practice. But what my fellow radio compadres and I have noticed is there is a small issue with something like....

Random w4
b10 = w4 // 26 + 1 ' (random numbers 1-26 are EEprom locations; the actual encoded letters are in each location)

This works fine, but the frequency of the letters comes out even. And that is not the case, as we all know, with spoken English. After reading a bit on the net, the letter E appears approximately 130 times out of every 1000 letters, and Z only 1 in 1000. So, not having made it past high school algebra some 30-plus years ago, I've taken the approach of rolling 2 dice: one with 13 sides and one with 14 sides. This should produce some sort of probability(?) or distribution(?) between 2 and 27, which will point to EEprom locations 2 - 27 where the encoded letter resides. Letters like E and T will be in locations 13 and 14, letters like Z and X will be in locations 2 and 27, and the rest of the locations will be filled in with letters based on their frequencies.
But I don't think this is the actual distribution of letters. Possibly close (a big guess?) but not exact.
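For what it's worth, the shape the two-dice scheme produces can be checked by enumerating every outcome (a quick Python sketch; the 13- and 14-sided dice are the ones from the post above):

```python
from collections import Counter

# Every equally likely outcome of one 13-sided and one 14-sided die.
counts = Counter(a + b for a in range(1, 14) for b in range(1, 15))
total = sum(counts.values())  # 13 * 14 = 182 outcomes

# The sum runs 2..27 and the shape is a trapezoid: it climbs steadily,
# is flat across the middle (sums 14 and 15), then falls off again --
# close to a bell, but not the measured English letter distribution.
for s in sorted(counts):
    print(s, counts[s], round(counts[s] / total, 3))
```

Mid-range sums come up 13 times as often as a 2 or a 27, so letters placed at the ends really are rare; the in-between letters, though, follow the trapezoid rather than the real letter frequencies.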

So, I guess my question is.... is there some other way to achieve the proper distribution of letters, since it is not "linear" like rolls of the dice (if that is even the right way to describe the distribution of dice when graphed out on paper)?

The only other thing I could think of: if there were 1000 available spare memory locations, I could fill memory location 1 with the encoded letter Z, locations 435 to 564 with the encoded letter E (130 locations) and location 1000 with the letter X, and then fill in the rest of the memory locations with the appropriate number of letters, like T = 92 locations, H = 34 locations, M = 25 locations, B = 10 locations, etc. etc., and then use the Random w4 thing from above, but with a word variable since a byte can't hold 1000, e.g. w5 = w4 // 1000 + 1
Or maybe the dice approach again to randomly point to the 1000 different memory addresses and produce the exact distribution of Morse Code practice letters??? I just can't get my head around this.
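The 1000-slot idea works exactly as described and is easy to sketch in Python (the letters and counts here are a hypothetical subset, with a '*' placeholder standing in for the rest of the alphabet):

```python
import random

# Per-1000 counts for a few letters; '*' stands in for the remaining
# alphabet so the table still totals 1000 slots.
freq = {'E': 130, 'T': 92, 'A': 79, 'Z': 1}
freq['*'] = 1000 - sum(freq.values())

# Fill the table: E occupies 130 slots, Z exactly 1, and so on.
table = [letter for letter, n in freq.items() for _ in range(n)]

# One uniform draw over the 1000 slots gives the weighted letter --
# the equivalent of "random w4" plus "// 1000" pointing into EEPROM.
letter = table[random.randrange(len(table))]
```

A letter's share of slots is exactly its probability, so no clever maths is needed: the uniform draw does all the work.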

Your thoughts and comments are most appreciated.

(I actually like an even distribution of letters as it gives me more practice on letters which rarely appear, like Z and X. But some of my fellow radio brethren hate it. So I thought I would include both in my PICAXE project.)

Thanks,
HamRadioAddict
 

westaust55

Moderator
A total guess at a suggestion which may promote some workable solution:

Get a value which in theory will be random but has a linear (flat) distribution,
then apply a sinusoidal function over the range 0 to 180 degrees so the result, instead of being linear, has a "bell curve" shape to it.
Then use the value from the sinusoidal function as an index/vector to the letter.
Spread your letters out so infrequently used letters are near 0 and 180 degrees and higher-frequency letters are near the 90 degree/mid range.
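One possible reading of this idea, sketched in Python (an assumption on my part, not the only way to realise it): draw a uniform number and push it through an inverse cosine, which yields angles whose density is proportional to sin(angle), peaking at 90 degrees and vanishing at 0 and 180.

```python
import math
import random
from collections import Counter

def sine_weighted_index(n=26):
    """Index 0..n-1 with a half-sine probability shape: mid-range
    indices (the common letters) dominate, the ends are rare."""
    u = random.random()              # uniform on [0, 1)
    angle = math.acos(1 - 2 * u)     # density proportional to sin(angle)
    return min(n - 1, int(n * angle / math.pi))

# Rough check of the shape over many draws.
hist = Counter(sine_weighted_index() for _ in range(100_000))
```

With the letters ordered rare-common-rare along the index, the two middle bins come up roughly sixteen times as often as the end bins, so E and T would sit near indices 12-13 and Z and Q at the extremes.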

I would have to give this more thought myself before writing any code. Something for later, if nothing further develops from the above, while I dash out for a while . . . .
 

premelec

Senior Member
In the past I have written QBASIC programs to generate code groups of particular length and content. I could see having various groups of letters with various weights, counting how many times one in a group had appeared so far, then only allowing a character in that group to appear after that certain weight had been reached, and resetting the weight count variable. Z appears more frequently in ham communications than in plain language... :) And of course we use a lot of dropped vowels to shorten our telegraphic communications - the kids and others with small keypads also distort the language... TKS BCNU... :)
 

austfox

New Member
I'd keep it simple and generate a random number between 0 and 255, and use the select case command to determine an appropriate letter. Letters such as 'z' and 'q' would only have 1 instance of occurring in 256, whilst others such as 'e' might have 10.
Code:
	random w0
	b2=w0 // 256 'random number between 0 and 255
	select b2

	case 0 to 9			
		b3=1		'letter A
	case 10 to 13
		b3=2		'letter B
	case 14 to 19
		b3=3		'letter C ... continue for entire alphabet
	case 255
		b3=26	'letter Z
		
	end select
 

PaulRB

Senior Member
How about this:

Get your letter distribution data (where E is 130 and the total is 1000) and put this in a data/eeprom line.

Generate a random number between 0 and 999 and put this in a temp word variable.

You will also need an index byte variable to keep your current place in the distribution data.

Start at the distribution figure for A. If your temp variable is greater than or equal to that figure, reduce your temp variable by the distribution for A. Then move your index variable on to B and loop round. Eventually your temp variable will be less than the distribution value for a letter; this is where you exit the loop, and your index variable gives you the letter to send.
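That loop is straightforward to sketch in Python (the per-1000 counts below are illustrative stand-ins for the data/eeprom line, arranged A to Z):

```python
import random

# Illustrative per-1000 counts, A..Z; the real table would come from
# the data/eeprom line and must total exactly 1000.
DIST = [78, 15, 31, 36, 130, 25, 19, 62, 75, 3, 5, 34, 27,
        74, 76, 16, 2, 42, 74, 92, 28, 10, 26, 3, 16, 1]
assert sum(DIST) == 1000

def letter_for(temp):
    """Walk the table, subtracting each count from temp until temp is
    smaller than the current count; that index is the chosen letter."""
    index = 0
    while temp >= DIST[index]:   # >= so each letter claims exactly DIST[i] values
        temp -= DIST[index]
        index += 1
    return chr(ord('A') + index)

def pick_letter():
    # temp is the uniform 0..999 draw from the random word variable.
    return letter_for(random.randrange(1000))
```

Each letter ends up owning exactly as many of the 1000 possible draws as its count, which is why the comparison has to be "greater than or equal" rather than strictly "greater".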

Paul

PS. Don't forget to keep your random seed in a separate word variable used for nothing else. Otherwise you won't get a very random sequence!
 

russbow

Senior Member
I am not sure that the frequency of letters in real speech has much to do with learning Morse.

When I see a written letter, I must immediately recognise the shape and assign a sound to it that was taught to me at primary school.

When I hear a Morse sound, likewise I must recognise the sound and assign a "letter" to it.
What is most important is differentiating between the "sounds" to get the correct "letters".

If I wanted to bias my training program I would more likely group opposites - A / N , K / R and so on.

In learning, I would want to initially concentrate on a group of five - say A to E - and when they flow, move to a new group, until eventually any line would be totally random, each letter being any one of the 26.
 

g6ejd

Senior Member
I wrote an article about this in RADCOM Jan 2011 or thereabouts and explained the entropy of the character distribution, as well as I could in the space available.

You can find on-line tools that will take a document and give you the character frequency. Then I would assign 52 memory slots and, for all 26 characters, place each character in one slot (location 1) and its normalised frequency in the next, e.g. 1=A, 2=26, 3=B, 4=13, etc. (26 and 13 are wild guesses for frequency of occurrence).

Now that you have the characters and their distribution, you can randomly select a character but use the frequency to weight its occurrence or usage in your 5-character groups. This should give you realistic results: if you look at the Morse tree you will see the vowels occur more frequently than, say, a Z. Therefore, as in the real world, each 5-character group should contain a vowel, and it is likely to be more than one.
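The interleaved slot layout described above can be sketched in Python (the letters and weights are the wild-guess placeholders from the post, plus two more made up for illustration):

```python
import random

# Interleaved slots as described: character, normalised frequency,
# character, frequency, ... (weights here are placeholder guesses).
slots = ['A', 26, 'B', 13, 'E', 40, 'Z', 1]

chars = slots[0::2]    # every even slot holds a character
weights = slots[1::2]  # every odd slot holds its frequency

# Weighted random selection: frequency directly controls how often
# each character lands in a 5-character practice group.
group = random.choices(chars, weights=weights, k=5)
```

With weights like these, a vowel turns up in nearly every 5-character group, matching the "each group should contain a vowel" observation.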
 

vk6bgn

New Member
Hello All,

Thanks for some great ideas. I was pretty much stumped after the dice approach, which worked OK and gave predictable results. Thanks again.

I might experiment with the SIN function as westaust55 suggested. That might give some interesting results. Also, the suggestion by austfox seems to work very well using Select Case. I bashed this out last night and got it to run. (not all the code)

Code:
RandomLetters:
        
        Setfreq M4
        Gosub ReadTonePotentiometer
	Gosub ReadSpeedPotentiometer 


	Random w5
	w4 = w5 // 1000 + 1
	
Select Case w4   'b12 holds the encoded character for the Morse routine
                     'Frequency per  				
                     '1000 letters    Letter (encoded character in b12)
Case 1               '1
b12 = 19             '                Z
Case 2 to 4          '3
b12 = 25             '                X
Case 5 to 9          '5
b12 = 13             '                K
Case 10 to 24       '15
b12 = 17            '                 B
Case 25 to 40       '16
b12 = 29            '                 Y
Case 41 to 65       '25
b12 = 20            '                 F
Case 66 to 92       '27
b12 = 7             '                 M
Case 93 to 123      '31
b12 = 21            '                 C
Case 124 to 159     '36
b12 = 9             '                 D
Case 160 to 221     '62
b12 = 16            '                 H
Case 222 to 295     '74
b12 = 5             '                 N
Case 296 to 371     '76
b12 = 15            '                 O
Case 372 to 463     '92
b12 = 3             '                 T
Case 464 to 593     '130
b12 = 2             '                 E
Case 594 to 671     '78
b12 = 6             '                 A
Case 672 to 746     '75
b12 = 4             '                 I
Case 747 to 820     '74
b12 = 8             '                 S
Case 821 to 862     '42
b12 = 10            '                 R
Case 863 to 896     '34
b12 = 18            '                 L
Case 897 to 924     '28
b12 = 12            '                 U
Case 925 to 950     '26
b12 = 14            '                 W
Case 951 to 969     '19
b12 = 11            '                 G
Case 970 to 985     '16
b12 = 22            '                 P
Case 986 to 995     '10
b12 = 24            '                 V
Case 996 to 998      '3
b12 = 30             '                J
Case 999,1000        '2
b12 = 27             '                Q

End Select

bla bla bla (just bashes out the dits and dahs...)
Fed the 28X2 project board relay contact closure back to a serial DB9 on my computer and ran a Morse program to capture the first thousand letters sent..... 200 groups of 5. Then cut and pasted the results into Word and used "find" to count all the letters. The results were very close to what Wikipedia said they would be. The worst letter was the most frequent one, the letter E, which Wiki said would appear 130 times per 1000; it actually appeared 147 times per 1000 in the PICAXE. Other letters were off by only 1 or 2 from the predicted counts per 1000. And other letters, like Z, appeared once in 1000 just as Wiki said. Actually, many letters agreed with Wiki and were spot on in their appearance. I believe Wiki's data came from the Concise Oxford Dictionary.

Also, after further reading, letter frequencies can simply change from author to author of a book, or person to person, location to location, depending on who or where you are. An example that comes to mind: growing up in Los Angeles for the first 3-ish decades of my life, I never ever used the word "mate" in normal conversation. You just called another person by their name. For the last 20 years, living in Australia, it seems every 10th word spoken is "mate". Even I use it now. And so it appears letter frequencies can be very biased depending on who, what, and where. Well, at least that is how I see it. ;-)
 