Scratchpad write/read time?

WhiteSpace · May 30, 2021

I’m just working out the quickest and most efficient way to arrange and send a series of bytes via hserout/infrared. The series consists of 80 pairs of bytes. I started off by sending each pair as it was created, then moving on to create and send the next pair. I then tried saving each pair to scratchpad, then reading the entire series and sending it as a block. Doing it this way adds about 600ms to the cycle. At 160 bytes, writing and reading, that makes a little under 2ms for each write/read to/from scratchpad. Is that more or less what you would expect, or should I look elsewhere for the reasons for the additional time? I hadn’t expected it to make quite so much difference. I’m using a 28x2 at em64. Thanks very much.

AllyCat · May 30, 2021

Hi,

You haven't said what baud rate you're using, but I wouldn't expect either SEROUT or IROUT to be particularly fast (doesn't IROUT switch to a 4 MHz clock?). However, we (or at least I) need examples of the exact code structure to judge how efficient (fast) it is (or isn't). Sometimes you can get an indication of the probable speed from the number of bytes created by the PE (reported in the Syntax Check). Fewer bytes should be faster for a similar structure, but of course a loop will generally use fewer bytes yet be slower than "In-Line" code.

As you've mentioned pairs of bytes, note that the WORD operator does NOT increase the speed, and even splitting the word as pairs of bytes may be slower (create more program bytes) than a simple In-Line list of POKES. I would expect @PTRINC and @BPTRINC, and of course HSEROUT (and HI2COUT when appropriate), to be the fastest method of transferring data. It's probably not realistic to transfer 160 bytes using entirely In-Line code (but perhaps not impossible), so maybe creating and sending multiple bytes within a single loop would be fast enough. I've never tested X2s, but I'd expect to be able to transmit individual bytes in less than 1 ms each at 64 MHz.

Cheers, Alan.

WhiteSpace · May 31, 2021

Thanks Alan - I'm not sure that my code is in a shape that can usefully be shared in full at the moment! I'm using baud rate 2400, because I think that any faster will be too fast for the 33kHz carrier frequency for the IR - if I understand correctly, the receiver needs around 10 cycles per burst in order to differentiate signal from noise. The HSerout is the same in both cases. The Picaxe calculates two bytes: one range and one step number for a stepper motor rotating a scanner across 120 degrees. In my original code, after calculating each pair of bytes, it sends them immediately, together with a qualifier to allow them to be identified in the right order at the other end:

[Calculate range byte and b12 which is the step number (converted to an angle for trigonometry at the receiver end)]

Rich (BB code):

hserout 0, (CheckByte) ; this is the "qualifier" of 255 that acts as a marker for the two real bytes

hserout 0, (Range) ; sends range as a value up to 180, as 1/10 of the range in mm

;pause 100

hserout 0, (b12)

In the second version, it calculates range and step number in the same way, but instead of sending each pair immediately, it sends them to scratchpad. Once it has done 20 steps, it retrieves and sends the 40 bytes (plus the qualifier):

Rich (BB code):

put ptr, RangeByte, b12
; SerTXD (#ptr, ", ", #RangeByte, ", ", #b12, CR, LF)
inc ptr
inc ptr

Rich (BB code):

ptr = 300

for b42 = 40 to 59
    hserout 0, (CheckByte)
    hserout 0, (@ptrinc)
    hserout 0, (@ptrinc)

    ;SerTXD ("Sent: ", #Checkbyte, ", ", #@ptrdec, ", ", #@ptrinc, CR, LF)

next b42

Both methods work, but the second is noticeably slower. I assume that it's because it takes a millisecond or two to send and retrieve each pair of bytes from the scratchpad? Thanks

AllyCat · May 31, 2021

Hi,

WhiteSpace said:
I'm using baud rate 2400, because I think that any faster will be too fast for the 33kHz carrier frequency for the IR -

Sending bytes at 2400 baud will take at least 10 bits / 2400 baud = 4.2 ms per byte, which is likely to dominate the processing time. Each bit lasts 0.42 ms, which gives ~ 0.42*33 = 13.8 carrier cycles per bit. But the IR modulation rate is normally 40 kHz which would give 0.42*40 = 16.8 cycles, so it could be worth trying 4800 baud, or with HSEROUT you could use a non-standard baud rate such as 4000 baud (at both ends of course) to give exactly 10 cycles per bit.
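As a rough check of those figures, here is a short Python sketch (Python is used only for the arithmetic, it is not PICAXE code). It assumes a 10-bit frame (start bit, 8 data bits, stop bit); the function names are made up for illustration.

```python
def byte_time_ms(baud, bits_per_frame=10):
    """Time on the wire for one byte, in milliseconds (assumes 10-bit frames)."""
    return bits_per_frame * 1000 / baud


def carrier_cycles_per_bit(carrier_hz, baud):
    """Number of IR carrier cycles that fit into one serial bit."""
    return carrier_hz / baud


for baud, carrier_hz in [(2400, 33_000), (2400, 40_000), (4800, 40_000), (4000, 40_000)]:
    print(
        f"{baud} baud, {carrier_hz // 1000} kHz carrier: "
        f"{byte_time_ms(baud):.2f} ms per byte, "
        f"{carrier_cycles_per_bit(carrier_hz, baud):.2f} cycles per bit"
    )
```

At 4800 baud with a 40 kHz carrier this gives fewer than 10 cycles per bit, which is why 4000 baud is suggested as the "exact" fit.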

However, I suspect the reason that the scratchpad makes matters worse is because it's just adding additional instructions at the "wrong" time. The "under the hood" functionality of HSEROUT is probably:
1. Check that the HSEROUT Hardware Buffer is Empty (and wait until it is)
2. Put the Byte into the Hardware Buffer (which is probably quite fast)
3. { Time available to do other things }

In your Scratchpad code there are three consecutive HSEROUTs, so after the first and second the program must simply wait (about 4 ms) until the buffer is empty again. Even the FOR ... NEXT (or strictly the NEXT .. FOR) should take only around 100 us at 64 MHz, so most of the time is just being "wasted". You haven't shown what the program does between the HSEROUTs in the "main" part of the program, but I would expect the optimum speed to occur if the program were structured as:

Code:

FOR ......
     HSEROUT 0 , (checkbyte)
'     Calculate "Range"    (ideally within about 4ms)
     HSEROUT 0 , (Range)
'    Calculate "b12"     (ideally within about 4ms)
     HSEROUT  0 , (b12)
'     Do any other "housekeeping" for up to 4 ms
NEXT

In 4 ms, at 64 MHz, I would expect the PICaxe to be able to execute up to about 50 "instructions".
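To illustrate why the ordering matters, here is a small Python timing model (a sketch, not a measurement). It assumes HSEROUT behaves as described above: it waits until the transmitter is free, loads one byte, and returns at once. The 4 ms "work" slices are invented numbers chosen only to show the effect.

```python
BYTE_TIME_MS = 10 * 1000 / 2400  # 10 bits per byte at 2400 baud, about 4.17 ms


def loop_time_ms(schedule, iterations):
    """Total run time for a loop body made of ("send",) and ("work", ms) steps.

    Assumed model: a send blocks until the previous byte has left,
    then starts the new byte and returns immediately.
    """
    now = 0.0
    tx_free_at = 0.0
    for _ in range(iterations):
        for step in schedule:
            if step[0] == "send":
                now = max(now, tx_free_at)       # wait for the transmitter
                tx_free_at = now + BYTE_TIME_MS  # byte goes out in the background
            else:
                now += step[1]                   # ordinary program work
    return now


# Three sends back to back, then all the work (12 ms in total).
back_to_back = [("send",), ("send",), ("send",), ("work", 12.0)]
# The same sends and the same work, interleaved.
interleaved = [("send",), ("work", 4.0), ("send",), ("work", 4.0), ("send",), ("work", 4.0)]

print(f"back to back: {loop_time_ms(back_to_back, 100):.0f} ms")
print(f"interleaved:  {loop_time_ms(interleaved, 100):.0f} ms")
```

In this model the interleaved loop is limited only by the byte time, while the back-to-back loop pays for the waiting and the work separately.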

Cheers, Alan.

inglewoodpete · May 31, 2021

Many commands in PICAXE BASIC are actually macros, made up of the firmware's fundamental "building blocks".

Code:

inc ptr
inc ptr

will be compiled as

Code:

ptr = ptr + 1
ptr = ptr + 1

I'm not sure how many tokens that consumes, but every command has an "unpack, interpret and execute" overhead to be performed in the chip's firmware. So the following is likely to execute faster:

Code:

ptr = ptr + 2

Secondly, using the scratchpad to store data for transmission via a serial port (hardware or firmware based) may come with an additional overhead too. The scratchpad is a more recent concept in PICAXEs and the original set of tokens may have needed to be extended to accommodate the new commands. Storing and retrieving data in the scratchpad could take more tokens than other methods, like high RAM. More tokens = more time to unpack and execute. Data reception to the scratchpad is likely to be quite efficient, since received data is automatically placed in the scratchpad by the firmware, although the receiving chip must still retrieve and process the data.

Getting the absolute best performance from a PICAXE is a dark art. What you need to do is work out what command combinations (that produce the same results) take the least space when compiled, as well as observing the resultant speed when a command or command sequence is repeated many times.

Finally, you will find that the M2 chips will perform slightly better than the X2s, for the same clock speed. The M2s use a 5-bit token size where the X2s use 6-bit tokens. (That is one of the reasons why the M2s only have 28 byte-registers when the X2s have 56.)

WhiteSpace · Jun 1, 2021

inglewoodpete said:
Getting the absolute best performance from a PICAXE is a dark art.

...as I am increasingly realising!

Thanks @AllyCat and @inglewoodpete for the suggestions and insight. I've tried various permutations.

The starting point, sending the 80 pairs of bytes (range and step number) to scratchpad and then retrieving and sending them via hserout/IR, took 25.43s for 10 sweeps of the scanner. Taking out the hserout instructions, so just getting range and step data, then sending them to scratchpad, took 14.61 seconds for 10 sweeps. So the retrieving and sending was taking about a second of each 2.5s sweep (which makes sense if each of the 240 bytes takes 4.2ms). Using RAM instead of scratchpad shaved off about 22ms from each sweep. I then tried:

hserout 0, (CheckByte)
hserout 0, (@bptrinc, @bptrinc)

to see whether reducing the number of hserout commands also reduced the buffering effect that Alan refers to above. It didn't make any difference.
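A quick cross-check of the "about a second per sweep" figure above, as a Python sketch using the two timings quoted. It assumes 10 bits per byte on the wire at 2400 baud and three bytes (CheckByte, range, step) for each of the 80 pairs.

```python
SWEEPS = 10
WITH_SEND_S = 25.43      # 10 sweeps, scratchpad plus hserout
WITHOUT_SEND_S = 14.61   # 10 sweeps, hserout removed

BYTES_PER_SWEEP = 80 * 3          # CheckByte + range + step for each pair
BYTE_TIME_S = 10 / 2400           # assumed 10-bit frame at 2400 baud

measured_per_sweep = (WITH_SEND_S - WITHOUT_SEND_S) / SWEEPS
predicted_per_sweep = BYTES_PER_SWEEP * BYTE_TIME_S

print(f"measured cost of sending:  {measured_per_sweep:.3f} s per sweep")
print(f"pure transmission time:    {predicted_per_sweep:.3f} s per sweep")
```

The two numbers come out within about 0.1s of each other, which suggests that nearly all of the extra time is the serial transmission itself rather than the scratchpad access.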

I then rearranged the code so as to eliminate one of the subroutines in the range calculation, and eliminated sending the data to RAM/scratchpad before sending it. So it goes back to the method first described above, where it gets range and step number, and sends them immediately. That reduced the number of compiled bytes from 908 to 815, and reduced the time from 25.27s to 21.29s, or from about 2.5s per sweep to about 2.1s. The final step was sending the CheckByte before getting and calculating the range and step data, as suggested by Alan. This brought the total time down to 20.89s, or just under 2.1s per sweep.

Rich (BB code):

hserout 0, (CheckByte)
gosub GetRange
hserout 0, (RangeByte, b12)

Changing the baud rate to 4800 brought the total time down to 1.7s per sweep, but it resulted in a degradation of the scan quality. I'll need to do some more work to understand whether that can be recovered with a faster carrier frequency.

It's interesting to see how quite small changes in the code can make such a difference. I'll be more aware of that in future. Thanks again.

AllyCat · Jun 1, 2021

Hi,

WhiteSpace said:
... Taking out the hserout instructions, so just getting range and step data, then sending them to scratchpad, took 14.61 seconds for 10 sweeps.
...... This brought the total time down to 20.89s, or 2.1s per sweep.

In principle, you should be able to get the overall time down to about 14.6 seconds even with the HSEROUTs, because they should be performed by Hardware as a parallel (background) task. With a true "High Level" language (where the Hardware and Software are almost entirely separated by the Operating System), the User (Programmer) would simply "throw" characters at a buffer and the OS (Interrupts) would send out the characters when time is available. But AFAIK the PICaxe Basic Interpreter doesn't work that way, and a quick look at the X2 data suggests that Serial Transmit Interrupts are not supported at all?

Basically, you need to divide the program into modules (sections) which each execute in between about 4 and 8 ms. Simple instructions should execute in about 50 - 100 us each (at 64 MHz) but subroutine calls take several times longer and some complex instructions may take very much longer, for example a SEROUT (one byte at 2400 baud) is about 4.5 ms and a DEBUG is ~120 ms! That's why we (or at least I) need to see some actual code to estimate if it can be made faster.

In principle you then need to put a single HSEROUT instruction (one byte) between each section of the program. The bytes transmitted might correlate directly with the program execution, but if necessary you could emulate an interrupt-driven type of system. To do that, send each byte to a buffer (as you have tried) when it is created and advance an "input" pointer (variable) to the next (empty) byte. Then, between each section of the main program, compare an "output" pointer with the "input" pointer and, if different, read a byte, HSEROUT it, and advance the pointer. But if any of the sections of the main program may take less than about 4 ms (i.e. the HSEROUT character time) then first confirm that the hardware buffer is empty (which probably needs a PEEKSFR command), otherwise skip the HSEROUT on that occasion.
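The two-pointer scheme described above can be sketched in Python to show the logic (this is only a model, not PICAXE code). The class name, the `tx_idle` flag and the `send` callback are hypothetical stand-ins: on the chip, `tx_idle` would come from checking the hardware buffer and `send` would be a single-byte HSEROUT. The sketch does not guard against the buffer overflowing.

```python
class TxQueue:
    """Circular buffer with an "input" pointer and an "output" pointer."""

    def __init__(self, size):
        self.buf = [0] * size
        self.size = size
        self.in_ptr = 0    # next empty slot
        self.out_ptr = 0   # next byte waiting to be sent

    def put(self, byte):
        """Store a byte when it is created and advance the input pointer."""
        self.buf[self.in_ptr] = byte
        self.in_ptr = (self.in_ptr + 1) % self.size

    def service(self, tx_idle, send):
        """Call between program sections; sends at most one byte."""
        if self.in_ptr == self.out_ptr:
            return False   # pointers equal: nothing waiting
        if not tx_idle:
            return False   # transmitter still busy: skip this time
        send(self.buf[self.out_ptr])
        self.out_ptr = (self.out_ptr + 1) % self.size
        return True


sent = []
queue = TxQueue(8)
for value in (255, 120, 40):       # e.g. CheckByte, range, step number
    queue.put(value)

queue.service(True, sent.append)   # sends 255
queue.service(False, sent.append)  # transmitter busy, nothing sent
queue.service(True, sent.append)   # sends 120
queue.service(True, sent.append)   # sends 40
queue.service(True, sent.append)   # queue empty, nothing sent
print(sent)
```

Because `service` never waits, the main program keeps running at its own pace and the bytes simply leave whenever the transmitter is free.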

Another advantage of HSEROUT is that you can use a custom Baud Rate. In the Simulator, try SERTXD(#B2400_64," ",#B4800_64) which should show the delay count for those two baud rates (6666 and 3333). Thus you can calculate the number for any intermediate baud rate, which can be inserted directly into the command, or defined as a custom constant: symbol B4000_64 = 4000 (that the count matches the baud rate here is mainly a coincidence). Note that this does not work with SEROUT.
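A minimal Python sketch of that calculation, assuming the delay count is simply inversely proportional to the baud rate, as the two quoted values suggest. The firmware's exact formula may differ by a count or so, so treat the result as a starting point and check it against the manual. `count_for` is a made-up helper name.

```python
REF_BAUD = 2400
REF_COUNT = 6666   # value quoted above for B2400_64


def count_for(baud):
    """Estimate the 64 MHz delay count for a baud rate (assumed inverse scaling)."""
    return round(REF_COUNT * REF_BAUD / baud)


for baud in (2400, 4000, 4800):
    print(baud, count_for(baud))
```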

Cheers, Alan.

WhiteSpace · Jun 2, 2021

Thanks again Alan. I've now managed to get it down to 17.1s per 10 sweeps (1.7s per sweep, which is another 0.4s faster than before). I changed to a 40kHz receiver (I'll explain why I was using 33kHz when I write up the whole project later) and increased the baud rate to 4800. The real win, however, came from splitting the last HSEROUT into two parts separated by another piece of code as you suggested. I was scratching my head for some time about how to do that, because the two bytes result from the same calculation during the same step of the stepper motor/scanner. But then, somewhat counterintuitively, I moved the final HSEROUT after the next step of the stepper motor, but before the Picaxe gets and calculates the next range/angle pair. So the HSEROUT sends the step number for the previous step. The details are obviously quite specific to my project, but if anyone else is similarly looking to optimise this kind of process for sending sequences of bytes, it's worth thinking about how to apply the same principle. The code now looks like this:

Rich (BB code):

Do

    ;The sequence is as follows
    ;4 quadrants - 60 degrees left (10 o'clock) to 60 degrees right (2 o'clock)
    ;L to R, quadrants 1 to 4
    ;start at 12 o'clock

    ;> 3 (scan) > 4 (no scan) < 4 (scan) < 3 (no scan) < 2 (scan) < 1 (no scan)
    ;> 1 (scan) > 2 (no scan) back to the beginning

    let OldRightStop = 255 ; these are part of the code to detect overshoot of the scanner
    let OldLeftStop = 255

    let b23 = 0 ; marker for left or right of centre - used at line 380 or so.
    ;0 is right; 1 is left

    ;from centre (12 o'clock) CW
    ;Quadrant 3 (12 to 1)

    for b12 = 40 to 59 ; a loop of 30 degrees CW (20 x 3 steps x 0.5 degrees)

        pause 4
        setfreq em64

        readadc 13, RightStop

        If RightStop > OldRightStop then gosub RightCorrect
        Let OldRightStop = RightStop + 15 MAX 255 ; MAX because otherwise when RightStop
        ;is 255, OldRightStop wraps round to 14, which is less than 255

        hserout 0, (CheckByte)
        gosub GetRange

        hserout 0, (RangeByte)
        gosub ThreeStepsClockwise ; do step sub-procedure

        hserout 0, (b12)

    next b12 ; and then moves on to the next quadrant

Also worth noting: while looking at ways of speeding up the process of sending serial via IR, I found these IR receivers: https://www.vishay.com/docs/82667/tsdp341.pdf They are specifically designed for receiving serial data sent via IR. They need only 6 cycles of the carrier wave per bit, rather than the 10+ required by the TSOPxxxx/x Vishay IR remote receivers. That means that the 57.6kHz versions can handle baud rates of up to 9600. I've ordered a couple, and I'll report how well they work.
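The 9600 figure follows directly from dividing the carrier frequency by the minimum number of cycles per bit. A short Python sketch of that sum, using only the cycle counts mentioned in this thread (`max_baud` is a made-up helper name):

```python
def max_baud(carrier_hz, min_cycles_per_bit):
    """Highest baud rate that still leaves the receiver enough carrier cycles per bit."""
    return carrier_hz / min_cycles_per_bit


print(max_baud(57_600, 6))    # TSDP341 type receiver, 57.6 kHz version
print(max_baud(40_000, 10))   # remote-control receiver at 40 kHz
print(max_baud(33_000, 10))   # remote-control receiver at 33 kHz
```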

julianE · Jun 12, 2021

TSDP341 is completely new to me. I look forward to your work with it.
