Redundancy, self checking and fail-over

krypton_john · Jan 19, 2009

Hi

Just wondering if anyone has any experience of implementing a redundant fail-over picaxe based system? I'd be interested to hear how it can be done. In particular I wonder how a picaxe could be monitored to ensure it is doing its job properly, getting valid readings from its sensors etc.

Anyone?

Cheers
JohnO

hippy · Jan 19, 2009

Some info here to save repeating things ( it's amazing how quickly posts drop down the thread index ) ...

http://www.picaxeforum.co.uk/showthread.php?t=11374

Redundancy is somewhat more complicated than simple failure detect. The usual approach is to have voting on multiple sources and make a best guess at which is telling the truth. Each source would have its program written by a different team to prevent them all from exhibiting the same accidental errors.

Checking for valid readings is a case of asking: is it within the expected range, and is it within the expected range of where it was last time? Except for nuclear annihilation one wouldn't expect a room thermometer to jump 100°C. Two sensors should give a similar reading, and sensor results should complement each other; it shouldn't be 'dry' while rain flow is measuring as a bucket-load per second.
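A minimal Python sketch of those checks, not PICAXE code. The function names, limits and tolerances are made up for illustration; real values depend on the sensor.

```python
def plausible(reading, last, lo, hi, max_step):
    """True if reading is inside [lo, hi] and within max_step of the last one."""
    if not (lo <= reading <= hi):
        return False  # outside the expected range
    if last is not None and abs(reading - last) > max_step:
        return False  # jumped too far since the previous reading
    return True


def sensors_agree(a, b, tolerance):
    """True if two sensors measuring the same thing give similar readings."""
    return abs(a - b) <= tolerance


# Assumed room thermometer limits: -10..50 degrees C, at most 5 degrees per sample.
print(plausible(21.5, 21.0, -10, 50, 5))   # small change, in range
print(plausible(121.0, 21.0, -10, 50, 5))  # out of range and a huge jump
```

A reading that fails either test would be discarded or flagged rather than acted on.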

Buzzword bingo aficionados would probably conjure up, "is it working within its dynamically variable envelope of expected and predictable operation".

The "Byzantine Generals Problem" will be interesting reading via Google.

leftyretro · Jan 20, 2009

There was lots of money and time spent on implementing redundancy in control systems for the oil refinery I worked in. It can be quite complex and never 100% foolproof, but with downtime costing $1M a day for many of the larger plants it was foolish not to utilize as much redundancy as possible. On critical equipment safety systems and automatic plant shutdown systems there would be redundancy out to the very sensors (2 out of 3 voting) and different routing of cabling etc.

One lesson we learned was that one can't claim redundancy if the system can't detect a failure in one of the redundant members. That is to say, if you were using two 24 VDC power supplies and diode ORing them together to give you a redundant 24 VDC power source, and one supply failed without the system having the means to automatically detect and alarm on that condition, then one is set up for a system failure when the now single working power supply eventually fails.
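As a rough Python sketch of that point, assuming each supply's voltage can be measured on its own side of the ORing diodes. The function name and the 22 V healthy threshold are assumptions for illustration only.

```python
def supply_status(v_a, v_b, v_min=22.0):
    """Classify a pair of diode-ORed supplies measured before the diodes."""
    ok_a = v_a >= v_min
    ok_b = v_b >= v_min
    if ok_a and ok_b:
        return "redundant"
    if ok_a or ok_b:
        return "degraded"  # load is still powered, but this must raise an alarm
    return "failed"


print(supply_status(24.1, 0.3))
```

The "degraded" state is the one that matters: the load sees nothing wrong, so only the monitor can tell that the redundancy has gone.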

So it's a matter of analyzing a system, finding all the single point failure modes and figuring out a redundant function for each. Sometimes it took a formal FMEA (failure modes and effects analysis) study to identify and validate system design.

Lefty

Dippy · Jan 20, 2009

Jeez, I hate jargon, but it impresses small boys.

All of the above -
plus in the Nuclear industry many of the detectors were in clusters of 3 and the 2-out-of-3-in-agreement principle was used.
If one sensor was repeatedly in disagreement then it could be spotted quickly and replaced, without recourse to the blue-sky-thinking and other cringeworthy phrases. It also meant that the redundancy gave a bit of extra time to fix the detector failure without immediately stopping the system.
(i.e. having a spare one gave a bit of breathing space - apologies for the lack of clever jargon)
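A small Python sketch of 2-out-of-3 voting with a running count of which sensor keeps disagreeing. The function name, tolerance and sample values are invented for illustration, and this is only one simple way to do it.

```python
def vote_2oo3(a, b, c, tolerance):
    """Return (value, odd_index) for three redundant readings.

    value is the mean of the first pair found to agree within tolerance,
    or None if no two agree. odd_index is the index of the sensor that
    disagrees with that pair, or None if all three agree.
    """
    readings = [a, b, c]
    for i, j, k in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]:
        if abs(readings[i] - readings[j]) <= tolerance:
            value = (readings[i] + readings[j]) / 2
            odd = k if abs(readings[k] - value) > tolerance else None
            return value, odd
    return None, None


# Count disagreements so a repeatedly wrong sensor can be spotted and replaced.
strikes = [0, 0, 0]
for sample in [(100.0, 101.0, 150.0), (100.2, 100.9, 149.0)]:
    value, odd = vote_2oo3(*sample, tolerance=2.0)
    if odd is not None:
        strikes[odd] += 1
print(strikes)
```

The system carries on using the agreed value while the strike count shows which detector needs attention.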

None of our reactors blew up so it must have worked.

jglenn · Jan 20, 2009

In the old days, when CPUs were 40 pin DIPs, and memory chips held 2K words, we sometimes rolled our own watchdogs. The classic circuit is a capacitor with a pull up resistor, set to charge in about 10 ms, or whatever watchdog interval you want.

A transistor or comparator causes a reset pulse to be emitted when the cap charges, resetting the CPU. The program has to output a pulse every 1 ms or so, which turns on a transistor that shorts out the cap. So if it gets to charge, the program has failed.
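For sizing the resistor and capacitor, the time for a cap charging from 0 V through a pull up to reach a given fraction of the supply is t = -R*C*ln(1 - fraction). A quick Python sketch; the 100k and 100nF values and the trip thresholds are assumptions chosen to land near 10 ms, not values from the original circuit.

```python
import math


def charge_time(r_ohms, c_farads, threshold_fraction):
    """Seconds for an RC network to charge from 0 V to threshold_fraction of supply."""
    if not 0.0 < threshold_fraction < 1.0:
        raise ValueError("threshold_fraction must be between 0 and 1")
    return -r_ohms * c_farads * math.log(1.0 - threshold_fraction)


# Assumed parts: 100k pull up and 100nF cap, so R*C is 10 ms.
print(charge_time(100e3, 100e-9, 0.632))  # trip point at about 63% of supply
print(charge_time(100e3, 100e-9, 0.5))    # trip point at half supply
```

With the trip point at about 63% of the supply the timeout is close to R*C itself; a half-supply trip point gives roughly 0.69 times R*C.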

Of course, there are weird software failures where the pulses still come out, but the program has failed.

Guess you have to be clever.

hippy · Jan 20, 2009

One thing to check for is activity, rather than lack of it. Thus a capacitor that charges up to flag an error doesn't work if a failure is holding it low, but a capacitor which is charged up and discharged down alternately about some nominal voltage, with a comparator checking it remains within a valid range, is an improvement. That still wouldn't catch a failure which was rapidly switching between charge and discharge, so the next step is to check the rate at which changing from charging to discharging occurs.

Even easier with a digital system - toggle an output say every 10ms, and have something else monitor that, checking that a toggle always occurs within 9ms to 11ms of the last change to cater for a bit of discrepancy. If that's not happening, something has gone wrong.
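The monitoring side of that could look something like this Python sketch. The class name and the idea of polling with a millisecond timestamp are assumptions for illustration; on real hardware this would be a second chip or a timer circuit.

```python
class ToggleMonitor:
    """Checks that a monitored output toggles 9ms to 11ms after its last change."""

    def __init__(self, min_ms=9.0, max_ms=11.0):
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.last_level = None
        self.last_change_ms = None

    def sample(self, level, now_ms):
        """Feed one sample of the output; returns True while it looks healthy."""
        if self.last_level is None:
            # First sample just starts the clock.
            self.last_level = level
            self.last_change_ms = now_ms
            return True
        elapsed = now_ms - self.last_change_ms
        if level != self.last_level:
            # A toggle: it must be neither too early nor too late.
            self.last_level = level
            self.last_change_ms = now_ms
            return self.min_ms <= elapsed <= self.max_ms
        # No toggle yet: fail once the window has closed.
        return elapsed <= self.max_ms


monitor = ToggleMonitor()
for level, t in [(0, 0), (1, 10), (0, 20), (1, 23)]:
    print(t, monitor.sample(level, t))
```

Checking the lower bound as well as the upper one is what catches an output that is toggling too fast, not just one that has stopped.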

A good security alarm system doesn't rely on simple open-close loops; it usually monitors a voltage, so a loop going open or short-circuit will get detected.
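A rough Python sketch of the idea, assuming the loop is wired with a resistor so that a healthy loop reads somewhere near mid-scale on a 10-bit ADC. The function name and both thresholds are invented for illustration.

```python
def loop_state(adc, short_below=200, open_above=800):
    """Classify a 10-bit ADC reading (0 to 1023) taken from a monitored loop."""
    if adc < short_below:
        return "short-circuit"
    if adc > open_above:
        return "open-circuit"
    return "normal"


for reading in (510, 1020, 5):
    print(reading, loop_state(reading))
```

Because the healthy state is a voltage in the middle rather than at either rail, both a cut wire and a shorted wire show up as faults.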

Dippy · Jan 20, 2009

"Toggle an output say every 10ms"
- quite so, and this type of 'watchdog' has been in existence for a 1000 years.
It is just so easy that I really don't know what all the fuss is about.

Simple 555 based 'missing pulse' detection is easy. And there are numerous other ways to do this with ics or discretes.
