Motor Unresponsive

I have an intermittent problem. Perhaps a few times per week, the motors on my Magni will NOT MOVE in response to commands. I can source commands using teleop-twist-keyboard or the Logitech game controller. But when this problem is happening, the motors either will not move at all, or will turn a little, in spurts, and sometimes pause for long periods. There have even been times when I have held the joystick all the way forward, and fifteen seconds later the wheels will operate normally - only to fail again five minutes later.

After a reboot, this problem is usually gone. However, some days exhibit nearly 100% failure, even after a reboot. (It's almost as statistically capricious as the weather.) Anyway, while the failure is happening, I can inspect various things in ROS (e.g., cmd_vel) and confirm that the commands are getting through.

I sometimes get the feeling that some other system on the R-Pi is interfering with the transmission of the command to the motors.

Any ideas?

Make sure the batteries are charged:

    rostopic echo /battery_state

The voltage should be 24 V or more.

Otherwise it could be a faulty motor board; please email dc@ubiquityrobotics.com

Rohan

Checking battery voltage is absolutely the right thing to do. The battery state topic gives a very rough idea of what the battery is doing; more accurate results can be obtained with a multimeter. Both very high and very low readings are a problem, as the safety system will lock out both over- and under-voltage. Battery voltage should be between 27.2 V and 21 V (although the battery state topic can be a volt or two off, so I wouldn't count on it for diagnosis - it's there to give a general impression of whether the battery is dead or not). It is important to realize that there are separate lock-outs for the motors and the computer - we designed the system so that the computer keeps gathering data even if the safety systems shut down the motors.

When the robot is operating normally, the wheels are dynamically held in place by the motor controller. This happens regardless of whether there is connectivity with the main computer. So my question for you: when the motors do not move, do the wheels lock in place or do they free-wheel? When they lock, do they exhibit the same amount of resistance as when the robot is operating normally?

Also, how do you switch the system on? Do you switch on main power and motor power simultaneously, or do you switch one and then the other? How long do you wait between them?

Thanks for the responses above.

I will repeat this one fact: When I reboot, there is a good chance this motor issue will fix itself.

Speaking with some electrical engineering insight into these common battery issues, I can assure you that I have measured ~26 V supplied from the batteries, at the two hex screws on the motherboard, while the unit is powered ON, including the motor-enable switch.

We have two Ubiquity Magnis. BOTH exhibit the same behaviors, randomly. I have ensured the batteries are properly charged as well.

As soon as I saw the intermittent motor issues, I thought of battery levels right away. I rely on measuring the value directly with a DVM. I found it a little unusual that rostopic echo /battery_state consistently reads the battery voltage a bit higher than measured, usually 2.7 V higher. (For example: this morning, I measured 25.4 V while /battery_state indicated 27.95…28.35.) (Neither reading drops at all under motor load.)
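(For the record, you can watch just that field by itself - assuming /battery_state is the standard sensor_msgs/BatteryState message, rostopic can echo a single field:)

    # Print only the reported voltage, one sample:
    rostopic echo -n 1 /battery_state/voltage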

Power-up sequence: usually I power on with the Black button ONLY. After 60-120 seconds of boot-up, I power on the Red button. It occurred to me that perhaps some circuit was not satisfied when power for the motors was sensed "off" at boot-up, so I have also tried powering up BOTH nearly simultaneously. So far, over perhaps many dozens of power cycles in these various ways, I see no association with this motor problem.

In fact, when this motor problem happens, I have even power-cycled the red motor-enable button OFF and ON again, with varying delays of 0.5-10 seconds.

At this time it is important for us that these systems be operable WITHOUT having to interact with them through a Linux desktop or via ssh. Our primary use cases now require letting a unit power up and, perhaps a minute later, being able to teleoperate it with the Logitech gamepad.

Motor-LOCK: I will look for this next time, when the issue occurs.

UPDATE ~ PID ~ Try this fun experiment:

  1. Unit powered ON, and motors responsive to Logitech controller.

  2. Turn OFF red power button. (AND leave it OFF)

  3. On the Logitech controller, press both the LT button and joystick-forward for 3 seconds, then release.

  4. Turn ON red power button.

The motors will lurch forward for a little over a second and then stop - presumably because the PID controller, having seen no movement in response to the Logitech command issued while the motors were disabled, has accumulated a large error.

This might be a bug.
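(For anyone who wants to reproduce this without the gamepad: the joystick step can be faked from the shell - a sketch, assuming the teleop publishes standard geometry_msgs/Twist messages on /cmd_vel:)

    # With the red button OFF, publish ~3 s of forward commands at 10 Hz,
    # then Ctrl-C and switch motor power back ON - the lurch should follow:
    rostopic pub -r 10 /cmd_vel geometry_msgs/Twist '{linear: {x: 0.3}}'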

I can see now that when this happens, the motor_node dies.
UNKNOWN CAUSE

How can I gracefully and properly restart the broken node(s)?
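(In the meantime, a generic ROS 1 workaround I am considering: on the stock Ubiquity image, the base stack appears to run under a systemd service - the name magni-base below is my assumption - so restarting that service should relaunch motor_node and the controller spawner together:)

    # Restart the whole base launch (service name assumed):
    sudo systemctl restart magni-base
    # Confirm the node came back:
    rosnode ping -c 1 /motor_node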

Portion of log below. (more available upon request)

BTW#1: I verified that the Logitech game controller is working and that cmd_vel contains data.
BTW#2: While the motors were unresponsive here, the motor controllers kept the wheels stationary and resisted movement.

ubuntu@cory:~$ roswtf
Loaded plugin tf.tfwtf
No package or stack in context
================================================================================
Static checks summary:

No errors or warnings
================================================================================
Beginning tests of your ROS graph. These may take awhile...
analyzing graph...
... done analyzing graph
running graph rules...
ERROR: connection refused to [http://cory.local:34465/]
... done running graph rules
running tf checks, this will take a second...
... tf checks complete

Online checks summary:

Found 2 warning(s).
Warnings are things that may be just fine, but are sometimes at fault

WARNING The following node subscriptions are unconnected:
 * /tf2_web_republisher:
   * /tf2_web_republisher/cancel
   * /tf2_web_republisher/goal

WARNING These nodes have died:
 * motor_node-4
 * controller_spawner-3


Found 2 error(s).

ERROR Could not contact the following nodes:
 * /motor_node

ERROR Errors connecting to the following services:
 * service [/motor_node/set_parameters] appears to be malfunctioning: Unable to communicate with service [/motor_node/set_parameters], address [rosrpc://cory.local:38081]
 * service [/motor_node/set_logger_level] appears to be malfunctioning: Unable to communicate with service [/motor_node/set_logger_level], address [rosrpc://cory.local:38081]
 * service [/controller_manager/switch_controller] appears to be malfunctioning: Unable to communicate with service [/controller_manager/switch_controller], address [rosrpc://cory.local:38081]
 * service [/controller_manager/load_controller] appears to be malfunctioning: Unable to communicate with service [/controller_manager/load_controller], address [rosrpc://cory.local:38081]
 * service [/controller_manager/list_controllers] appears to be malfunctioning: Unable to communicate with service [/controller_manager/list_controllers], address [rosrpc://cory.local:38081]
 * service [/controller_manager/unload_controller] appears to be malfunctioning: Unable to communicate with service [/controller_manager/unload_controller], address [rosrpc://cory.local:38081]
 * service [/controller_manager/list_controller_types] appears to be malfunctioning: Unable to communicate with service [/controller_manager/list_controller_types], address [rosrpc://cory.local:38081]
 * service [/controller_manager/reload_controller_libraries] appears to be malfunctioning: Unable to communicate with service [/controller_manager/reload_controller_libraries], address [rosrpc://cory.local:38081]
 * service [/motor_node/get_loggers] appears to be malfunctioning: Unable to communicate with service [/motor_node/get_loggers], address [rosrpc://cory.local:38081]

ubuntu@cory:~$ 

Do you have an RTC battery (CR2032) installed?

Is the robot continuously connected to the network? This sounds like it could be due to time synchronization issues.
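(Both are quick to check from the shell; timedatectl and hwclock are standard on the Ubuntu images:)

    # Is the system clock sane, and is NTP synchronization active?
    timedatectl status
    # Can the hardware RTC be read at all? (This errors out if no usable RTC is present.)
    sudo hwclock -r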

Rohan

Hi, Rohan, I comprehend where you are going with this question…

I just looked around the motherboard, but I do not even see a holder for a CR2032. (Reminder: this is a Magni.)

(NEVER MIND) Could you please give me a hint where to look?

UPDATE ~ I found the CR2032 socket. We will endeavor to remedy this.

HOWEVER:
I do not see how this line of investigation would have any impact on the working-then-failing-then-working-again behavior AFTER several minutes of perfect use. PLUS, it happens so very randomly, long after the system clock has been updated via WiFi from network time services.

Well, I'd just give it a try. You are going to need the CR2032 anyway!

In general, crashes because of a bad clock are caused by buffer overflow errors, typically when a node tries to calculate something based on two times (e.g., speed or distance driven). Speed is of course (d1 - d2)/(t1 - t2), where 1 and 2 are two points in time and d and t are position and time respectively. Obviously, if time is measured in milliseconds and there is a 40-year jump, then you can overflow even quite a large buffer or cause other unpredictable results (e.g., speed rounds to zero even though other parts of the algorithm assume it's non-zero and divide by it, etc.).

For this to matter you need:

  • a jump in system time
  • a calculation going on at that moment that results in a buffer overflow error if there is a big jump in system time

As such, when and whether it's going to crash is often unpredictable - because when network time comes up is unpredictable, and we don't stop things from working just because the RTC is obviously not working (perhaps we should? What do people think?). Generally a node crashes when the system acquires network time and experiences a huge jump forward in time, but that's not always the case. Also, stuff can crash and it can take a while before you notice. For example, if network time comes up before the motor node does, there will not be any problem. If, however, it takes a few seconds to come up for whatever reason, and the node is running and at that moment attempting to execute, say, a speed calculation, it might crash. You can then sit around for several minutes with perfectly good network time but no motor node, and then try to drive the robot only to discover that it isn't working.
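(To put rough numbers on the overflow argument - a 40-year jump expressed in milliseconds dwarfs even a signed 32-bit counter, which tops out at about 24.9 days:)

    # ~40 years in milliseconds:
    echo $(( 40 * 365 * 24 * 60 * 60 * 1000 ))   # 1261440000000
    # Largest value a signed 32-bit millisecond counter can hold:
    echo $(( (1 << 31) - 1 ))                    # 2147483647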

Hello, I am a hardware engineer at UR. If you decide that all ROS processes are running fine, I have a different thing to check: this may be something we sometimes see after a motor-lurch situation like the one you report.

In production we found that a motor power protection circuit was put in that, once triggered, does not retry. We are fixing this with the correct version of the chips on the current board run, but that does not help you right now, I know.

So, to see if you have the hardware-related 'latching protection' issue, please see my post from 10 minutes ago in this forum post:

Report back to this thread if your symptom shows the missing 24V where I mention it must be present.

Also, we are now adding LEDs to show the post-power-protection voltage on the next board rev, in production now (again, I know that does not help, BUT we are always improving things as we find them).

Thanks, Mark

Mark,

I looked at your other post, and I understand your concern that the motor protection circuit might be tripped.

I tried to follow your explanation in the other post (Motors don't work). However, I cannot tell from your description exactly where I should measure to see the presence (or absence) of the 24V [switched] voltage.

Could you please upload an image showing exactly which MOSFET to examine, and exactly which lead? It would also help if you specified the part name (e.g., M903_ECB_MOT).

Even so, it is strange that there are times when the motors become operative again after some random interval. I can't see how to reconcile that behavior with a non-recovering motor protection circuit.

I agree that if the motors respond at some later point PRIOR to a full power-off, then motor power does not seem to be the issue. Sorry I have confused the diagnosis of this issue. Also, I have just added a picture, because you are correct - I should have included one for clarity. Yes, it was M903_ECB_MOT that I was discussing but had not specified, so sorry about the missing picture.

Hi, Mark,

Yesterday we took one of our Magnis outside for a drive around the block, along with a simple data-gathering payload. Its motors randomly stopped (and resumed) at least thirty (30) times.

Our payload: we mounted a LiDAR on top, connected to the Raspberry Pi 3B+ through a wired Ethernet cable, plus a 9600-baud USB-serial connection to a simple (Adafruit) GPS receiver. We wrote captured data (ROS bags) to the local micro-SD card once every second. Throughout our journey, we captured approximately 6 GB of ROS bag data.

While driving the Magni, we experienced the "stuttering" motors randomly (ranging from once per minute to once every ten seconds). Typically, we would be teleoperating it, and the rover would just randomly STOP, and then, less than half a second later, it would LURCH FORWARD, as if trying to catch up after its brief interruption.

I recognize this lurch behavior as originating from the PID control algorithm, which, when it resumes after the interruption, instantly sees a large "delta" error. That error translates into a brief burst of current - and hence speed - until the rover recovers.

This behavior out in the field resembles the error I described earlier in this Motor-Unresponsive thread - just far worse, since we were driving it, with vibration and a large software load as well.

I suspect you guys at Ubiquity have implemented the motor PID as an ordinary user-space task. If so, it would be subject to preemption due to LUbuntu's process loading.

Please comment on my observations.

The PID loop is implemented on the Magni controller board processor, not in the non-realtime ROS code that runs on the Raspberry Pi Linux OS. It is possible that there is some ROS-side interaction, but I'm not up to speed on those interactions. We do have an active program to work on speed control, as we understand its importance.

Ok, thanks Mark.

Please, whoever knows about these things, keep the comments coming. I feel this topic needs more clarity.

Obviously, then, there is some interaction between ROS and the dedicated slave PID controller. We need to understand the edges of the software's capability so we can avoid triggering this problematic behavior.

This is a very interesting observation.

The PID loop runs on the microcontroller of the motor controller and is completely independent of the Raspberry Pi. However, as you may know, the motor controller is governed by a dead-man timer: it expects new motor drive commands from the Raspberry Pi every 1/10th of a second (it's a parameter that can be set). If it doesn't receive a new command within the dead-man time, it assumes that the Raspberry Pi is unresponsive or dead - so it stops the robot. This is obviously for everyone's safety - you don't want the robot driving around when the computer controlling it is unresponsive.
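(A quick proxy check, assuming the usual /cmd_vel topic: the serial traffic to the motor board is what the dead-man timer actually watches, but a stalling command stream on the ROS side is already suggestive:)

    # Measure the actual publish rate of velocity commands; sustained dips
    # well below 10 Hz would give the dead-man timer a chance to trip:
    rostopic hz /cmd_vel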

Did you have a sense of how much processor load you were putting on the RPi? I don't know what type of LIDAR you were using, but if you were shoveling data onto the RPi and loading it with other data-intensive tasks, it's possible you did so in a way that kept the motor node from sending out a new command every dead-man period. This is a possible explanation of the effects you were seeing.

It is obviously easy to attach an additional computer to the robot - there are power connectors and attachment points galore - and that second computer could take on some of the resource-intensive tasks. That might be an easy solution. The other option is to understand how you are loading the RPi, to see if you can tune out the behavior. You can always extend the dead-man timer to see if that improves things - but that is surely not recommended operating practice.
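(Two generic Linux-side experiments along those lines - nothing Magni-specific, just standard tools; the motor_node process name below is an assumption:)

    # See which processes are consuming the Pi's CPU while the payload runs:
    top
    # As an experiment, raise the motor node's scheduling priority so that
    # it is preempted less often under load:
    sudo renice -n -5 -p $(pgrep -f motor_node)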

Hi,

I congratulate you on reading my mind, Dave. I suspected the likely existence of a dead-man timer in this control loop; thanks for confirming my suspicions.

I do not want to mess with the 0.100-second setting in the microcontroller; I concur with your safety reasons for it. I believe it SHOULD be the responsibility of the R-Pi system to keep tickling the motor controller.

I am rapidly converging on an understanding of the R-Pi's capabilities for operating our future payloads. I have already begun our in-house argument for offloading processing to a "payload computer".

Performance envelope: admittedly, the LiDAR operated over a high-speed (>100 Mbps) Ethernet link. It is a 3D LiDAR with 64 simultaneous scan lines. It is easy to imagine it consuming enough CPU cycles to cause momentary motor_node latency with respect to the 100-millisecond dead-man timing requirement.
HOWEVER, it remains a MYSTERY how, with NO payload software load, and (this may mean something) with the Magni mounted on BLOCKS (no longer in contact with any driving surface), the Magni motors are occasionally unresponsive - sometimes accompanied by a motor "stutter" like the one most recently described, and sometimes by a motor_node crash.

That is interesting - yes, I'd expect that a 3D LIDAR shoveling data down Ethernet while writing it to the SD card might very well overwhelm the RPi.
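(If you want to test that theory, the stock vmstat tool shows whether the Pi is stalling on SD-card I/O while rosbag is writing:)

    # Watch the 'wa' (I/O wait) column once per second during a recording run;
    # a high value means the CPU is sitting idle waiting for the SD card:
    vmstat 1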

Agree a payload computer for more compute intensive tasks is probably the way to go. As mentioned there are power connectors that enable a wide variety of options.

The lurching forward that you describe earlier in the thread, which occurs when motor power is disengaged and re-engaged - that's a bug and we are planning to fix it. It sounds like you have a different issue, though; to help us understand it better, perhaps you could post a video - thanks. The PID loop is a possible culprit, as the default tuning is for a loaded robot, and when you put the robot on blocks you change the dynamics of the system. Other potential things to look at include:

-Check that you have the RTC battery (it's a CR2032 on the back side of the board) and that your RPi is reporting roughly the correct time.
-Make sure that the 12 V and 5 V power connectors are actually generating 12 V and 5 V (this was checked in the factory, but it might be worth checking again). Put a voltmeter on the Molex connectors at the top of the board; you'll need to check both Molex connectors, as there are 2x 12 V supplies and 2x 5 V supplies. Technically only the 5 V and 12 V main will matter here, but you might as well check all of them.
-Make sure the motor node is alive and well throughout the test (see the sketch below).
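A minimal way to do that last check from a second terminal (standard watch plus rosnode, nothing Magni-specific):

    # Ping the motor node once a second for the duration of the test:
    watch -n1 rosnode ping -c 1 /motor_node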

I am posting a link to an edited video file showing some of the difficulty observed with steering downhill (see the other issue), and the motor-halting problem described above.

Sadly, we weren't recording when the motors were having dozens of these motor-stopped interruptions. But we still found one really good example; the edited file below shows this one example repeatedly.

0:00 … 0:15 = steering trouble downhill #1
0:15 … 0:30 = steering downhill, resolved by Reverse
0:30 … 0:40 = motor runaway, until buffered commands eventually all de-queued
0:42 … 1:38 = steering trouble downhill #2
1:38 … 1:41 = motors STOPPED, and then resumed (motor unresponsive)
1:45 = motors stopped (repeated)
1:53 = motors stopped (re-repeated)

drop box video

This is Pu from Rapyuta Robotics. We plan to change the lead-acid battery to a Li-ion battery.
Given the 27.2 V - 21 V working voltage, I am considering using 6S or 7S 18650 packs; the typical discharge curve is attached (Figure 1).
6S: 18 V - 25.2 V
7S: 21 V - 29.4 V
so my questions are:

  1. What is the physical working voltage of the robot?
  2. Can we set up the over-voltage/under-voltage thresholds, or disable them?
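(For reference, those pack windows are just the usual ~3.0 V empty / 4.2 V full per-cell Li-ion range multiplied out:)

    # 6S and 7S pack voltage windows at 3.0 V (empty) and 4.2 V (full) per cell:
    echo "6S: $(echo "6 * 3.0" | bc) V - $(echo "6 * 4.2" | bc) V"   # 18.0 V - 25.2 V
    echo "7S: $(echo "7 * 3.0" | bc) V - $(echo "7 * 4.2" | bc) V"   # 21.0 V - 29.4 V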