2020-05-25

Weather station woes and fixes

I've set up some Raspberry Pis to export data from various FineOffset weather stations, using WeeWX as the server software periodically downloading records and presenting them graphically in a webpage. Each weather station comes with a dedicated console to receive transmissions from the outdoor sensors every minute or so, and the console has a USB socket to allow a host to configure it and extract data.

This console is known to have a “USB lock-up” bug, whereby it refuses to talk to the host after some random period (from a few days to a couple of months), even though it had been interacting successfully prior to that point. The only robust work-around is to power-cycle the console, which is not easy to automate. Here's what I had to do.

Detection

The lock-up bug now appears as Operation timed out in the WeeWX log, as given by sudo /bin/systemctl status weewx:

fousb: get_records failed: [Errno 110] Operation timed out

You can get essentially the same lines from grep weewx /var/log/syslog. Four of these appear (about 45 seconds apart), and then WeeWX seems to reconnect in vain, and gets another four, and so on. This cycle lasts about 3½ minutes.

Note that the WeeWX documentation on the matter identifies a different error:

could not detach kernel driver from interface

Maybe that means this isn't really the lock-up bug I'm getting, but the symptoms and treatment seem to be the same.

Recovery

You have to take the batteries out of the console, and ensure it is disconnected over USB. You can run the console off USB power alone, so for an unattended power cycle, you just need a USB hub that can depower its sockets. The big disadvantage of leaving the batteries out is that no readings are taken during a power cut; with the batteries in, the console would keep recording, and you could at least download the backlog when power returned, as it can store several days' worth.

Here are the Pis I'm using with each weather station:

Hostname  Host model  Weather station model  WeeWX version
fish      RPi 3B+     WH3083                 3.9.2
ruscoe    RPi 3B      WH3083                 3.9.2
kettley   RPi 3B      WH1080                 3.9.1

All are running some version of Raspbian.

I've used uhubctl to check for and invoke the power-cycling feature, and can confirm that both the RPi 3B and 3B+ can control the power on their USB sockets. I also tried an RPi Zero W, which would have had the ideal amount of grunt for the task, but it's unable to control power on its sockets. Since I've not seen the problem on the WH1080, it could be used there, or indeed on any similar set-up with a different type of weather station. I was using an older RPi model at some point (with no built-in Wi-Fi); it could power-cycle its entire USB hub, although this included the USB Wi-Fi chip!

The output of sudo uhubctl looks something like this (on a 3B; it's marginally different on the 3B+):

$ sudo uhubctl 
Current status for hub 1-1 [0424:9514]
  Port 1: 0503 power highspeed enable connect [0424:ec00]
  Port 2: 0100 power
  Port 3: 0100 power
  Port 4: 0303 power lowspeed enable connect [1941:8021]
  Port 5: 0100 power

1941:8021 is the weather station console:

$ lsusb 
Bus 001 Device 014: ID 1941:8021 Dream Link WH1080 Weather Station / USB Missile Launcher
Bus 001 Device 013: ID 0424:ec00 Standard Microsystems Corp. SMSC9512/9514 Fast Ethernet Adapter
Bus 001 Device 002: ID 0424:9514 Standard Microsystems Corp. SMC9514 Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

This means that a command of the following form will power-cycle the console:

sudo uhubctl -l 1-1 -p 4 -a 2 -d 30 -R

Update: On fish, I had to set -p to one less than the reported port! sudo uhubctl said it was in port 3, but the command only worked on port 2. It helps to have someone looking at the console to confirm when you're doing it remotely!

  • -l 1-1 -p 4 are taken from the output of uhubctl, identifying the hub and port.
  • -a 2 causes a power cycle, rather than switching on or off.
  • -d 30 keeps it off for a generous 30 seconds. That could probably be trimmed a bit.
  • -R resets the hub, forcing devices to re-associate. I found this to be essential, and wonder if it would be effective without the power cycle. Update: It isn't; you must remove the batteries.
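Rather than reading the hub and port off by eye, they could be pulled out of the uhubctl output by vendor/product ID. The following is only a sketch: find_port is a hypothetical helper (not part of uhubctl), it assumes the output layout shown above, and the sample text is the 3B listing from this article rather than a live call. Given the off-by-one seen on fish, whatever it reports would still need checking against the real console.

```shell
#!/bin/bash

# find_port ID: read `uhubctl` output on stdin and print "HUB PORT" for the
# first port whose device ID (vendor:product) matches ID.
# Hypothetical helper; assumes the output layout shown in this article.
find_port() {
    local id="$1" hub='' line
    local hub_re='^Current status for hub ([^ ]+)'
    local port_re="^Port ([0-9]+):.*\\[${id}\\]"
    # `read` strips the leading indentation from the port lines.
    while read -r line ; do
        if [[ "$line" =~ $hub_re ]] ; then
            hub="${BASH_REMATCH[1]}"
        elif [[ "$line" =~ $port_re ]] ; then
            printf '%s %s\n' "$hub" "${BASH_REMATCH[1]}"
            return 0
        fi
    done
    return 1
}

# Sample input copied from the listing above, instead of running uhubctl.
sample='Current status for hub 1-1 [0424:9514]
  Port 1: 0503 power highspeed enable connect [0424:ec00]
  Port 2: 0100 power
  Port 3: 0100 power
  Port 4: 0303 power lowspeed enable connect [1941:8021]
  Port 5: 0100 power'

find_port 1941:8021 <<< "$sample"
```

With the sample listing this prints "1-1 4". On a live system the input would come from something like sudo uhubctl | find_port 1941:8021.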

Putting it together

A script, in ~/.local/bin/check-weather-station:

#!/bin/bash

count=0
while read -r line ; do
    if [[ "$line" == *"get_records failed: [Errno 110] Operation timed out"* ]] ; then
        ((count++))
    elif [[ "$line" == *"Stopping LSB: weewx weather system"* ]] ; then
        count=0
    fi
done < <(grep weewx /var/log/syslog | tail -n 50)

if [ $count -ge 4 ] ; then
    printf >&2 'Fault detected, power-cycling...\n'
    echo >&2 'Stopping station software'
    sudo /bin/systemctl stop weewx
    echo >&2 'Power-cycling hub'
    sudo /usr/sbin/uhubctl -l 1-1 -p 4 -a 2 -d 30 -R
    echo >&2 'Waiting for end of sensor-learning period'
    sleep 180
    echo >&2 'Setting time'
    sudo /usr/bin/wee_device -y --set-time
    echo >&2 'Setting interval'
    sudo /usr/bin/wee_device -y --set-interval=5
    echo >&2 'Restarting station software'
    sudo /bin/systemctl start weewx
fi

A cron job then checks every few minutes:

*/3 * * * * $HOME/.local/bin/check-weather-station

That should pick up the fault within one 3½-minute cycle.

Other aspects of the script:

  • Four “timed out” messages are awaited. Maybe I could get away with two, or even one!

  • The weewx service is suspended during the reset. This ensures there's no interaction with the console shortly after it comes back on.

  • While the service is suspended, we don't want overlapping invocations of the script to do anything. This is detected by resetting the message count whenever we see that the service has been stopped. Only “timed out” messages that are not followed by a “stopping” message are counted.

    (There's a potential race condition here, but it's not going to happen unless parsing the log and stopping the service take more than 3 minutes.)

  • Waiting three minutes after the reset ensures that the console's sensor-learning mode is not jeopardized by external activity. The weather-station manual warns about key activity on the console during this time, and I suspect it actually extends to USB activity too. I've been very cautious, so it might be possible to trim the timing a bit.

  • The console's time is synchronized with the host's. This can only be done while the service is stopped. (Unfortunately, this does not seem to update the clock displayed in the console.)

  • The logging interval is set to 5 minutes. Apparently, the console can sometimes forget this after a power cycle, so it is set again here; a 5-minute interval is also thought to reduce the likelihood of lock-ups.
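The counting rule can be tried out in isolation by moving the loop into a function that reads log lines from standard input. This is a sketch only: count_timeouts is a name made up here, and the sample lines contain just the two message texts quoted in this article, without the timestamps and prefixes a real syslog would have.

```shell
#!/bin/bash

# count_timeouts: read log lines on stdin and print how many "timed out"
# messages appear after the most recent "stopping" message.
# (Hypothetical name; same matching rules as the script above.)
count_timeouts() {
    local count=0 line
    while read -r line ; do
        if [[ "$line" == *"get_records failed: [Errno 110] Operation timed out"* ]] ; then
            count=$((count + 1))
        elif [[ "$line" == *"Stopping LSB: weewx weather system"* ]] ; then
            count=0
        fi
    done
    printf '%d\n' "$count"
}

timeout_msg='fousb: get_records failed: [Errno 110] Operation timed out'
stop_msg='Stopping LSB: weewx weather system'

# Four timeouts with no stop in between: the fault would be acted on.
printf '%s\n' "$timeout_msg" "$timeout_msg" "$timeout_msg" "$timeout_msg" \
    | count_timeouts

# A stop after the timeouts resets the count, so an overlapping run does nothing.
printf '%s\n' "$timeout_msg" "$timeout_msg" "$timeout_msg" "$timeout_msg" "$stop_msg" \
    | count_timeouts
```

The first call prints 4 and the second prints 0, which is the behaviour the overlapping-invocation guard relies on.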

Other issues

  • One Pi wouldn't come back on after a power cut. Changing the power supply fixed that.

  • Another Pi seems to lose its Internet connection, while continuing to gather data from the console. As the Pi is headless, the simplest thing for a non-technical person to do is to power-cycle it, but that's a bit drastic, and undermines the goal of unattended operation. I tried the following to prevent the Wi-Fi from going to sleep, but it still happened:

    sudo iw dev wlan0 set power_save off
    

    I've resorted to pinging the router once a day.

Results

With a cruder detection mechanism (one that took about four or five log cycles to get lucky), I've seen the script work twice in just a couple of days. I'm now trying out the new detection mechanism above, which should be safe to run as often as every three minutes, and so should be able to detect the first cycle. I'll update this article as things develop.

2020-02-02

Removing variable prefixes and suffixes from other variables in Bash

Just been bitten by this…

If you have a variable txt in Bash, you can strip a given prefix or suffix from it like this:

$ txt=a/b.d/c.jpg
$ echo "${txt%.*}"
a/b.d/c
$ echo "${txt##*.}"
jpg
$ echo "${txt%%.*}"
a/b
$ echo "${txt#*.}"
d/c.jpg

The % operator strips off the shortest matching suffix, and .* matches .jpg, so that gets removed. %% strips off the longest matching suffix. Similarly, # and ## strip off the shortest and longest matching prefix, respectively. Asterisks, square brackets and other characters are special, following the rules under Pattern Matching in the Bash manual page.

You can also use literal strings as the patterns, i.e., no special characters:

$ txt=a/b.d/c.jpg
$ echo "${txt%.jpg}"
a/b.d/c
$ echo "${txt%.png}"
a/b.d/c.jpg
$ echo "${txt#a/b.d/}"
c.jpg
$ echo "${txt#c/b.d/}"
a/b.d/c.jpg

Note that, if the prefix or suffix doesn't match (whether you use special characters or not), you get the whole string returned.

These operations are useful for traversing pathnames:

$ path="/home/john/file.jpg"
$ echo Leaf is "${path##*/}"
Leaf is file.jpg
$ echo Dir is "${path%/*}"
Dir is /home/john

You have to be careful if your input doesn't contain the separator:

$ input1=path/to/file.jpg
$ input2=file.jpg
$ echo Input 1 dir "[${input1%/*}]" leaf "[${input1##*/}]"
Input 1 dir [path/to] leaf [file.jpg]
$ echo Input 2 dir "[${input2%/*}]" leaf "[${input2##*/}]"
Input 2 dir [file.jpg] leaf [file.jpg]

To avoid this special case, I thought I could do this:

input1=path/to/file.jpg
input2=file.jpg
input1leaf="${input1##*/}"
input1dir="${input1%${input1leaf}}"
input2leaf="${input2##*/}"
input2dir="${input2%${input2leaf}}"
echo "[${input1dir}]" "[${input1leaf}]"
echo "[${input2dir}]" "[${input2leaf}]"

…which leads to this:

[path/to/] [file.jpg]
[] [file.jpg]

However, I hadn't noticed that special characters are still interpreted after the partial expansion:

input3="path/to/file [2002].jpg"
input3leaf="${input3##*/}"
input3dir="${input3%${input3leaf}}"
echo "[${input3dir}]" "[${input3leaf}]"

The square brackets are taken as a wildcard, and fail to match the literal value:

[path/to/file [2002].jpg] [file [2002].jpg]

The trick is to quote again:

input3="path/to/file [2002].jpg"
input3leaf="${input3##*/}"
input3dir="${input3%"${input3leaf}"}"
echo "[${input3dir}]" "[${input3leaf}]"

Now you get the intended result:

[path/to/] [file [2002].jpg]

An alternative technique would be to use the length of your prefix/suffix in a substring operation, but it's less convenient and more error-prone if you want to make small adjustments to a prefix or suffix before applying it.
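For comparison, the length-based alternative might look like the sketch below. split_path is a name invented for this illustration; it takes the leaf with a fixed pattern, then cuts the directory part by character count, so nothing computed is ever used as a pattern.

```shell
#!/bin/bash

# split_path PATH: print "[dir] [leaf]" without using the leaf as a pattern.
# (Hypothetical helper for illustration.)
split_path() {
    local path="$1"
    # The pattern here is the fixed string "*/", so no quoting issue arises.
    local leaf="${path##*/}"
    # Keep the first (length of path - length of leaf) characters.
    local dir="${path:0:${#path}-${#leaf}}"
    printf '[%s] [%s]\n' "$dir" "$leaf"
}

split_path "path/to/file [2002].jpg"
split_path "file.jpg"
```

This prints "[path/to/] [file [2002].jpg]" and "[] [file.jpg]", the same results as the quoted-pattern version.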

Anyway, in summary, if you're going to use Bash's prefix/suffix removal with a computed pattern that should be taken literally, put the inner expansion in quotes!

Bash redirection with descriptor in variable, and locking

A recommended way to acquire a lock in Bash is to open the lock file for a group command, and call flock on the open descriptor before doing anything dangerous:

{
  echo waiting
  flock -x 9
  echo in
  sleep 10
  echo done
} 9> /tmp/lock

Try it in two independent terminals. The second will get past flock only once the first has finished.

However, one should never have to pick an arbitrary file descriptor (9 in this case). Fortunately, you can get Bash to choose an available descriptor, using {var} in place of the literal descriptor number:

unset lfd
{
  echo waiting
  flock -x $lfd
  echo in
  sleep 10
  echo done
} {lfd}> /tmp/lock

Problem solved!

No, wait. The descriptor doesn't get closed at the end of the group command, so your second invocation will hang indefinitely. Once the first terminal has finished, if you manually close the descriptor, the second proceeds:

exec {lfd}>&-
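The lingering descriptor can also be seen without a second terminal. The sketch below assumes a Linux-style /dev/fd directory and uses a temporary file from mktemp in place of /tmp/lock; the variable names state_after_group and state_after_close are made up here.

```shell
#!/bin/bash

lockfile="$(mktemp)"
unset lfd

# Open the file on a descriptor chosen by Bash, for the group command only.
{
    echo "inside: descriptor $lfd"
} {lfd}> "$lockfile"

# After the group command, check whether the descriptor is still open.
if [ -e "/dev/fd/$lfd" ] ; then
    state_after_group=open
else
    state_after_group=closed
fi
echo "after group: $state_after_group"

# Close it explicitly and check again.
exec {lfd}>&-
if [ -e "/dev/fd/$lfd" ] ; then
    state_after_close=open
else
    state_after_close=closed
fi
echo "after close: $state_after_close"

rm -f "$lockfile"
```

If the behaviour is as described above, the second line of output should say "open" and the third "closed".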

Looks like you have to do things more explicitly (and the group command is no longer useful):

echo waiting
unset lfd
exec {lfd}> /tmp/lock
flock -x $lfd
echo in
sleep 10
echo done
exec {lfd}>&-

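This is inconvenient if you want to break or continue out of an enclosing loop, because each early exit then has to close the descriptor itself. The group-command form with a fixed descriptor has no such problem: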

for i in $(seq 1 10)
do
  {
    echo waiting
    flock -x 9
    echo in
    sleep 5
    if something_went_wrong ; then continue ; fi
    sleep 5
    echo done
  } 9> /tmp/lock
done

Is this a bug, a feature, or a mistake on my part? (Bash version 4.4.20(1)-release.)
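One way to keep the explicit open and close out of the loop body might be a small wrapper function, so that the loop only reacts to an exit status. This is a sketch, not a recommended idiom: with_lock and step are names invented here, and a temporary file from mktemp stands in for /tmp/lock.

```shell
#!/bin/bash

# with_lock FILE COMMAND [ARGS...]
# Run COMMAND while holding an exclusive lock on FILE, then close the
# descriptor whatever COMMAND's exit status was. (Hypothetical helper.)
with_lock() {
    local lockfile="$1" lfd rc
    shift
    exec {lfd}> "$lockfile"
    flock -x "$lfd"
    "$@"
    rc=$?
    exec {lfd}>&-
    return "$rc"
}

# Stand-in for the work done under the lock; fails on even numbers.
step() {
    echo "in $1"
    (( $1 % 2 ))
}

lockfile="$(mktemp)"
for i in 1 2 3 ; do
    with_lock "$lockfile" step "$i" || continue
    echo "done $i"
done
rm -f "$lockfile"
```

The loop prints "in" for all three iterations but "done" only for 1 and 3, and it does not hang on the second iteration because the descriptor is closed on every path out of with_lock.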