[lm-sensors] w83795 fan control not working
Jean Delvare
2011-04-07 13:00:07 UTC
Hi Darren,

I am redirecting this discussion to the right mailing list.
I haven't been able to control the fan speed using the w83795 driver.
The BIOS "Quiet" setting appears to be braindead as it runs quietly for
a while and then switches to near full throttle for a minute or so and
then returns to the previous state (this is with the system basically
idle). The temperatures (from w83795adg-i2c-0-2f) never reach anything
At least, if the BIOS has a "Quiet" setting, this suggests that the
hardware is designed for fan speed control.

Do you see any message in the kernel logs when the fan switches to high
speed?
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
This is very hot.
temp5: +40.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +29.5°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +25.5°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
...
OK, waited 10 minutes and it didn't want to scream at me. But if memory
serves, there is only a variance of a few degrees before the fans kick
in.
None of the measurements above is anywhere close to its set limits, so
this behavior isn't caused by an alarm raised by the W83795ADG.
I'm hoping to use pwmconfig/fancontrol with the w83795 driver to restore
some sanity to the fan usage. I tried with V 0.7 on the Ubuntu 10.10
server kernel (vmlinuz-2.6.35-22-server) as well as with the current
version in the linux-2.6.git tree (2.6.39-rc1+). I'm running on the
following hardware with a pair of Intel Xeon X5680 CPUs.
SUPERMICRO MBD-X8DTL-iF-O Motherboard
http://www.supermicro.com/products/motherboard/QPI/5500/X8DTL-iF.cfm
linux-2.6.39-rc1+: 99759619b27662d1290901228d77a293e6e83200
$ grep 83795 .config
CONFIG_SENSORS_W83795=m
CONFIG_SENSORS_W83795_FANCTRL=y
$ lsmod | grep 83795
w83795 43879 0
---------------------------
hwmon0/device is max1617
This would be very surprising and smells like a misdetection. Which
could, in turn, explain (some of) your problems. Was the use of the
adm1021 driver suggested by sensors-detect? I presume that the output
for the supposed max1617 chip in "sensors" is plain wrong? I would
advise that you do not load the adm1021 driver.
hwmon1/device is w83627dhg
Super-I/O (multifunction) chip, probably not used for monitoring.
Unloading the w83627ehf driver would make running pwmconfig much easier.
hwmon2/device is w83795adg <--- So it found the device
hwmon1/device/pwm1
hwmon1/device/pwm2
hwmon1/device/pwm3
hwmon2/device/pwm1
hwmon2/device/pwm1 stuck to 125 <--- This doesn't look good.
Manual control mode not supported, skipping hwmon2/device/pwm1.
Indeed. This suggests that the driver wasn't able to switch this fan
output to manual mode. The strange thing is that it works for me, with
the same chip on a different board (lm-sensors 3.3.0, kernel 2.6.38.2.)
hwmon2/device/pwm2 <--- Which fans does it control?
The next steps in pwmconfig should tell. One thing worth noting is that
you have 6 fan inputs used on the W83795ADG, but the chip has only two
fan control outputs. So it is impossible that you have one control per
fan. On my board, pwm1 controls both CPU fans and pwm2 controls all 6
case fans.
Giving the fans some time to reach full speed...
hwmon1/device/fan1_input current speed: 0 ... skipping!
hwmon1/device/fan2_input current speed: 0 ... skipping!
hwmon1/device/fan3_input current speed: 0 ... skipping!
hwmon1/device/fan5_input current speed: 0 ... skipping!
hwmon2/device/fan1_input current speed: 0 ... skipping!
hwmon2/device/fan2_input current speed: 1931 RPM <-- cpu fan
Note, the CPUs are very close together and to the rear chassis fan, this
prevents me from installing both CPU fans. I opted to keep the larger
(quieter) chassis fan adjacent to the second CPU over the second smaller
CPU fan.
hwmon2/device/fan3_input current speed: 0 ... skipping!
hwmon2/device/fan4_input current speed: 2652 RPM <-- small chassis fan
hwmon2/device/fan5_input current speed: 1814 RPM <-- large chassis fan
hwmon2/device/fan6_input current speed: 0 ... skipping!
---------------------------
The fans didn't change speed during the pwmconfig run. I did allow it to
switch all the pwm controls to manual mode.
Does the board manual say whether the case fans are supposed to be
controllable, or only the CPU fans?
$ rage-ipmi.sh sensor
FAN 1 | na | RPM | na | na | na | na | na | na | na
FAN 2 | 1936.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | na | RPM | na | na | na | na | na | na | na
FAN 4 | 2704.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 1764.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 6 | na | RPM | na | na | na | na | na | na | na
CPU1 Vcore | 0.952 | Volts | ok | 0.776 | 0.800 | 0.824 | 1.352 | 1.376 | 1.400
CPU2 Vcore | 0.952 | Volts | ok | 0.776 | 0.800 | 0.824 | 1.352 | 1.376 | 1.400
CPU1 DIMM | 1.520 | Volts | ok | 1.288 | 1.312 | 1.336 | 1.656 | 1.680 | 1.704
CPU2 DIMM | 1.520 | Volts | ok | 1.288 | 1.312 | 1.336 | 1.656 | 1.680 | 1.704
+1.5 V | na | Volts | na | na | na | na | na | na | na
+5 V | 5.056 | Volts | ok | 4.416 | 4.448 | 4.480 | 5.536 | 5.568 | 5.600
+5VSB | 5.056 | Volts | ok | 4.416 | 4.448 | 4.480 | 5.536 | 5.568 | 5.600
+12 V | 12.137 | Volts | ok | 10.600 | 10.653 | 10.706 | 13.250 | 13.303 | 13.356
-12 V | -11.904 | Volts | ok | -13.650 | -13.456 | -13.262 | -10.546 | -10.352 | -10.158
VTT | 1.112 | Volts | ok | 0.808 | 0.816 | 0.824 | 1.320 | 1.336 | 1.352
+3.3VCC | 3.264 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
+3.3VSB | 3.264 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
VBAT | 3.096 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
CPU1 Temp | 0x1 | discrete | 0x0000| na | na | na | na | na | na
CPU2 Temp | 0x1 | discrete | 0x0000| na | na | na | na | na | na
System Temp | 40.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 75.000 | 77.000 | 79.000
P1-DIMM1A | 37.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 65.000 | 70.000 | 75.000
P1-DIMM2A | na | degrees C | na | na | na | na | na | na | na
P1-DIMM3A | na | degrees C | na | na | na | na | na | na | na
P2-DIMM1A | 37.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 65.000 | 70.000 | 75.000
P2-DIMM2A | na | degrees C | na | na | na | na | na | na | na
P2-DIMM3A | na | degrees C | na | na | na | na | na | na | na
Chassis Intru | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS Status | 0x1 | discrete | 0x01ff| na | na | na | na | na | na
$ dmesg | grep 83795
[ 12.643929] i2c i2c-0: Found w83795adg rev. B at 0x2f
[ 12.883789] w83795 0-002f: PECI agent 1 Tbase temperature: 100
[ 12.903779] w83795 0-002f: PECI agent 2 Tbase temperature: 100
[ 2288.932629] w83795 0-002f: Failed to read from register 0x030, err -6
[ 2613.292773] w83795 0-002f: Failed to write to register 0x040, err -6
[ 2693.333461] w83795 0-002f: Failed to read from register 0x01e, err -11
-6 is -ENXIO, returned by the i2c-i801 driver when a slave I2C device
doesn't answer. -11 is -EAGAIN, meaning arbitration loss, which can
happen on multi-master I2C buses, and I guess IPMI is implemented
exactly that way.
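To see how often each kind of failure occurs, the log can be tallied by error code. This is a minimal sketch, assuming the kernel log format shown above; tally_w83795_errors is a hypothetical helper name, not part of lm-sensors or the driver.

```shell
#!/bin/sh
# Sketch: summarise w83795 I2C failures from kernel log text on stdin.
# tally_w83795_errors is a hypothetical helper, not an lm-sensors tool.
tally_w83795_errors() {
    awk '
        /w83795/ && /err -[0-9]+/ {
            # Keep only the digits that follow the last "err -".
            code = $0
            sub(/.*err -/, "", code)
            sub(/[^0-9].*/, "", code)
            count[code]++
        }
        END {
            for (c in count) {
                name = "other"
                if (c == "6")  name = "ENXIO"    # device did not answer
                if (c == "11") name = "EAGAIN"   # arbitration lost
                printf "-%s %s %d\n", c, name, count[c]
            }
        }
    ' | sort
}

# Typical use on a live system:
#   dmesg | tally_w83795_errors
```

A log dominated by -6 points at the chip not answering, while -11 entries suggest another master on the bus.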
Am I doing something wrong?
Yes. You are using IPMI and a native Linux driver to access the same
monitoring chip. Both access methods don't know of each other and are
not synchronized.
Can I provide any additional information to
help narrow down what might be wrong?
Choose between IPMI and native drivers. If you want to use IPMI on this
board, then you have to forget about the w83795 driver. And about
software-driven fan speed control too, I'm afraid.
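A quick way to check the host side of this is to look for IPMI modules loaded next to w83795. The sketch below reads lsmod-style text on stdin; check_ipmi_conflict is a hypothetical helper name. It only covers the Linux host, so it cannot tell whether a management controller is talking to the chip on its own.

```shell
#!/bin/sh
# Sketch: report whether IPMI modules and the w83795 driver are loaded
# at the same time. check_ipmi_conflict is a hypothetical helper.
check_ipmi_conflict() {
    awk '
        $1 ~ /^ipmi_/  { ipmi = 1 }     # e.g. ipmi_si, ipmi_msghandler
        $1 == "w83795" { native = 1 }
        END {
            if (ipmi && native)
                print "conflict"
            else
                print "ok"
        }
    '
}

# Typical use on a live system:
#   lsmod | check_ipmi_conflict
```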

Did you look for a BIOS or IPMI firmware update already?
--
Jean Delvare
http://khali.linux-fr.org/wishlist.html
Darren Hart
2011-04-07 20:59:13 UTC
Post by Jean Delvare
Hi Darren,
I am redirecting this discussion to the right mailing list.
I haven't been able to control the fan speed using the w83795 driver.
The BIOS "Quiet" setting appears to be braindead as it runs quietly for
a while and then switches to near full throttle for a minute or so and
then returns to the previous state (this is with the system basically
idle). The temperatures (from w83795adg-i2c-0-2f) never reach anything
At least, if the BIOS has a "Quiet" setting, this suggests that the
hardware is designed for fan speed control.
Do you see any message in the kernel logs when the fan switches to high
speed?
No. Nothing.
Post by Jean Delvare
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
This is very hot.
It is... and yet it's much hotter than anything reported by coretemp (which I assumed would have some of the higher temperatures). Any idea what temp1 might be measuring?

$ sensors | grep °C
Core 0: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +22.0°C (high = +81.0°C, crit = +101.0°C)
temp1: +40.0°C (high = +138.0°C, hyst = +96.0°C) sensor = thermistor
temp2: -61.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp3: +36.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +35.8°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +24.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +23.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
Core 9: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +20.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +15.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +22.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +19.0°C (high = +81.0°C, crit = +101.0°C)
Post by Jean Delvare
temp5: +40.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +29.5°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +25.5°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
...
OK, waited 10 minutes and it didn't want to scream at me. But if memory
serves, there is only a variance of a few degrees before the fans kick
in.
None of the measurements above is anywhere close to its set limits, so
this behavior isn't caused by an alarm raised by the W83795ADG.
I'm hoping to use pwmconfig/fancontrol with the w83795 driver to restore
some sanity to the fan usage. I tried with V 0.7 on the Ubuntu 10.10
server kernel (vmlinuz-2.6.35-22-server) as well as with the current
version in the linux-2.6.git tree (2.6.39-rc1+). I'm running on the
following hardware with a pair of Intel Xeon X5680 CPUs.
SUPERMICRO MBD-X8DTL-iF-O Motherboard
http://www.supermicro.com/products/motherboard/QPI/5500/X8DTL-iF.cfm
linux-2.6.39-rc1+: 99759619b27662d1290901228d77a293e6e83200
$ grep 83795 .config
CONFIG_SENSORS_W83795=m
CONFIG_SENSORS_W83795_FANCTRL=y
$ lsmod | grep 83795
w83795 43879 0
---------------------------
hwmon0/device is max1617
This would be very surprising and smells like a misdetection. Which
could, in turn, explain (some of) your problems. Was the use of the
adm1021 driver suggested by sensors-detect?
Hrm, I noticed it reports:
Intel Core family thermal sensor... No
But if I load coretemp I get 12 sane temperature readings...

It does not detect adm1021, but it did report:

Trying family `National Semiconductor'... Yes
Found unknown chip with ID 0x1a11

However Kconfig says:

If you say yes here you get support for Analog Devices ADM1021
and ADM1023 sensor chips and clones: Maxim MAX1617 and MAX1617A,
Genesys Logic GL523SM, National Semiconductor LM84, TI THMC10,
and the XEON processor built-in sensor.
These are XEON CPUs, is this an older interface that has been replaced by something else?
Post by Jean Delvare
I presume that the output
for the supposed max1617 chip in "sensors" is plain wrong? I would
advise that you do not load the adm1021 driver.
OK, unloaded.
Post by Jean Delvare
hwmon1/device is w83627dhg
Super-I/O (multifunction) chip, probably not used for monitoring.
Unloading the w83627ehf driver would make running pwmconfig much easier.
Done
Post by Jean Delvare
hwmon2/device is w83795adg <--- So it found the device
hwmon1/device/pwm1
hwmon1/device/pwm2
hwmon1/device/pwm3
hwmon2/device/pwm1
hwmon2/device/pwm1 stuck to 125 <--- This doesn't look good.
Manual control mode not supported, skipping hwmon2/device/pwm1.
Indeed. This suggests that the driver wasn't able to switch this fan
output to manual mode. The strange thing is that it works for me, with
the same chip on a different board (lm-sensors 3.3.0, kernel 2.6.38.2.)
$ sensors --version
sensors version 3.1.2 with libsensors version 3.1.2

$ uname -a
2.6.39-rc1+
Post by Jean Delvare
hwmon2/device/pwm2 <--- Which fans does it control?
The next steps in pwmconfig should tell. One thing worth noting is that
you have 6 fan inputs used on the W83795ADG, but the chip has only two
fan control outputs. So it is impossible that you have one control per
fan. On my board, pwm1 controls both CPU fans and pwm2 controls all 6
case fans.
I read somewhere during my hours of searching for a solution to this that both CPU fans are controlled by the same pwm signal, so that is not surprising. It's too bad about the case fans though, I really like to run the larger quiet fan up before bringing up the smaller front fan, but, it is what it is.
Post by Jean Delvare
Giving the fans some time to reach full speed...
hwmon1/device/fan1_input current speed: 0 ... skipping!
hwmon1/device/fan2_input current speed: 0 ... skipping!
hwmon1/device/fan3_input current speed: 0 ... skipping!
hwmon1/device/fan5_input current speed: 0 ... skipping!
hwmon2/device/fan1_input current speed: 0 ... skipping!
hwmon2/device/fan2_input current speed: 1931 RPM <-- cpu fan
Note, the CPUs are very close together and to the rear chassis fan, this
prevents me from installing both CPU fans. I opted to keep the larger
(quieter) chassis fan adjacent to the second CPU over the second smaller
CPU fan.
hwmon2/device/fan3_input current speed: 0 ... skipping!
hwmon2/device/fan4_input current speed: 2652 RPM <-- small chassis fan
hwmon2/device/fan5_input current speed: 1814 RPM <-- large chassis fan
hwmon2/device/fan6_input current speed: 0 ... skipping!
---------------------------
The fans didn't change speed during the pwmconfig run. I did allow it to
switch all the pwm controls to manual mode.
I ran pwmconfig again with adm1021, ipmi_si, and w83627ehf unloaded. This time it detected 8 pwm interfaces, and only pwm1 failed to enter manual mode.

hwmon2/device is w83795g

Found the following PWM controls:
hwmon2/device/pwm1
hwmon2/device/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as
it is more efficient and it reacts faster. Are you sure that
you want to setup this output for manual control? (n) y
hwmon2/device/pwm1 stuck to 125

While trying to turn them off, I watched syslog:

During pwm3 test:
Apr 7 08:40:48 rage kernel: [ 1617.363333] w83795 0-002f: Failed to read from register 0x023, err -6

I then searched for the pwm controls manually and tried adjusting them. I was able to reduce fan noise considerably by echoing 0 to pwm1, and I brought it back up by echoing 125 to it. I didn't notice any change with the other pwms. Also, the fan speed as reported by sensors stayed constant, even though the fans had obviously slowed down considerably.

# for PWM in $(find . -name "pwm[0-8]"); do echo $PWM; echo 0 > $PWM; echo -n "Off ($(cat $PWM))..."; sleep 5; echo 125 > $PWM; echo "On ($(cat $PWM))"; done
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm1
Off (0)...On (119)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm2
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm3
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm4
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm5
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm6
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm7
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm8
Off (0)...On (0)
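
The loop above writes to each pwm file without checking which mode the channel is in. The standard hwmon sysfs interface exposes a pwmN_enable attribute next to each pwmN, where the value 1 selects manual control. A minimal sketch of a more careful probe follows; probe_pwm is a hypothetical helper name, and the optional hold time is an assumption added to leave time to listen for a change.

```shell
#!/bin/sh
# Sketch: write a PWM value with the channel in manual mode, report what
# reads back, then restore the previous state. probe_pwm is hypothetical.
# $1 = sysfs pwm file (e.g. .../0-002f/pwm1), $2 = duty cycle 0-255,
# $3 = optional seconds to hold the value (default 0).
probe_pwm() {
    pwm=$1
    value=$2
    hold=${3:-0}
    enable="${pwm}_enable"

    old_mode=$(cat "$enable") || return 1
    old_value=$(cat "$pwm") || return 1

    echo 1 > "$enable" || return 1   # 1 = manual control in the hwmon sysfs ABI
    echo "$value" > "$pwm"
    sleep "$hold"
    readback=$(cat "$pwm")
    echo "$pwm: wrote $value, read back $readback"

    # Put the channel back the way it was found.
    echo "$old_value" > "$pwm"
    echo "$old_mode" > "$enable"

    # Succeed only if the chip accepted the value.
    [ "$readback" = "$value" ]
}

# Typical use, as root:
#   probe_pwm /sys/class/hwmon/hwmon2/device/pwm1 0 5
```

A non-zero exit status means the value did not stick, which is the situation pwmconfig reports as "stuck to 125".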

I ran pwmconfig again... and it didn't complain about pwm1 not entering manual mode. It was also able to bring the fans up and shut them down with pwm1. It did NOT detect a correlation however.

I hit a bug in pwmconfig when configuring the pwm temperature input and fan speeds:

--------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35

Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60): /usr/sbin/pwmconfig: line 923: [: -eq: unary operator expected
/usr/sbin/pwmconfig: line 949: [: -eq: unary operator expected
--------------

923:
if [ $FAN_MIN -eq 0 ]
949:
if [ $FAN_MIN -eq 0 ]

Apparently, earlier in the script (line 877):

FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`

sets FAN_MIN to "" instead of a number. Adding some debug confirms this:
FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`
# dvhart debug
if [ -z "$FAN_MIN" ]; then
    echo "FAN_MIN detection failed, setting to 0."
    FAN_MIN=0
fi

------------
FAN_MIN detection failed, setting to 0.
------------
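
The "unary operator expected" message comes from the unquoted expansion: with FAN_MIN empty, the test at lines 923 and 949 becomes `[ -eq 0 ]`. A minimal sketch of a defensive form follows, assuming an empty value should be treated as 0 as in the debug patch above; is_fan_min_zero is a hypothetical helper name, not pwmconfig code.

```shell
#!/bin/sh
# Sketch: a test that tolerates an empty or unset FAN_MIN.
# is_fan_min_zero is a hypothetical helper, not part of pwmconfig.
is_fan_min_zero() {
    fan_min=${1:-0}          # fall back to 0 when the argument is empty
    [ "$fan_min" -eq 0 ]     # quoted, so the test always has two operands
}

FAN_MIN=""
if is_fan_min_zero "$FAN_MIN"; then
    echo "FAN_MIN is zero or was not detected"
fi
```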

------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35

Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60):
Enter the minimum PWM value (0-255)
at which the fan STOPS spinning (press t to test) (100): t

Now we decrease the PWM value to figure out the lowest usable value.
We will use a slightly greater value as the minimum speed.
------------

After fixing that, the detection of the lowest value (where the fan stops) ran for 30 minutes without indicating any forward progress or making an audibly detectable change in fan speed. I tried adjusting it manually, and was able to make several speed adjustments, finding the min value somewhere between 35 and 50 (sys reports 'pwm1_start: 48'). Before I could finish, the interface stopped responding to commands. I reloaded the w83795 module, and pwmconfig then reported:

/usr/sbin/pwmconfig: There are no fan-capable sensor modules installed

And sensors only reported:

# sensors
w83795g-i2c-0-2f
Adapter: SMBus I801 adapter at 0400
beep_enable:enabled
Post by Jean Delvare
Does the board manual say whether the case fans are supposed to be
controllable, or only the CPU fans?
It is rather vague on the topic unfortunately:

"Fan status monitor with firmware control and CPU fan auto-off in sleep mode"
"Pulse Width Modulation (PWM) Fan Control"
"The PC health monitor can check the RPM status of the cooling fans. The onboard CPU and chassis fans are controlled by Thermal Management via BIOS (under Hardware Monitoring in the Advanced Setting)."

And under the Nuvoton WPCM450R Controller (the baseboard management controller):
"The WPCM450R communicates with onboard components via six SMBus interfaces, fan control, and Platform Environment Control Interface (PECI) buses."

The case fans are definitely controllable given my experiment above on pwm1. pwm2 doesn't appear to do anything... and I'm not sure what 3-8 are supposed to do :-)
Post by Jean Delvare
$ rage-ipmi.sh sensor
FAN 1 | na | RPM | na | na | na | na | na | na | na
FAN 2 | 1936.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | na | RPM | na | na | na | na | na | na | na
FAN 4 | 2704.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 1764.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 6 | na | RPM | na | na | na | na | na | na | na
CPU1 Vcore | 0.952 | Volts | ok | 0.776 | 0.800 | 0.824 | 1.352 | 1.376 | 1.400
CPU2 Vcore | 0.952 | Volts | ok | 0.776 | 0.800 | 0.824 | 1.352 | 1.376 | 1.400
CPU1 DIMM | 1.520 | Volts | ok | 1.288 | 1.312 | 1.336 | 1.656 | 1.680 | 1.704
CPU2 DIMM | 1.520 | Volts | ok | 1.288 | 1.312 | 1.336 | 1.656 | 1.680 | 1.704
+1.5 V | na | Volts | na | na | na | na | na | na | na
+5 V | 5.056 | Volts | ok | 4.416 | 4.448 | 4.480 | 5.536 | 5.568 | 5.600
+5VSB | 5.056 | Volts | ok | 4.416 | 4.448 | 4.480 | 5.536 | 5.568 | 5.600
+12 V | 12.137 | Volts | ok | 10.600 | 10.653 | 10.706 | 13.250 | 13.303 | 13.356
-12 V | -11.904 | Volts | ok | -13.650 | -13.456 | -13.262 | -10.546 | -10.352 | -10.158
VTT | 1.112 | Volts | ok | 0.808 | 0.816 | 0.824 | 1.320 | 1.336 | 1.352
+3.3VCC | 3.264 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
+3.3VSB | 3.264 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
VBAT | 3.096 | Volts | ok | 2.880 | 2.904 | 2.928 | 3.648 | 3.672 | 3.696
CPU1 Temp | 0x1 | discrete | 0x0000| na | na | na | na | na | na
CPU2 Temp | 0x1 | discrete | 0x0000| na | na | na | na | na | na
System Temp | 40.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 75.000 | 77.000 | 79.000
P1-DIMM1A | 37.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 65.000 | 70.000 | 75.000
P1-DIMM2A | na | degrees C | na | na | na | na | na | na | na
P1-DIMM3A | na | degrees C | na | na | na | na | na | na | na
P2-DIMM1A | 37.000 | degrees C | ok | -9.000 | -7.000 | -5.000 | 65.000 | 70.000 | 75.000
P2-DIMM2A | na | degrees C | na | na | na | na | na | na | na
P2-DIMM3A | na | degrees C | na | na | na | na | na | na | na
Chassis Intru | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS Status | 0x1 | discrete | 0x01ff| na | na | na | na | na | na
$ dmesg | grep 83795
[ 12.643929] i2c i2c-0: Found w83795adg rev. B at 0x2f
[ 12.883789] w83795 0-002f: PECI agent 1 Tbase temperature: 100
[ 12.903779] w83795 0-002f: PECI agent 2 Tbase temperature: 100
[ 2288.932629] w83795 0-002f: Failed to read from register 0x030, err -6
[ 2613.292773] w83795 0-002f: Failed to write to register 0x040, err -6
[ 2693.333461] w83795 0-002f: Failed to read from register 0x01e, err -11
-6 is -ENXIO, returned by the i2c-i801 driver when a slave I2C device
doesn't answer. -11 is -EAGAIN, meaning arbitration loss, which can
happen on multi-master I2C buses, and I guess IPMI is implemented
exactly that way.
Am I doing something wrong?
Yes. You are using IPMI and a native Linux driver to access the same
monitoring chip. Both access methods don't know of each other and are
not synchronized.
OK, I removed the ipmi_si driver early on and am still seeing the problems described above.
Post by Jean Delvare
Can I provide any additional information to
help narrow down what might be wrong?
Choose between IPMI and native drivers. If you want to use IPMI on this
board, then you have to forget about the w83795 driver. And about
software-driven fan speed control too, I'm afraid.
Does that mean all IPMI features? I'd hate to have to lose SOL and power control.
Post by Jean Delvare
Did you look for a BIOS or IPMI firmware update already?
IPMI is current.
BIOS had an update available. After hunting down a FreeDOS USB boot image, I managed to flash it. pwmconfig is much happier now, and the sensors report the fan speed correctly. pwmconfig walked through the PWM:RPM mapping for fan2_input, and all three fans dropped along with it. When it started in on fan4_input, it produced an error:

----------
hwmon2/device/fan4_input ... speed was 4285 now 1058
It appears that fan hwmon2/device/fan4_input
is controlled by pwm hwmon2/device/pwm1
/usr/sbin/pwmconfig: line 464: hwmon2/device: expression recursion level exceeded (error token is "device")
Testing is complete.
----------

line 464
fanactive="$(($j+${fanactive}))" #not supported yet by fancontrol
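
Inside $(( )), bash treats bare words as variable names and "/" as division, so a path such as hwmon2/device/fan4_input is evaluated as an arithmetic expression rather than kept as a string. If the intent of line 464 is to build a "+"-separated list of fan inputs (an assumption based on the comment on that line), plain string concatenation avoids the arithmetic evaluation. A minimal sketch, with append_fan as a hypothetical helper name rather than the upstream fix:

```shell
#!/bin/sh
# Sketch: join fan input paths with "+" using string concatenation only.
# append_fan is a hypothetical helper, not code from pwmconfig.
# $1 = fan input to add, $2 = list built so far (may be empty).
append_fan() {
    new=$1
    existing=$2
    if [ -z "$existing" ]; then
        echo "$new"
    else
        echo "$new+$existing"
    fi
}

fanactive=""
for j in hwmon2/device/fan2_input hwmon2/device/fan4_input; do
    fanactive=$(append_fan "$j" "$fanactive")
done
echo "$fanactive"
```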


fancontrol appears to work now as well. It appears all my fans are connected to the same PWM control, which is pretty unfortunate, but things are MUCH better now than they were. It appears there are a few scripting bugs in pwmconfig (at least in my distro version) that can be corrected with some string checking, but the core problem appears to be a buggy BIOS - big surprise ;-)

I am not sure which temperature sensor to use to control pwm1. I don't trust the temp1 input of 82°C; temp5 reads 39 at idle, and temp7 and temp8 read about 25 at idle, while the coretemp sensors read 24-29.

temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +39.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +25.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +22.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI

# sensors | grep Core
Core 0: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +23.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +17.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +20.0°C (high = +81.0°C, crit = +101.0°C)


And as I'm typing this, dmesg started spewing a lot of errors and temp1-5 now report 0°C

[ 1056.545180] w83795 0-002f: Failed to write to register 0x040, err -6
[ 1056.585158] w83795 0-002f: Failed to read from register 0x041, err -6
[ 1056.605143] w83795 0-002f: Failed to read from register 0x042, err -6
[ 1056.645123] w83795 0-002f: Failed to read from register 0x043, err -6
[ 1056.685094] w83795 0-002f: Failed to read from register 0x044, err -6
[ 1056.705084] w83795 0-002f: Failed to read from register 0x045, err -6
[ 1056.745057] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1056.765044] w83795 0-002f: Failed to write to register 0x040, err -6
....
[ 1060.442767] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.482745] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.502728] w83795 0-002f: Failed to set bank to 2, err -6
...
[ 1060.702605] w83795 0-002f: Failed to read from register 0x040, err -6
[ 1060.722590] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1060.762569] w83795 0-002f: Failed to write to register 0x040, err -6
...
and on for pages.

Reloading w83795 stops the messages, but the w83795 sensors don't come back.

OK, that's a ton of data, hopefully it's good data.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Jean Delvare
2011-04-08 12:46:45 UTC
Hi Darren,
Post by Darren Hart
Post by Jean Delvare
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
This is very hot.
It is... and yet it's much hotter than anything reported by coretemp (which
I assumed would have some of the higher temperatures).
Not necessarily, depending on your cooling mechanism. These days,
several parts of the system can be much hotter than the CPU, in
particular the graphics chip (for high end graphics cards) and the
north bridge.
Post by Darren Hart
Any idea what temp1 might be measuring?
Could be the north bridge. On my own Intel 5500-based system, I am
using an external sensor to monitor the north bridge temperature, and
here is what I get:

TR2 Temp: +92.2°C (high = +85.0°C, hyst = +82.0°C) ALARM
(crit = +90.0°C, crit hyst = +87.0°C) sensor = thermistor

And I've already seen it hotter than this.
Post by Darren Hart
$ sensors | grep °C
Core 0: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +22.0°C (high = +81.0°C, crit = +101.0°C)
temp1: +40.0°C (high = +138.0°C, hyst = +96.0°C) sensor = thermistor
temp2: -61.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp3: +36.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +35.8°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +24.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +23.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
Core 9: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +20.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +15.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +22.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +19.0°C (high = +81.0°C, crit = +101.0°C)
Unrelated to your issue, but the core numbering by coretemp is
surprising. I'm curious if you see the same in /proc/cpuinfo.

Please note that the temperatures reported by coretemp are not real,
absolute °C. They are a delta from the critical limit, the accuracy of
which degrades quickly with large deltas (i.e. low temperatures.) So,
all that can be said from the above "Core" temperature values is that
your CPUs run very cool and way below their critical limit (which is
good.)
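
The arithmetic behind those values can be written out as follows. This is only an illustration of the relationship described above; coretemp_from_delta is a hypothetical helper name, and the real driver obtains both numbers from the CPU rather than from arguments.

```shell
#!/bin/sh
# Sketch: the displayed core temperature as "critical limit minus delta".
# coretemp_from_delta is a hypothetical helper for illustration only.
# $1 = critical limit in degrees C, $2 = delta below that limit.
coretemp_from_delta() {
    echo $(( $1 - $2 ))
}

# With the 101 degree critical limit shown above, a delta of 75
# corresponds to the +26.0 reading for Core 0.
coretemp_from_delta 101 75
```

The larger the delta, the less the result can be trusted as an absolute temperature.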

Two of the three temperatures reported by the w83627ehf driver look
sane, so my advice to not load this driver might not have been correct.
It may be better to load it, and configure libsensors to ignore all the
unused inputs.
Post by Darren Hart
Post by Jean Delvare
(...)
---------------------------
hwmon0/device is max1617
This would be very surprising and smells like a misdetection. Which
could, in turn, explain (some of) your problems. Was the use of the
adm1021 driver suggested by sensors-detect?
Intel Core family thermal sensor... No
But if I load coretemp I get 12 sane temperature readings...
Presumably you are using a relatively old version of the sensors-detect
script. This version:
http://dl.lm-sensors.org/lm-sensors/files/sensors-detect
should find the Intel Core family thermal sensor. It might also solve
the adm1021 mystery... Could be that you have thermal sensors in your
memory modules, and the jc42 driver would report their temperature.
How did the adm1021 driver get loaded in the first place then? Please
note that sensors-detect needs hwmon drivers to be unloaded first to be
most efficient.
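To see which hwmon drivers are currently bound before running sensors-detect, the registered chip names can be listed from sysfs. This is a minimal sketch; list_hwmon_chips is a hypothetical helper name. Older kernels expose the name under hwmonN/device/name (as in the pwmconfig output above), newer ones under hwmonN/name, so both are tried.

```shell
#!/bin/sh
# Sketch: print "hwmonN: chipname" for every registered hwmon device.
# list_hwmon_chips is a hypothetical helper, not an lm-sensors tool.
# $1 = hwmon class directory (default /sys/class/hwmon).
list_hwmon_chips() {
    base=${1:-/sys/class/hwmon}
    for h in "$base"/hwmon*; do
        [ -d "$h" ] || continue
        # Newer kernels: hwmonN/name. Older kernels: hwmonN/device/name.
        for f in "$h/name" "$h/device/name"; do
            if [ -r "$f" ]; then
                echo "$(basename "$h"): $(cat "$f")"
                break
            fi
        done
    done
}

# Typical use on a live system:
#   list_hwmon_chips
```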
Post by Darren Hart
Trying family `National Semiconductor'... Yes
Found unknown chip with ID 0x1a11
No idea what it is, and this is somewhat surprising as you already have
one identified Super-I/O chip (W83627DHG-P, as documented by
Supermicro.)
Post by Darren Hart
If you say yes here you get support for Analog Devices ADM1021
and ADM1023 sensor chips and clones: Maxim MAX1617 and MAX1617A,
Genesys Logic GL523SM, National Semiconductor LM84, TI THMC10,
and the XEON processor built-in sensor.
These are XEON CPUs, is this an older interface that has been replaced by something else?
This really only applies to an old generation of Xeon processors which
were popular in 2003. These days this help text is seriously
misleading, I'll fix it. Thanks for reporting.
Post by Darren Hart
Post by Jean Delvare
I presume that the output
for the supposed max1617 chip in "sensors" is plain wrong? I would
advise that you do not load the adm1021 driver.
OK, unloaded.
Post by Jean Delvare
hwmon1/device is w83627dhg
Super-I/O (multifunction) chip, probably not used for monitoring.
Unloading the w83627ehf driver would make running pwmconfig much easier.
Done
As noted above, this driver might still be somewhat useful after all.
Post by Darren Hart
Post by Jean Delvare
(...)
The next steps in pwmconfig should tell. One thing worth noting is that
you have 6 fan inputs used on the W83795ADG, but the chip has only two
fan control outputs. So it is impossible that you have one control per
fan. On my board, pwm1 controls both CPU fans and pwm2 controls all 6
case fans.
I read somewhere during my hours of searching for a solution to this that
both CPU fans are controlled by the same pwm signal, so that is not
surprising. It's too bad about the case fans though, I really like to run
the larger quiet fan up before bringing up the smaller front fan, but,
it is what it is.
As you don't seem to be using the second CPU fan header, you could
cheat and plug your large rear fan in this header, so pwm1 would
control it (if we manage to get this to work at all...)

BTW, the Supermicro documentation is pretty clear that fan control is
only supported when using 4-pin fans. Is that what you're using?
Post by Darren Hart
I ran pwmconfig again with adm1021, ipmi_si, and w83627ehf unloaded. This
time it detected 8 pwm interfaces, and only pwm1 failed to enter manual mode.
hwmon2/device is w83795g
Ouch. Last time your chip was a W83795ADG (the small version with only
2 fan control outputs) and now you are supposed to have a W83795G (the
big version with 8 fan control outputs.) The Supermicro product
description doesn't tell which is present, but to be fair, I've never
seen a W83795G on a PC mainboard so far, only W83795ADG.

Anyway, this suggests unreliable I/O on the SMBus. So even though you
have unloaded ipmi_si, which should guarantee that the Linux host isn't
accessing the chip through IPMI, I suspect that something else is still
accessing the chip behind our back. A BMC for remote management?

Didn't you get an error message in the kernel logs related to w83795
register 0x001? This is where the driver gets the chip type from.

I think I get what's happening. The W83795G/ADG chips have so-called
banked registers, which means that you have to select the right bank
before accessing a given register. To improve register access time, the
driver remembers the currently selected bank, and only selects a
different bank when needed. Now, if somebody else accesses the chip
behind our back, this assumption suddenly becomes wrong.

I could change the driver to unconditionally set the bank before any
register access, at the price of severely decreased performance.
However, even this would not completely solve the problem, as whoever
else is accessing the chip might do so between the w83795 driver
setting the bank and the w83795 driver reading (or writing) the
register value - and nothing can be done against this.
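
The stale-bank problem can be illustrated with a small simulation. This is only a sketch: `BankedChip`, `CachingDriver` and the register contents are made up for illustration and do not reflect the real W83795 register map or the actual driver code.

```python
class BankedChip:
    """Toy chip: one bank-select register plus per-bank data registers."""

    def __init__(self):
        self.bank = 0
        # Hypothetical contents: (bank, register) -> value
        self.regs = {(0, 0x10): 0xAA, (2, 0x10): 0x55}

    def select_bank(self, bank):
        self.bank = bank

    def read(self, reg):
        return self.regs[(self.bank, reg)]


class CachingDriver:
    """Remembers the last bank it selected to save bus transactions."""

    def __init__(self, chip):
        self.chip = chip
        self.cached_bank = None

    def read(self, bank, reg):
        if self.cached_bank != bank:   # only switch when we think we must
            self.chip.select_bank(bank)
            self.cached_bank = bank
        return self.chip.read(reg)


chip = BankedChip()
driver = CachingDriver(chip)

first = driver.read(0, 0x10)    # driver selects bank 0 and caches that fact
chip.select_bank(2)             # a second master (e.g. a BMC) switches banks
second = driver.read(0, 0x10)   # cache says bank 0, chip is really on bank 2

print(hex(first), hex(second))
```

The second read silently returns the bank 2 value. Selecting the bank before every access would fix this particular sequence, but a bank switch by the other master between the select and the read would still produce a wrong value.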

The bottom line is that using the W83795 driver in a multi-master I2C
setup (and I strongly suspect this is what Supermicro did) is a bad
hardware design mistake. This hardware monitoring device wasn't
designed with this use case in mind.
Post by Darren Hart
hwmon2/device/pwm1
hwmon2/device/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as
it is more efficient and it reacts faster. Are you sure that
you want to setup this output for manual control? (n) y
hwmon2/device/pwm1 stuck to 125
Apr 7 08:40:48 rage kernel: [ 1617.363333] w83795 0-002f: Failed to read from register 0x023, err -6
The driver was temporarily unable to read the in19 value.
Post by Darren Hart
I then searched for the pwm controls manually and tried adjusting them.
I was able to reduce fan noise considerably by echo'ing 0 to pwm1, and I
brought it back up by echo'ing 125 to it. I didn't notice any change
Odd, this is exactly what pwmconfig is doing. It's hard to explain how
pwmconfig could consistently fail and your manual attempt worked right
away. It may not always work though?
Post by Darren Hart
with the other pwms. Also, the fan speed as reported by sensors stayed
constant, even though they obviously had slowed down considerably.
My bet is that you don't have pwm3 to pwm8 anyway, so it's expected
they had no effect.
Post by Darren Hart
# for PWM in $(find . -name "pwm[0-8]"); do echo $PWM; echo 0 > $PWM; echo -n "Off ($(cat $PWM))..."; sleep 5; echo 125 > $PWM; echo "On ($(cat $PWM))"; done
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm1
Off (0)...On (119)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm2
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm3
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm4
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm5
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm6
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm7
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm8
Off (0)...On (0)
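
In the listing above, pwm1 read back 119 after 125 was written, and the other outputs read back 0. A read-back check of that kind can be sketched in Python as follows. The helper names are hypothetical, and the demo runs against a temporary file standing in for a sysfs pwm attribute, so it only exercises the logic, not the hardware.

```python
import tempfile
from pathlib import Path


def write_and_read_back(pwm_path, value):
    """Write a duty cycle (0-255) to a pwm file and return what reads back."""
    path = Path(pwm_path)
    path.write_text(f"{value}\n")
    return int(path.read_text().strip())


def check_pwm(pwm_path, values=(0, 125)):
    """Return a list of (requested, read_back) pairs for each value tried."""
    return [(v, write_and_read_back(pwm_path, v)) for v in values]


# Demo against a temporary file standing in for a sysfs pwm attribute.
with tempfile.TemporaryDirectory() as tmp:
    fake_pwm = Path(tmp) / "pwm1"
    fake_pwm.write_text("0\n")
    results = check_pwm(fake_pwm)

# Any pair where the read-back differs from the request deserves a closer look.
mismatches = [(want, got) for want, got in results if want != got]
print(results, mismatches)
```

On real hardware the path would be a pwm attribute such as the ones listed above, writing requires root, and the output has to be in manual mode first. A mismatch there means the write did not land where the driver thought it would.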
I ran pwmconfig again... and it didn't complain about pwm1 not entering
manual mode. It was also able to bring the fans up and shut them down
with pwm1. It did NOT detect a correlation however.
This is all consistent with my theory about random bank switches.
Post by Darren Hart
--------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35
Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60): /usr/sbin/pwmconfig: line 923: [: -eq: unary operator expected
/usr/sbin/pwmconfig: line 949: [: -eq: unary operator expected
--------------
if [ $FAN_MIN -eq 0 ]
if [ $FAN_MIN -eq 0 ]
Your line numbers don't match mine, which means you aren't using the
latest upstream version of pwmconfig. So I can't help, sorry.
Post by Darren Hart
FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`
FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`
# dvhart debug
if [ -z "$FAN_MIN" ]; then
echo "FAN_MIN detection failed, setting to 0."
FAN_MIN=0
fi
------------
FAN_MIN detection failed, setting to 0.
------------
This certainly explains why a correlation couldn't be found. Your
workaround however is not correct. If fanactive_min has fewer elements
than expected, this means that CURRENT_SPEEDS does too, but you don't know
which ones are missing, because CURRENT_SPEEDS is a string, not an
array. We should really be using proper bash arrays for robustness, but
I simply don't have the time to work on this these days.
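
The positional-lookup problem can be shown with a small Python analogue. This is not pwmconfig code: the `field` helper only mimics `cut -d' ' -f N`, and the fan names and speeds are invented for the example.

```python
def field(space_separated, index):
    """Mimic `cut -d' ' -f<index>` (1-based); missing fields yield ''."""
    parts = space_separated.split(" ")
    return parts[index - 1] if index <= len(parts) else ""


# Hypothetical data: three fans were expected, but one reading was lost.
expected_fans = ["fan1_input", "fan2_input", "fan4_input"]
min_speeds_string = "1200 900"   # which fan is missing? The string can't say.

by_position = {fan: field(min_speeds_string, i + 1)
               for i, fan in enumerate(expected_fans)}

# Keyed storage keeps the association even when a reading is absent.
min_speeds_keyed = {"fan1_input": 1200, "fan4_input": 900}
by_key = {fan: min_speeds_keyed.get(fan) for fan in expected_fans}

print(by_position)
print(by_key)
```

With the flat string, the value 900 ends up attributed to fan2_input and the last fan gets an empty value, which a fallback to 0 would then hide. With keyed storage the missing reading is visible as `None` on the right fan.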

Overall the pwmconfig (and fancontrol) code isn't good quality, partly
because it started as an afternoon hack and has grown way too old,
partly because writing nice and efficient code in bash can be quite
challenging. I think someone posted on the lm-sensors list to announce
a rewrite in C, which might be a better starting point.
Post by Darren Hart
------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35
Enter the high temperature (degree C)
Enter the minimum PWM value (0-255)
at which the fan STOPS spinning (press t to test) (100): t
Now we decrease the PWM value to figure out the lowest usable value.
We will use a slightly greater value as the minimum speed.
------------
After fixing that, the detection of the lowest value (where the fan
stops) ran for 30 minutes without indicating any forward progress or
making an audibly detectable change in fan speed. I tried adjusting
it manually, and was able to make several speed adjustments, finding
This suggests more problems in pwmconfig, it isn't supposed to behave
that way. But again the root cause is probably the kernel driver not
behaving in the standard way pwmconfig expects. In turn caused by the
hardware playing tricks on you.
Post by Darren Hart
48'). Before I could finish, the interface stopped responding to
/usr/sbin/pwmconfig: There are no fan-capable sensor modules installed
# sensors
w83795g-i2c-0-2f
Adapter: SMBus I801 adapter at 0400
beep_enable:enabled
Wow. Your system is very strange. I can't even think of how such an
output would be possible at all.
Post by Darren Hart
Post by Jean Delvare
Does the board manual say whether the case fans are supposed to be
controllable, or only the CPU fans?
"Fan status monitor with firmware control and CPU fan auto-off in sleep mode"
"Pule Width Modulation (PWM) Fan Control"
"The PC health monitor can check the RPM status of the cooling fans. The
onboard CPU and chassis fans are controlled by Thermal Management via BIOS
(under Hardware Monitoring in the Advanced Setting)."
I read this as: all fans should be controllable.
Post by Darren Hart
"The WPCM450R communicates with onboard components via six SMBus interfaces,
fan control, and Platform Environment Control Interface (PECI) buses."
This seems to be a complex setup, unfortunately the block diagram in
the manual mentions neither SMBus nor PECI.
Post by Darren Hart
The case fans are definitely controllable given my experiment above on pwm1.
pwm2 doesn't appear to do anything... and I'm not sure what 3-8 are supposed
to do :-)
As said before, I am certain you won't have pwm3-8 at all so they
aren't supposed to do anything.
Post by Darren Hart
Post by Jean Delvare
(...)
$ dmesg | grep 83795
[ 12.643929] i2c i2c-0: Found w83795adg rev. B at 0x2f
[ 12.883789] w83795 0-002f: PECI agent 1 Tbase temperature: 100
[ 12.903779] w83795 0-002f: PECI agent 2 Tbase temperature: 100
[ 2288.932629] w83795 0-002f: Failed to read from register 0x030, err -6
[ 2613.292773] w83795 0-002f: Failed to write to register 0x040, err -6
[ 2693.333461] w83795 0-002f: Failed to read from register 0x01e, err -11
-6 is -ENXIO, returned by the i2c-i801 driver when a slave I2C device
doesn't answer. -11 is -EAGAIN, meaning arbitration loss, which can
happen on multi-master I2C buses, and I guess IPMI is implemented
exactly that way.
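
For sorting through pages of such messages, a small parser can help. The sketch below assumes the message format shown in the log excerpt above and hard-codes only the two Linux errno values named here. The function and table names are made up.

```python
import re

# Linux errno values named in the thread (the kernel log shows them negated).
LINUX_ERRNO_NAMES = {6: "ENXIO", 11: "EAGAIN"}

LINE_RE = re.compile(
    r"w83795 (?P<dev>\S+): Failed to (?P<op>read from|write to) "
    r"register (?P<reg>0x[0-9a-fA-F]+), err -(?P<err>\d+)"
)


def parse_failure(line):
    """Return (operation, register, errno name) or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group("err"))
    name = LINUX_ERRNO_NAMES.get(code, f"errno {code}")
    return m.group("op"), int(m.group("reg"), 16), name


sample = ("[ 2693.333461] w83795 0-002f: Failed to read from "
          "register 0x01e, err -11")
print(parse_failure(sample))
```

Counting the results by errno name would show at a glance whether the device is mostly unreachable (ENXIO) or mostly losing arbitration (EAGAIN).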
Am I doing something wrong?
Yes. You are using IPMI and a native Linux driver to access the same
monitoring chip. Neither access method knows about the other, and they
are not synchronized.
OK, I removed the ipmi_si driver early on and am still seeing the problems described above.
Probably caused by concurrent accesses from the BMC.
Post by Darren Hart
Post by Jean Delvare
Can I provide any additional information to
help narrow down what might be wrong?
Choose between IPMI and native drivers. If you want to use IPMI on this
board, then you have to forget about the w83795 driver. And about
software-driven fan speed control too, I'm afraid.
Does that mean all IPMI features? I'd hate to have to lose SOL and power control.
It's hard to tell what exactly IPMI is doing. Clearly if you want to
use IPMI then the w83795 driver is out IMHO, and you'll suffer from the
lack of integration between IPMI and libsensors.
Post by Darren Hart
Post by Jean Delvare
Did you look for a BIOS or IPMI firmware update already?
IPMI is current.
BIOS had an update available. After hunting down a FreeDOS USB boot image, I
managed to flash it. pwmconfig is much happier now, and the sensors report
the fan speed correctly now. pwmconfig walked through the PWM:RPM mapping
for fan2_input, and all three fans dropped along with it. When it started
----------
hwmon2/device/fan4_input ... speed was 4285 now 1058
It appears that fan hwmon2/device/fan4_input
is controlled by pwm hwmon2/device/pwm1
/usr/sbin/pwmconfig: line 464: hwmon2/device: expression recursion level exceeded (error token is "device")
Testing is complete.
----------
line 464
fanactive="$(($j+${fanactive}))" #not supported yet by fancontrol
I had never seen this error message before. But I don't have the
line above in my copy of pwmconfig either. Are you by any chance using a
packaged version with custom patches?
Post by Darren Hart
fancontrol appears to work now as well. It appears all my fans are connected
to the same PWM control, which is pretty unfortunate, but things are MUCH
better now than they were. It appears there are a few scripting bugs in
pwmconfig (at least in my distro version) that can be corrected with
Please test the upstream version. If you find bugs in your distro
version which aren't upstream, report to them, not us. And please ask
them to push their changes upstream (if they are good) or drop them (if
not.)
Post by Darren Hart
some string checking, but the core problem appears to be a buggy BIOS -
big surprise ;-)
I don't want to bash your optimism, but... My personal impression is
that there is a severe design issue on this board, which will prevent
you from using the w83795 driver.
Post by Darren Hart
I am not sure which temperature sensor to use to control pwm1. I don't trust
the temp1 input of 82C, temp5 reads 39 idle, and 7 and 8 read about 25 idle.
While the coretemp sensors read 24-29.
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +39.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +25.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +22.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp5 is the system (board) temperature, temp7 is CPU1 and temp8 is
CPU2. I would use temp5 for case fans, and temp7 for CPU fans. A
perfect fan control system would allow you to take the max or average
of multiple temperatures, but we don't support this.
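
What "max or average of multiple temperatures" could look like is sketched below in Python, together with a simple linear ramp from a low to a high temperature. This is an illustration only, not how fancontrol is implemented. The function names are hypothetical, and the limits (35 and 60 degrees C, minimum PWM 100) are assumed values for the example.

```python
def control_temperature(readings, mode="max"):
    """Combine several temperature readings (degrees C) into one value."""
    values = list(readings.values())
    if mode == "max":
        return max(values)
    if mode == "avg":
        return sum(values) / len(values)
    raise ValueError(f"unknown mode: {mode}")


def pwm_for(temp, low_temp, high_temp, min_pwm, max_pwm=255):
    """Linear ramp from min_pwm at low_temp to max_pwm at high_temp."""
    if temp <= low_temp:
        return min_pwm
    if temp >= high_temp:
        return max_pwm
    span = (temp - low_temp) / (high_temp - low_temp)
    return round(min_pwm + span * (max_pwm - min_pwm))


# Readings taken from the sensors output above; the limits are assumptions.
cpu_temps = {"temp7": 25.0, "temp8": 22.8}
hottest = control_temperature(cpu_temps)
print(hottest, pwm_for(hottest, low_temp=35, high_temp=60, min_pwm=100))
```

Using the maximum is the conservative choice: the fans follow whichever CPU is hotter, at the cost of running faster than needed for the cooler one.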

But then again, in your case, software driven fan control seems out of
the question. Way too dangerous when you don't know if you'll be able
to access the monitoring chip the next minute. I really wish board
vendors would let people tweak the automatic fan speed control settings
in the BIOS. Asus offers several profiles, which is better than
nothing, but it would seem fair to let the user set the temperature
limits manually. Sigh.
Post by Darren Hart
# sensors | grep Core
Core 0: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +23.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +17.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +20.0°C (high = +81.0°C, crit = +101.0°C)
And as I'm typing this, dmesg started spewing a lot of errors and temp1-5 now report 0°C
[ 1056.545180] w83795 0-002f: Failed to write to register 0x040, err -6
[ 1056.585158] w83795 0-002f: Failed to read from register 0x041, err -6
[ 1056.605143] w83795 0-002f: Failed to read from register 0x042, err -6
[ 1056.645123] w83795 0-002f: Failed to read from register 0x043, err -6
[ 1056.685094] w83795 0-002f: Failed to read from register 0x044, err -6
[ 1056.705084] w83795 0-002f: Failed to read from register 0x045, err -6
[ 1056.745057] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1056.765044] w83795 0-002f: Failed to write to register 0x040, err -6
....
[ 1060.442767] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.482745] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.502728] w83795 0-002f: Failed to set bank to 2, err -6
...
[ 1060.702605] w83795 0-002f: Failed to read from register 0x040, err -6
[ 1060.722590] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1060.762569] w83795 0-002f: Failed to write to register 0x040, err -6
...
and on for pages.
Reloading w83795 stops the messages, but the w83795 sensors don't come back.
OK, that's a ton of data, hopefully it's good data.
Oh, I suddenly have an idea what may be going on. If I'm right, it's even
worse than I thought at first.

I guess that your SMBus is multiplexed. The errors -6 (-ENXIO) mean the
W83795ADG chip is unreachable, presumably because the multiplexer was
switched to a different segment. If the multiplexer is out of the
operating system's control (as seems to be the case here) then you
really have to give up the w83795 driver, much to my despair.

You may be able to get the w83795 driver working again by invoking
ipmitool. If IPMI knows how to switch back to the right SMBus segment,
it may leave it selected afterwards. But anyway this is just a trick,
nothing you can rely on in the long run, as the conflict between w83795
and the BMC isn't one we can solve.

It might be the right time for you to ask the Supermicro support for a
detailed topology of the I2C/SMBus on this board.
--
Jean Delvare
http://khali.linux-fr.org/wishlist.html
Darren Hart
2011-04-09 00:11:35 UTC
Hey Jean,

I really appreciate your thoughts here. I'll respond inline, but let me
give a summary. I've contacted SuperMicro and am hoping they'll get back
to me with a contact to help get some answers regarding how IPMI (WPCM450R)
and W83795-ADG (I checked the chip, -ADG) are supposed to interact and
still allow the OS to read temperature and control fans.

You are correct about temp1, that has to be the northbridge, it is
located right behind the PCI-E slots (which appears to be common
practice) and has a very inadequate heat sink. I'm considering replacing
it with a much more substantial heatsink and possibly adding a tunnel to
direct air over it. I've asked SuperMicro for a recommendation here as
well. If I can get that temperature down, my guess is the BIOS fan
control might be able to do a much better job and I won't need the
w83795-adg fancontrol from the OS quite so badly.
Post by Jean Delvare
Hi Darren,
Post by Darren Hart
Post by Jean Delvare
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
This is very hot.
It is... and yet it's much hotter than anything reported by coretemp (which
I assumed would have some of the higher temperatures).
Not necessarily, depending on your cooling mechanism. These days,
several parts of the system can be much hotter than the CPU, in
particular the graphics chip (for high end graphics cards) and the
north bridge.
bingo, north bridge
Post by Jean Delvare
Post by Darren Hart
Any idea what temp1 might be measuring?
Could be the north bridge. On my own Intel 5500-based system, I am
using an external sensor to monitor the north bridge temperature, and
TR2 Temp: +92.2°C (high = +85.0°C, hyst = +82.0°C) ALARM
(crit = +90.0°C, crit hyst = +87.0°C) sensor = thermistor
And I've already seen it hotter than this.
Post by Darren Hart
$ sensors | grep °C
Core 0: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +22.0°C (high = +81.0°C, crit = +101.0°C)
temp1: +40.0°C (high = +138.0°C, hyst = +96.0°C) sensor = thermistor
temp2: -61.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp3: +36.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +35.8°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +24.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +23.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
Core 9: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +20.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +15.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +22.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +19.0°C (high = +81.0°C, crit = +101.0°C)
Unrelated to your issue, but the core numbering by coretemp is
surprising. I'm curious if you see the same in /proc/cpuinfo.
No I do not. The Core ID you see above refers to physical cores per
socket (there are six per socket). I had also found this odd and wrote
one of the authors of coretemp about it. There appears to be some effort
ongoing to try and get those numbers to align with what is used in the
rest of the system to identify CPUs. Note that cpuinfo lists 24 CPUs due
to hyper-threading, while coretemp is only concerned with physical cores.
Post by Jean Delvare
Please note that the temperatures reported by coretemp are not real,
absolute °C. They are a delta from the critical limit, the accuracy of
which degrades quickly with large deltas (i.e. low temperatures.) So,
all that can be said from the above "Core" temperature values is that
your CPUs run very cool and way below their critical limit (which is
good.)
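
Since the coretemp values are best read as a distance from the critical limit, it can be more useful to look at that distance directly. A minimal sketch, using the first four "Core" lines quoted above and their reported critical limit (the `headroom` helper is hypothetical):

```python
def headroom(reported_c, critical_c):
    """Distance in degrees C between a reading and its critical limit."""
    return critical_c - reported_c


# Values copied from the first four "Core" lines quoted above.
CRITICAL_C = 101.0
cores = {"Core 0": 26.0, "Core 1": 26.0, "Core 2": 24.0, "Core 8": 22.0}

margins = {name: headroom(temp, CRITICAL_C) for name, temp in cores.items()}
closest = min(margins, key=margins.get)   # core with the least headroom
print(margins, closest)
```

Every core here is 75 degrees or more below its limit, which is the only conclusion these readings really support.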
Noted! Thanks.
Post by Jean Delvare
Two of the three temperatures reported by the w83627ehf driver look
sane, so my advice to not load this driver might not have been correct.
It may be better to load it, and configure libsensors to ignore all the
unused inputs.
OK.
Post by Jean Delvare
Post by Darren Hart
Post by Jean Delvare
(...)
---------------------------
hwmon0/device is max1617
This would be very surprising and smells like a misdetection. Which
could, in turn, explain (some of) your problems. Was the use of the
adm1021 driver suggested by sensors-detect?
Intel Core family thermal sensor... No
But if I load coretemp I get 12 sane temperature readings...
Presumably you are using a relatively old version of the sensors-detect
script. This version:
http://dl.lm-sensors.org/lm-sensors/files/sensors-detect
should find the Intel Core family thermal sensor. It might also solve
the adm1021 mystery... Could be that you have thermal sensors in your
memory modules, and the jc42 driver would report their temperature.
How did the adm1021 driver get loaded in the first place then? Please
note that sensors-detect needs hwmon drivers to be unloaded first to be
most efficient.
Perhaps it was detected under the Ubuntu kernel, not sure.
Post by Jean Delvare
Post by Darren Hart
Trying family `National Semiconductor'... Yes
Found unknown chip with ID 0x1a11
No idea what it is, and this is somewhat surprising as you already have
one identified Super-I/O chip (W83627DHG-P, as documented by
Supermicro.)
Post by Darren Hart
? If you say yes here you get support for Analog Devices ADM1021 ?
? and ADM1023 sensor chips and clones: Maxim MAX1617 and MAX1617A, ?
? Genesys Logic GL523SM, National Semiconductor LM84, TI THMC10, ?
? and the XEON processor built-in sensor.
These are XEON CPUs, is this an older interface that has been replaced by something else?
This really only applies to an old generation of Xeon processors which
were popular in 2003. These days this help text is seriously
misleading, I'll fix it. Thanks for reporting.
Cool, thanks.
Post by Jean Delvare
Post by Darren Hart
Post by Jean Delvare
I presume that the output
for the supposed max1617 chip in "sensors" is plain wrong? I would
advise that you do not load the adm1021 driver.
OK, unloaded.
Post by Jean Delvare
hwmon1/device is w83627dhg
Super-I/O (multifunction) chip, probably not used for monitoring.
Unloading the w83627ehf driver would make running pwmconfig much easier.
Done
As noted above, this driver might still be somewhat useful after all.
Got it.
Post by Jean Delvare
Post by Darren Hart
Post by Jean Delvare
(...)
The next steps in pwmconfig should tell. One thing worth noting is that
you have 6 fan inputs used on the W83795ADG, but the chip has only two
fan control outputs. So it is impossible that you have one control per
fan. On my board, pwm1 controls both CPU fans and pwm2 controls all 6
case fans.
I read somewhere during my hours of searching for a solution to this that
both CPU fans are controlled by the same pwm signal, so that is not
surprising. It's too bad about the case fans though, I really like to run
the larger quiet fan up before bringing up the smaller front fan, but,
it is what it is.
As you don't seem to be using the second CPU fan header, you could
cheat and plug your large rear fan in this header, so pwm1 would
control it (if we manage to get this to work at all...)
Turns out if I turn both fan housings around and flip the fans I can get
them both in the system (barely). I have it running like this for now -
but I think it's overkill really, and the CPUs don't break 40C even
under a 24 way kernel compile or four parallel 24 way poky builds.
Post by Jean Delvare
BTW, the Supermicro documentation is pretty clear that fan control is
only supported when using 4-pin fans. Is it what you're using?
Yes, all 4 fans are 4-pin - and they are all the recommended SuperMicro
fans.
Post by Jean Delvare
Post by Darren Hart
I ran pwmconfig again with adm1021, ipmi_si, and w83627ehf unloaded. This
time it detected 8 pwm interfaces, and only pwm1 failed to enter manual mode.
hwmon2/device is w83795g
Ouch. Last time your chip was a W83795ADG (the small version with only
2 fan control outputs) and now you are supposed to have a W83795G (the
big version with 8 fan control outputs.) The Supermicro product
description doesn't tell which is present, but to be fair, I've never
seen a W83795G on a PC mainboard so far, only W83795ADG.
Physical inspection confirms this is a W83795-ADG.
Post by Jean Delvare
Anyway, this suggests unreliable I/O on the SMBus. So even though you
have unloaded ipmi_si, which should guarantee that the Linux host isn't
accessing the chip through IPMI, I suspect that something else is still
accessing the chip behind our back. A BMC for remote management?
Correct, this version of the board has a WPCM450R BMC.
Post by Jean Delvare
Didn't you get an error message in the kernel logs related to w83795
register 0x001? This is where the driver gets the chip type from.
Hrm... looking back I see various errors for registers ranging from 0x011
through 0x046, but I don't see 0x001.
Post by Jean Delvare
I think I get what's happening. The W83795G/ADG chips have so-called
banked registers, which means that you have to select the right bank
before accessing a given register. To improve register access time, the
driver remembers the currently selected bank, and only selects a
different bank when needed. Now, if somebody else accesses the chip
behind our back, this assumption suddenly becomes wrong.
That makes sense.
Post by Jean Delvare
I could change the driver to unconditionally set the bank before any
register access, at the price of severely decreased performance.
However, even this would not completely solve the problem, as whoever
else is accessing the chip might do so between the w83795 driver
setting the bank and the w83795 driver reading (or writing) the
register value - and nothing can be done against this.
Yeah, just narrows the race window, not a fix.
Post by Jean Delvare
The bottom line is that using the W83795 driver in a multi-master I2C
setup (and I strongly suspect this is what Supermicro did) is a bad
hardware design mistake. This hardware monitoring device wasn't
designed with this use case in mind.
As this board is available with and without the BMC, I wonder if they
just don't expect people to use the W83795 if they have the BMC? That
would be fine if IPMI could control fan speed, but from what I can tell,
it can only report on it.
Post by Jean Delvare
Post by Darren Hart
hwmon2/device/pwm1
hwmon2/device/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as
it is more efficient and it reacts faster. Are you sure that
you want to setup this output for manual control? (n) y
hwmon2/device/pwm1 stuck to 125
Apr 7 08:40:48 rage kernel: [ 1617.363333] w83795 0-002f: Failed to read from register 0x023, err -6
The driver was temporarily unable to read the in19 value.
Post by Darren Hart
I then searched for the pwm controls manually and tried adjusting them.
I was able to reduce fan noise considerably by echo'ing 0 to pwm1, and I
brought it back up by echo'ing 125 to it. I didn't notice any change
Odd, this is exactly what pwmconfig is doing. It's hard to explain how
pwmconfig could consistently fail and your manual attempt worked right
away. It may not always work though?
I did find windows where they were ineffective.
Post by Jean Delvare
Post by Darren Hart
with the other pwms. Also, the fan speed as reported by sensors stayed
constant, even though they obviously had slowed down considerably.
My bet is that you don't have pwm3 to pwm8 anyway, so it's expected
they had no effect.
Post by Darren Hart
# for PWM in $(find . -name "pwm[0-8]"); do echo $PWM; echo 0 > $PWM; echo -n "Off ($(cat $PWM))..."; sleep 5; echo 125 > $PWM; echo "On ($(cat $PWM))"; done
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm1
Off (0)...On (119)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm2
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm3
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm4
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm5
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm6
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm7
Off (0)...On (0)
./devices/pci0000:00/0000:00:1f.3/i2c-0/0-002f/pwm8
Off (0)...On (0)
I ran pwmconfig again... and it didn't complain about pwm1 not entering
manual mode. It was also able to bring the fans up and shut them down
with pwm1. It did NOT detect a correlation however.
This is all consistent with my theory about random bank switches.
Agreed.
Post by Jean Delvare
Post by Darren Hart
--------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35
Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60): /usr/sbin/pwmconfig: line 923: [: -eq: unary operator expected
/usr/sbin/pwmconfig: line 949: [: -eq: unary operator expected
--------------
if [ $FAN_MIN -eq 0 ]
if [ $FAN_MIN -eq 0 ]
Your line numbers don't match mine, which means you aren't using the
latest upstream version of pwmconfig. So I can't help, sorry.
OK, I'll probably wait to hear back from SuperMicro and get back to this
next week. I'll be traveling this coming week (Embedded Linux
Conference) and will be away from the machine. If I have cause to
continue working with pwmconfig, I'll grab the latest and see about
cleaning up any remaining issues.
Post by Jean Delvare
Post by Darren Hart
FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`
FAN_MIN=`echo $fanactive_min|cut -d' ' -f$REPLY`
# dvhart debug
if [ -z "$FAN_MIN" ]; then
echo "FAN_MIN detection failed, setting to 0."
FAN_MIN=0
fi
------------
FAN_MIN detection failed, setting to 0.
------------
This certainly explains why a correlation couldn't be found. Your
workaround however is not correct. If fanactive_min has fewer elements
than expected, this means that CURRENT_SPEEDS does too, but you don't know
which ones are missing, because CURRENT_SPEEDS is a string, not an
array. We should really be using proper bash arrays for robustness, but
I simply don't have the time to work on this these days.
Overall the pwmconfig (and fancontrol) code isn't good quality, partly
because it started as an afternoon hack and has grown way too old,
partly because writing nice and efficient code in bash can be quite
challenging. I think someone posted on the lm-sensors list to announce
a rewrite in C, which might be a better starting point.
OK, good to know. This seems like a perfect candidate for Python. I like
system scripts to remain easily hackable on a running system, and C
makes that a bit harder. (I'm fine with the language, don't get me
wrong, just for system control, something like Python seems to be a
better fit). Maybe I'll look into that if we can get this driver sorted
out on my whacky board.
Post by Jean Delvare
Post by Darren Hart
------------
Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 35
Enter the high temperature (degree C)
Enter the minimum PWM value (0-255)
at which the fan STOPS spinning (press t to test) (100): t
Now we decrease the PWM value to figure out the lowest usable value.
We will use a slightly greater value as the minimum speed.
------------
After fixing that, the detection of the lowest value (where the fan
stops) ran for 30 minutes without indicating any forward progress or
making an audibly detectable change in fan speed. I tried adjusting
it manually, and was able to make several speed adjustments, finding
This suggests more problems in pwmconfig, it isn't supposed to behave
that way. But again the root cause is probably the kernel driver not
behaving in the standard way pwmconfig expects. In turn caused by the
hardware playing tricks on you.
Post by Darren Hart
48'). Before I could finish, the interface stopped responding to
/usr/sbin/pwmconfig: There are no fan-capable sensor modules installed
# sensors
w83795g-i2c-0-2f
Adapter: SMBus I801 adapter at 0400
beep_enable:enabled
Wow. Your system is very strange. I can't even think of how such an
output would be possible at all.
:-)
Post by Jean Delvare
Post by Darren Hart
Post by Jean Delvare
Does the board manual say whether the case fans are supposed to be
controllable, or only the CPU fans?
"Fan status monitor with firmware control and CPU fan auto-off in sleep mode"
"Pulse Width Modulation (PWM) Fan Control"
"The PC health monitor can check the RPM status of the cooling fans. The
onboard CPU and chassis fans are controlled by Thermal Management via BIOS
(under Hardware Monitoring in the Advanced Setting)."
I read this as: all fans should be controllable.
I'm concerned it's intended to be read as:

"BIOS controls the fans and you can see the status in the health
monitor"... hrm perhaps I need to see about running windows on a spare
drive and check out this health monitor thing. If I can reliably control
the fans with that while still using the BMC, it might bode well for
getting this to work.... now where am I going to get a windows CD... hrm...
Post by Jean Delvare
Post by Darren Hart
"The WPCM450R communicates with onboard components via six SMBus interfaces,
fan control, and Platform Environment Control Interface (PECI) buses."
This seems to be a complex setup, unfortunately the block diagram in
the manual mentions neither SMBus nor PECI.
I've asked for help from SuperMicro, we'll see if they're so inclined.
Post by Jean Delvare
Post by Darren Hart
The case fans are definitely controllable given my experiment above on pwm1.
pwm2 doesn't appear to do anything... and I'm not sure what 3-8 are supposed
to do :-)
As said before, I am certain you won't have pwm3-8 at all so they
aren't supposed to do anything.
Post by Darren Hart
Post by Jean Delvare
(...)
$ dmesg | grep 83795
[ 12.643929] i2c i2c-0: Found w83795adg rev. B at 0x2f
[ 12.883789] w83795 0-002f: PECI agent 1 Tbase temperature: 100
[ 12.903779] w83795 0-002f: PECI agent 2 Tbase temperature: 100
[ 2288.932629] w83795 0-002f: Failed to read from register 0x030, err -6
[ 2613.292773] w83795 0-002f: Failed to write to register 0x040, err -6
[ 2693.333461] w83795 0-002f: Failed to read from register 0x01e, err -11
-6 is -ENXIO, returned by the i2c-i801 driver when a slave I2C device
doesn't answer. -11 is -EAGAIN, meaning arbitration loss, which can
happen on multi-master I2C buses, and I guess IPMI is implemented
exactly that way.
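As a rough aid for reading logs like the one above, the following Python sketch counts w83795 register failures per error code. The regular expression is written against the three log lines quoted in this thread only, and the errno table holds just the two Linux values discussed here; both are assumptions of this sketch, not part of any lm-sensors tool.

```python
import re
from collections import Counter

# Linux errno values mentioned in this thread (assumed table, not exhaustive).
ERRNO_NAMES = {6: "ENXIO", 11: "EAGAIN"}

# Matches lines such as:
#   w83795 0-002f: Failed to read from register 0x030, err -6
LINE_RE = re.compile(
    r"w83795 \S+: Failed to (read from|write to) register "
    r"(0x[0-9a-fA-F]+), err -(\d+)"
)

def summarize(log_text):
    """Count w83795 register access failures per errno name."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        code = int(match.group(3))
        counts[ERRNO_NAMES.get(code, "errno %d" % code)] += 1
    return counts

sample = """\
[ 2288.932629] w83795 0-002f: Failed to read from register 0x030, err -6
[ 2613.292773] w83795 0-002f: Failed to write to register 0x040, err -6
[ 2693.333461] w83795 0-002f: Failed to read from register 0x01e, err -11
"""

for name, count in sorted(summarize(sample).items()):
    print(name, count)
```

A log dominated by ENXIO would point at the chip not answering at all, while EAGAIN entries would point at arbitration loss with another bus master.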
Am I doing something wrong?
Yes. You are using IPMI and a native Linux driver to access the same
monitoring chip. The two access methods don't know about each other and are
not synchronized.
OK, I removed the ipmi_si driver early on and am still seeing the
problems described above.
Probably caused by concurrent accesses from the BMC.
Post by Darren Hart
Post by Jean Delvare
Can I provide any additional information to
help narrow down what might be wrong?
Choose between IPMI and native drivers. If you want to use IPMI on this
board, then you have to forget about the w83795 driver. And about
software-driven fan speed control too, I'm afraid.
Does that mean all IPMI features? I'd hate to have to lose SOL and power control.
It's hard to tell what exactly IPMI is doing. Clearly if you want to
use IPMI then the w83795 driver is out IMHO, and you'll suffer from the
lack of integration between IPMI and libsensors.
I don't like that answer ;-)
Post by Jean Delvare
Post by Darren Hart
Post by Jean Delvare
Did you look for a BIOS or IPMI firmware update already?
IPMI is current.
BIOS had an update available. After hunting down a FreeDOS USB boot image, I
managed to flash it. pwmconfig is much happier now, and the sensors report
the fan speed correctly now. pwmconfig walked through the PWM:RPM mapping
for fan2_input, and all three fans dropped along with it. When it started
----------
hwmon2/device/fan4_input ... speed was 4285 now 1058
It appears that fan hwmon2/device/fan4_input
is controlled by pwm hwmon2/device/pwm1
/usr/sbin/pwmconfig: line 464: hwmon2/device: expression recursion level exceeded (error token is "device")
Testing is complete.
----------
line 464
fanactive="$(($j+${fanactive}))" #not supported yet by fancontrol
I had never seen this error message before. But I also don't have the
line above in my copy of pwmconfig either. Are you by any chance using a
packaged version with custom patches?
Possibly, just whatever is in Ubuntu 10.10. See above for my thoughts on
continuing to work with pwmconfig.
Post by Jean Delvare
Post by Darren Hart
fancontrol appears to work now as well. It appears all my fans are connected
to the same PWM control, which is pretty unfortunate, but things are MUCH
better now than they were. It appears there are a few scripting bugs in
pwmconfig (at least in my distro version) that can be corrected with
Please test the upstream version. If you find bugs in your distro
version which aren't upstream, report to them, not us. And please ask
them to push their changes upstream (if they are good) or drop them (if
not.)
Nod.
Post by Jean Delvare
Post by Darren Hart
some string checking, but the core problem appears to be a buggy BIOS -
big surprise ;-)
I don't want to bash your optimism, but... My personal impression is
that there is a severe design issue on this board, which will prevent
you from using the w83795 driver.
Understood, we'll see what SuperMicro has to say.
Post by Jean Delvare
Post by Darren Hart
I am not sure which temperature sensor to use to control pwm1. I don't trust
the temp1 input of 82C, temp5 reads 39 idle, and 7 and 8 read about 25 idle.
While the coretemp sensors read 24-29.
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +39.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +25.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +22.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp5 is the system (board) temperature, temp7 is CPU1 and temp8 is
CPU2. I would use temp5 for case fans, and temp7 for CPU fans. A
perfect fan control system would allow you to take the max or average
of multiple temperatures, but we don't support this.
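For readers wondering what "max or average of multiple temperatures" could look like, here is a small Python sketch. It is not how fancontrol works, since that feature is not supported; the function name, the linear ramp and the thresholds are all assumptions made for illustration.

```python
def pwm_for_temps(temps, min_temp, max_temp, min_pwm=0, max_pwm=255,
                  combine=max):
    """Map several temperature readings to one PWM value (0-255).

    The readings are first reduced to a single temperature with `combine`
    (max by default), then mapped linearly between min_temp and max_temp.
    """
    t = combine(temps)
    if t <= min_temp:
        return min_pwm
    if t >= max_temp:
        return max_pwm
    fraction = (t - min_temp) / (max_temp - min_temp)
    return round(min_pwm + fraction * (max_pwm - min_pwm))

def average(values):
    """Arithmetic mean, usable as an alternative `combine` function."""
    return sum(values) / len(values)

# Readings in the spirit of temp5/temp7/temp8 above; thresholds are made up.
readings = [39.0, 25.0, 22.8]
print(pwm_for_temps(readings, min_temp=35, max_temp=55, min_pwm=100))
print(pwm_for_temps(readings, min_temp=35, max_temp=55, min_pwm=100,
                    combine=average))
```

Using the maximum makes the hottest sensor drive the fan, which is the cautious choice; the average smooths out a single hot spot and would be a poor fit for a board where one component runs much hotter than the rest.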
But then again, in your case, software driven fan control seems out of
the question. Way too dangerous when you don't know if you'll be able
to access the monitoring chip the next minute. I really wish board
vendors would let people tweak the automatic fan speed control settings
in the BIOS. Asus offers several profiles, which is better than
nothing, but it would seem fair to let the user set the temperature
limits manually. Sigh.
This board has several profiles as well, and I think the original problem
(periodic absurdly loud fans) stems from the poorly cooled north bridge.
Post by Jean Delvare
Post by Darren Hart
# sensors | grep Core
Core 0: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +27.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +28.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +23.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +17.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +20.0°C (high = +81.0°C, crit = +101.0°C)
And as I'm typing this, dmesg started spewing a lot of errors and temp1-5 now report 0°C
[ 1056.545180] w83795 0-002f: Failed to write to register 0x040, err -6
[ 1056.585158] w83795 0-002f: Failed to read from register 0x041, err -6
[ 1056.605143] w83795 0-002f: Failed to read from register 0x042, err -6
[ 1056.645123] w83795 0-002f: Failed to read from register 0x043, err -6
[ 1056.685094] w83795 0-002f: Failed to read from register 0x044, err -6
[ 1056.705084] w83795 0-002f: Failed to read from register 0x045, err -6
[ 1056.745057] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1056.765044] w83795 0-002f: Failed to write to register 0x040, err -6
....
[ 1060.442767] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.482745] w83795 0-002f: Failed to set bank to 2, err -6
[ 1060.502728] w83795 0-002f: Failed to set bank to 2, err -6
...
[ 1060.702605] w83795 0-002f: Failed to read from register 0x040, err -6
[ 1060.722590] w83795 0-002f: Failed to read from register 0x046, err -6
[ 1060.762569] w83795 0-002f: Failed to write to register 0x040, err -6
...
and on for pages.
Reloading w83795 stops the messages, but the w83795 sensors don't come back.
OK, that's a ton of data, hopefully it's good data.
Oh, I suddenly have an idea what may be going on. If I'm right, it is even
worse than I thought at first.
I guess that your SMBus is multiplexed. The errors -6 (-ENXIO) mean the
W83795ADG chip is unreachable, presumably because the multiplexer was
switched to a different segment. If the multiplexer is out of the
operating system's control (as seems to be the case here) then you
really have to give up the w83795 driver, much to my despair.
So this board without the BMC option may very well work just fine. Sigh.
Post by Jean Delvare
You may be able to get the w83795 driver working again by invoking
ipmitool. If IPMI knows how to switch back to the right SMBus segment,
it may leave it selected afterwards. But anyway this is just a trick,
nothing you can rely on in the long run, as the conflict between w83795
and the BMC isn't one we can solve.
"ipmi sensor" stops reporting data once it goes AWOL as well.
Post by Jean Delvare
It might be the right time for you to ask the Supermicro support for a
detailed topology of the I2C/SMBus on this board.
Done.

Thanks Jean,
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Jean Delvare
2011-04-12 12:16:55 UTC
Hi Darren,
Post by Darren Hart
Hey Jean,
I really appreciate your thoughts here. I'll respond inline, but let me
give a summary. I've contacted SuperMicro and am hoping they'll get back
to me with a contact to help get some answers regarding how IPMI (WPCM450R)
and W83795-ADG (I checked the chip, -ADG) are supposed to interact and
still allow the OS to read temperature and control fans.
You are correct about temp1, that has to be the northbridge, it is
located right behind the PCI-E slots (which appears to be common
practice) and has a very inadequate heat sink. I'm considering replacing
it with a much more substantial heatsink and possibly adding a tunnel to
direct air over it. I've asked SuperMicro for a recommendation here as
FWIW, I was able to decrease the north bridge temperature on my own
dual-Xeon board by upgrading the front case fan from a 59 m3/h model to a
92 m3/h model. So the air flow in the case definitely matters.
Post by Darren Hart
well. If I can get that temperature down, my guess is the BIOS fan
control might be able to do a much better job and I won't need the
w83795-adg fancontrol from the OS quite so bad.
This is certainly true.
Post by Darren Hart
Post by Jean Delvare
Post by Darren Hart
(...)
$ sensors | grep °C
Core 0: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +26.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +22.0°C (high = +81.0°C, crit = +101.0°C)
temp1: +40.0°C (high = +138.0°C, hyst = +96.0°C) sensor = thermistor
temp2: -61.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp3: +36.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
(crit = +127.0°C, hyst = +127.0°C) sensor = thermal diode
temp5: +35.8°C (high = +127.0°C, hyst = +127.0°C)
(crit = +75.0°C, hyst = +70.0°C) sensor = thermistor
temp7: +24.8°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
temp8: +23.0°C (high = +95.0°C, hyst = +92.0°C)
(crit = +95.0°C, hyst = +92.0°C) sensor = Intel PECI
Core 9: +25.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 0: +24.0°C (high = +81.0°C, crit = +101.0°C)
Core 1: +21.0°C (high = +81.0°C, crit = +101.0°C)
Core 2: +20.0°C (high = +81.0°C, crit = +101.0°C)
Core 8: +15.0°C (high = +81.0°C, crit = +101.0°C)
Core 9: +22.0°C (high = +81.0°C, crit = +101.0°C)
Core 10: +19.0°C (high = +81.0°C, crit = +101.0°C)
Unrelated to your issue, but the core numbering by coretemp is
surprising. I'm curious if you see the same in /proc/cpuinfo.
No I do not. The Core ID you see above refers to physical cores per
one of the authors of coretemp about it. There appears to be some effort
ongoing to try and get those numbers to align with what is used in the
rest of the system to identify CPUs. Note that cpuinfo lists 24 CPUs due
to hyper-threading, while coretemp is only concerned with physical cores.
It's correct that the coretemp driver skips hyperthread siblings. But
the core numbering is supposed to be correct (i.e. in line
with /proc/cpuinfo) since kernel 2.6.35. And it works fine for me.
Post by Darren Hart
Post by Jean Delvare
Post by Darren Hart
(...)
I read somewhere during my hours of searching for a solution to this that
both CPU fans are controlled by the same pwm signal, so that is not
surprising. It's too bad about the case fans though, I really like to run
the larger quiet fan up before bringing up the smaller front fan, but,
it is what it is.
As you don't seem to be using the second CPU fan header, you could
cheat and plug your large rear fan in this header, so pwm1 would
control it (if we manage to get this to work at all...)
Turns out if I turn both fan housings around and flip the fans I can get
them both in the system (barely). I have it running like this for now -
but I think it's overkill really, and the CPUs don't break 40C even
under a 24 way kernel compile or four parallel 24 way poky builds.
My limited experience with similar hardware is that the CPUs don't heat
much, and you have to focus on board (mainly north bridge) cooling and
not CPU cooling.
Post by Darren Hart
Post by Jean Delvare
Didn't you get an error message in the kernel logs related to w83795
register 0x001? This is where the driver gets the chip type from.
Hrm... looking back I see various read errors on registers ranging from 0x011
through 0x46, but I don't see 0x001.
On second thought, that's possible. In case of a bank mismatch, the
driver won't even notice the problem and won't report any error. Just,
you'll get the value read from (or worse, written to) a different
register in the chip.
Post by Darren Hart
(...)
As this board is available with and without the BMC, I wonder if they
just don't expect people to use the W83795 if they have the BMC? That
Maybe, yes.
Post by Darren Hart
would be fine if IPMI could control fan speed, but from what I can tell,
it can only report on it.
I'm not familiar with IPMI, sorry, but indeed I've never heard of fan
speed control being done this way.

But then again, if vendors would just let us select thermal trip points
for fan speed control in the BIOS, I think we could live without fan
control support on the OS side. Sigh.
--
Jean Delvare
http://khali.linux-fr.org/wishlist.html
Darren Hart
2011-04-15 05:04:58 UTC
Post by Jean Delvare
The bottom line is that using the W83795 driver in a multi-master I2C
setup (and I strongly suspect this is what Supermicro did) is a bad
hardware design mistake. This hardware monitoring device wasn't
designed with this use case in mind.
Super Micro responded:
"
Do you have an extra fan blowing air toward the northbridge
heatsink. The temperature on northbridage heatsink must less than 75
degree. Adding an extra fan which will help solve your issue.
We are not recommend user using lmsensor on X8DTL-IF. It will
cause system crash due to lmsensor and our IPMI program getting
information from BIOS at the same time and collide on each other. It
won't happen immediately but definitely will happen in random time.
"

It's a bit broken, but it sounds like they are confirming your theory.

As an experiment I removed the CPU2 fan and pointed it directly at the
Intel 5520 chipset (technically not a Northbridge as it turns out...
just ignore that intel.com in my email address, it means nothing ;-) and
while I haven't been able to measure the temp1 reading from the w83795
driver since my return, the fans no longer ramp up to 4k rpm and the
chip is cool to the touch.

I'm seeking the recommended solution from Super Micro, failing that,
I'll have to resort to chassis modding.... I thought that was for the
overclocking-acrylic-window-neon-lights crowd.... sigh.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Darren Hart
2011-04-15 05:30:53 UTC
Post by Darren Hart
Post by Jean Delvare
The bottom line is that using the W83795 driver in a multi-master I2C
setup (and I strongly suspect this is what Supermicro did) is a bad
hardware design mistake. This hardware monitoring device wasn't
designed with this use case in mind.
"
Do you have an extra fan blowing air toward the northbridge
heatsink. The temperature on northbridage heatsink must less than 75
degree. Adding an extra fan which will help solve your issue.
We are not recommend user using lmsensor on X8DTL-IF. It will
cause system crash due to lmsensor and our IPMI program getting
information from BIOS at the same time and collide on each other. It
won't happen immediately but definitely will happen in random time.
"
It's a bit broken, but it sounds like they are confirming your theory.
As an experiment I removed the CPU2 fan and pointed it directly at the
Intel 5520 chipset (technically not a Northbridge as it turns out...
just ignore that intel.com in my email address, it means nothing ;-) and
while I haven't been able to measure the temp1 reading from the w83795
driver since my return, the fans no longer ramp up to 4k rpm and the
chip is cool to the touch.
I'm seeking the recommended solution from Super Micro, failing that,
I'll have to resort to chassis modding.... I thought that was for the
overclocking-acrylic-window-neon-lights crowd.... sigh.
This is turning into a support issue for Super Micro, but I thought I'd
post the following for completeness.

After trying a different kernel, I was able to get readings from the
w83795 again. I applied the fan to the chipset until it reached its
lowest point (52.5C while idle). I then positioned the fan away from the
chipset and watched the temperature rise until it reached 84.5C and the
fans sped up to > 4000RPM.

FAN 1 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 2 | 0.000 | RPM | nr | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 4 | 4356.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 3969.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000


Given that the system is idle, and Super Micro stated the chipset
should not exceed 75C, and I have no obstructions in the case and no
expansion boards to add heat, something appears to be wrong.

Here is an annotated log of the experiment, one reading every 10 seconds:

dvhart@rage:~$ while true; do sensors w83795g-i2c-0-2f | grep temp1; sleep 10; done
temp1: +61.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +59.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +57.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +56.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +55.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +54.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +53.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +52.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +52.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +53.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +55.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +56.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +58.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +60.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +61.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +63.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +64.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +65.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +66.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +67.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +68.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +69.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +70.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +71.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +72.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +72.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +73.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +74.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +75.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +76.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +78.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +78.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)

Fan speed jumped up at this point:

FAN 1 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 2 | 0.000 | RPM | nr | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 4 | 4356.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 3969.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000

And stayed at high speed until:

temp1: +79.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)

And sped up again here.
And so on.
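For anyone wanting to repeat this experiment, a log like the one above can be summarized with a few lines of Python. This is only a sketch: the function names are made up, the 10-second interval is the one used in the shell loop above, and the sample holds just three lines picked from the log.

```python
import re

# Capture the reading that directly follows "temp1:"; the limits inside the
# parentheses are ignored, as is whatever degree symbol follows the number.
TEMP_RE = re.compile(r"temp1:\s*([+-]?\d+(?:\.\d+)?)")

def temp1_readings(text):
    """Return the temp1 readings found in `sensors` output, in order."""
    return [float(m.group(1)) for m in TEMP_RE.finditer(text)]

def summarize(readings, interval_s=10):
    """Return (lowest, highest, seconds covered by the log)."""
    return min(readings), max(readings), (len(readings) - 1) * interval_s

sample = """\
temp1: +61.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +52.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
"""

low, high, seconds = summarize(temp1_readings(sample))
print("lowest %.1f, highest %.1f, over %d s" % (low, high, seconds))
```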
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Jean Delvare
2011-04-15 07:59:33 UTC
Post by Darren Hart
After trying a different kernel, I was able to get readings from the
w83795 again. I applied the fan to the chipset until it reached its
lowest point (52.5C while idle). I then positioned the fan away from the
chipset and watched the temperature rise until it reached 84.5C and the
fans sped up to > 4000RPM.
FAN 1 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 2 | 0.000 | RPM | nr | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 4 | 4356.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 3969.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
Given that the system is idle, and Super Micro stated the chipset
should not exceed 75C, and I have no obstructions in the case and no
expansion boards to add heat, something appears to be wrong.
(...)
And sped up again here.
And so on.
The W83795ADG can be programmed to switch fans to full speed when
certain temperature limits are exceeded. The driver doesn't currently
expose these settings, but my guess is that's what you're seeing.
According to the datasheet, the default value for temperature limit
registers for this mechanism is 0x50, that is... 80°C.
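The arithmetic behind that figure, as a small Python sketch. It assumes the register holds whole degrees Celsius, one degree per bit, which is what reading 0x50 as 80 implies; the function name is made up and negative temperatures are not handled.

```python
def limit_register_to_celsius(raw):
    """Decode an 8-bit limit register, assuming 1 degree Celsius per bit.

    Assumption of this sketch: the value is treated as a plain unsigned
    byte, so negative temperatures are not handled.
    """
    if not 0 <= raw <= 0xFF:
        raise ValueError("not an 8-bit register value: %r" % (raw,))
    return raw

# Default quoted from the datasheet in the message above.
DEFAULT_LIMIT_RAW = 0x50
limit = limit_register_to_celsius(DEFAULT_LIMIT_RAW)

# A few temp1 readings taken from the log earlier in this thread.
for reading in (79.5, 80.8, 84.5):
    print("%.1f above %d degrees C limit: %s" % (reading, limit, reading > limit))
```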
--
Jean Delvare
Darren Hart
2011-04-15 14:11:31 UTC
Post by Jean Delvare
Post by Darren Hart
After trying a different kernel, I was able to get readings from the
w83795 again. I applied the fan to the chipset until it reached its
lowest point (52.5C while idle). I then positioned the fan away from the
chipset and watched the temperature rise until it reached 84.5C and the
fans sped up to > 4000RPM.
FAN 1 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 2 | 0.000 | RPM | nr | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 4 | 4356.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 3969.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
Given that the system is idle, and Super Micro stated the chipset
should not exceed 75C, and I have no obstructions in the case and no
expansion boards to add heat, something appears to be wrong.
dvhart at rage:~$ while true; do sensors w83795g-i2c-0-2f | grep temp1;
sleep 10; done
temp1: +61.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +59.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +57.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +56.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +55.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +54.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +53.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +52.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +52.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +53.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +55.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +56.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +58.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +60.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +61.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +63.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +64.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +65.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +66.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +67.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +68.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +69.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +70.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +71.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +72.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +72.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +73.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +74.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +75.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +75.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +76.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +77.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +78.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +78.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +82.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +83.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)
FAN 1 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 2 | 0.000 | RPM | nr | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 3 | 2401.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 4 | 4356.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
FAN 5 | 3969.000 | RPM | ok | 400.000 | 576.000 | 784.000 | 33856.000 | 34225.000 | 34596.000
temp1: +79.2°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.8°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +79.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.0°C (high = +127.0°C, hyst = +127.0°C)
temp1: +80.5°C (high = +127.0°C, hyst = +127.0°C)
temp1: +81.0°C (high = +127.0°C, hyst = +127.0°C)
And sped up again here.
And so on.
The W83795ADG can be programmed to switch fans to full speed when
certain temperature limits are exceeded. The driver doesn't currently
expose these settings, but my guess is that's what you're seeing.
According to the datasheet, the default value for temperature limit
registers for this mechanism is 0x50, that is... 80°C.
Which is consistent with Super Micro saying the chipset must remain
below 75C. A fan brings it down 30C. I'm looking into adding a fan or
replacing the heat sink, or both. With that, I'll be giving up on
fancontrol for this machine - since it behaves itself just fine when the
chipset isn't overheating.
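
For anyone who wants to check the arithmetic quoted above (0x50 read as
a plain byte is 80), here is a minimal shell sketch. It assumes the
limit register counts whole degrees Celsius, one per bit, which is only
inferred from the "0x50, that is... 80" remark and not checked against
the datasheet. The helper names reg_to_celsius, temp1_degrees and
over_limit are made up for this example; they are not part of
lm-sensors or the w83795 driver.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of lm-sensors.

# Convert a raw limit-register byte (e.g. 0x50) to degrees Celsius.
# Assumption: 1 LSB = 1 degree C, as implied by 0x50 -> 80.
reg_to_celsius() {
    printf '%d\n' "$1"
}

# Return the whole-degree part of one temp1 line of `sensors` output.
# Assumption: the reading is positive and printed as "+NN.N".
temp1_degrees() {
    t=${1#*+}                  # drop everything up to the first '+'
    printf '%s\n' "${t%%.*}"   # drop the fraction and the rest of the line
}

# Succeed if the reading on the line has reached the register limit.
over_limit() {
    [ "$(temp1_degrees "$1")" -ge "$(reg_to_celsius "$2")" ]
}

# Sample line copied from the log in this thread.
line='temp1: +84.5°C (high = +127.0°C, hyst = +127.0°C)'
if over_limit "$line" 0x50; then
    echo "temp1 has reached the $(reg_to_celsius 0x50) degree default limit"
fi
```

To use it against live data, the sample line could be replaced with the
output of the "sensors w83795g-i2c-0-2f | grep temp1" command shown
earlier. Whether the chip is actually programmed with the 0x50 default
on this board is still a guess, since the driver does not expose those
registers.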
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel