24:00 Joined in with TD to debug the dropout of AIDA05 - See https://elog.ph.ed.ac.uk/DESPEC/193
After both powercycles and telnet reboots we were still seeing 0 in the statistics but did note that the counter was > 0
The last check we did was a restart of the merger server. This solved the issue.
It is likely that the link to the FEE was dropped but not re-established upon the resets.
No error message was raised to indicate this. From now on I recommend refreshing the statistics every 30 minutes.
01:29 System wide check
N.B. May not have reset baselines after reset
WR fault
Base Current Difference
aida01 fault 0x1ad : 0x1af : 2
aida02 fault 0x4434 : 0x4436 : 2
aida03 fault 0xcf95 : 0xcf99 : 4
aida04 fault 0x25aa : 0x25ae : 4
aida05 fault 0x4b4 : 0x4b7 : 3
aida06 fault 0x834 : 0x837 : 3
aida07 fault 0x146d : 0x1472 : 5
aida08 fault 0xf4a8 : 0xf4ab : 3
aida09 fault 0x4f6f : 0x4f73 : 4
aida10 fault 0x7504 : 0x7506 : 2
aida11 fault 0x26fa : 0x26fc : 2
aida12 fault 0x2858 : 0x285b : 3
White Rabbit error counter test result: Passed 0, Failed 12
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
FPGA Check
Base Current Difference
aida12 fault 0x0 : 0x4 : 4
FPGA Timestamp error counter test result: Passed 11, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
01:42 Statistics - attachment 1
Temperature - attachment 2
Bias and leakage current - attachment 3
03:57 System wide checks
Base Current Difference
aida07 fault 0x1472 : 0x1473 : 1
White Rabbit error counter test result: Passed 11, Failed 1
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida12 fault 0x4 : 0xc : 8
FPGA Timestamp error counter test result: Passed 11, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
Statistics - attachment 4
Temp - attachment 5
Bias - attachment 6
05:27 System wide checks
Base Current Difference
aida07 fault 0x1472 : 0x1474 : 2
White Rabbit error counter test result: Passed 11, Failed 1
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida07 fault 0x0 : 0x1 : 1
aida12 fault 0x4 : 0x16 : 18
FPGA Timestamp error counter test result: Passed 10, Failed 2
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
Statistics - attachment 7
Temp - attachment 8
Bias - attachment 9
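The Base / Current / Difference lines in the checks above can be cross-checked mechanically. The following is a minimal sketch, assuming that Base and Current are hexadecimal and Difference is decimal (consistent with 0x16 - 0x4 = 18 in the 05:27 check). The function name `parse_fault_line` is illustrative and is not part of the check script.

```python
def parse_fault_line(line):
    """Split 'aidaNN fault BASE : CURRENT : DIFF' into (module, base, current, diff)."""
    name_part, current_part, diff_part = (p.strip() for p in line.split(":"))
    module, _fault, base_hex = name_part.split()
    # Assumption: base and current are hex, the difference column is decimal.
    return module, int(base_hex, 16), int(current_part, 16), int(diff_part)


# Lines copied from the 05:27 system wide check.
lines = [
    "aida07 fault 0x1472 : 0x1474 : 2",
    "aida12 fault 0x4 : 0x16 : 18",
]

for line in lines:
    module, base, current, diff = parse_fault_line(line)
    status = "ok" if current - base == diff else "MISMATCH"
    print(f"{module}: base={base} current={current} diff={diff} {status}")
```

Running the same function over the lines from successive checks gives a quick view of how fast a counter is growing between checks.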
05:51 Looking into the bad timestamp messages in the merger e.g.
MERGE Data Link (28348): bad timestamp 5 3 0xc14f8415 0x00eea280 0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80
Looking in the merger source for the bad timestamp message:
    if (op->Time < LastTimeStamp) {
        // invalid time stamp
        (*StatsMem[TSSEQERR])++;
        (*StatsMem[TSSEQERR+((LinkNum+1)*MAXCOUNTERS)])++;
        sprintf(message_buffer, "bad timestamp %d %d 0x%08lx 0x%08lx 0x%016llx 0x%016llx 0x%016llx",LinkNum, link_table[LinkNum]->link_state, op->Data, op->Timestamp, INFO4, op->Time, LastTimeStamp);
        report_message(MSG_WARNING); /***************************/
        // LastTimeStamp = op->Time;
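So the message is generated when the merger detects a timewarp, i.e. an item whose Time is earlier than LastTimeStamp. Within this excerpt the update of LastTimeStamp is commented out, so an offending item leaves LastTimeStamp unchanged.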
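To work through many of these warnings it helps to pull the fields out of the message text. The following is a minimal sketch that follows the field order of the sprintf format string in the excerpt (LinkNum, link_state, Data, Timestamp, INFO4, Time, LastTimeStamp). The function name, the dictionary keys and the derived "behind" value are illustrative and are not part of the merger.

```python
import re

# Field order follows the sprintf format string in the merger excerpt.
BAD_TS = re.compile(
    r"bad timestamp (\d+) (\d+) "
    r"0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+) "
    r"0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+)"
)


def parse_bad_timestamp(line):
    """Return the fields of a 'bad timestamp' message as a dict, or None."""
    m = BAD_TS.search(line)
    if m is None:
        return None
    link, state = int(m.group(1)), int(m.group(2))
    data, timestamp, info4, time, last = (int(g, 16) for g in m.groups()[2:])
    return {
        "link": link, "link_state": state, "data": data,
        "timestamp": timestamp, "info4": info4,
        "time": time, "last": last,
        # How far the new item lies behind LastTimeStamp, in timestamp units.
        "behind": last - time,
    }


# The example message quoted in this entry.
example = ("MERGE Data Link (28348): bad timestamp 5 3 0xc14f8415 0x00eea280 "
           "0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80")
msg = parse_bad_timestamp(example)
print(msg["link"], hex(msg["time"]), hex(msg["last"]), msg["behind"])
```

Grouping the parsed messages by "last" would pick out each block of warnings that shares one LastTimeStamp.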
Took the first warning in a data block of errors (the first instance of that particular `LastTimeStamp`) and calculated the time difference between the new timestamp and the last timestamp.
The Time and LastTimeStamp were 0x166bccef3f424a50 and 0x166bccef4f4232e0 respectively.
The time difference between them is 268429456 (0xfffe890), which is 268 ms taking one timestamp unit as 1 ns. This seems quite a large difference.
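The subtraction can be reproduced in a few lines. This is a minimal sketch, assuming (as the 268 ms figure implies) that one timestamp tick is 1 ns. The variable names are illustrative and are not taken from the merger source.

```python
# Values quoted in this entry.
time_stamp = 0x166bccef3f424a50       # Time of the offending item
last_time_stamp = 0x166bccef4f4232e0  # LastTimeStamp held by the merger

# How far the new timestamp lies behind the last one.
diff_ticks = last_time_stamp - time_stamp

# Assumption: 1 tick = 1 ns.
NS_PER_TICK = 1
diff_ms = diff_ticks * NS_PER_TICK / 1e6

print(f"difference = {diff_ticks} ticks (0x{diff_ticks:x}), about {diff_ms:.1f} ms")
print(f"2**28 - difference = {2**28 - diff_ticks} ticks")
```

The difference comes out 6000 ticks short of 2**28, which may be worth bearing in mind when looking for the cause.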
06:39 AIDA Crashed
This time all FEEs were responsive but not showing any stats
When stopping the DAQ all FEEs except aida06 stopped - attachment 10
Did a reset of the DAQ and all recovered but no stats on aida06 - attachment 11
Regained DAQ with a powercycle and a complete restart of the AIDA:8115 Merger and TapeServer
It is worth noting that aida06 is connected to link 5, the data link which had been producing the bad merge messages overnight.
We have now had it in aida05, aida06 and aida07. Could it be to do with the correlation scaler rate going into these FEEs?
Going through /var/log/messages on aida-3: aida06 rebooted itself at 06:37
Mar 13 06:37:16 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:918 for /home/Embedded/XilinxLinux/ppc_4xx/rfs/aida06 (/home/Embedded/XilinxLinux/ppc_4xx/rfs)
Mar 13 06:37:18 aidas-gsi xinetd[4578]: START: time-stream pid=0 from=::ffff:192.168.11.6
Mar 13 06:37:32 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:862 for /home/npg/MIDAS_Releases/23Jan19/MIDAS_200119 (/home/npg/MIDAS_Releases/23Jan19/MIDAS_200119)
Looking in /var/log/messages on aida06 there is no evidence of a reason why:
Mar 12 23:30:56 aida06 kernel: Trying to free nonexistent resource <0000000007000000-0000000007ffffff>
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: mem region start 0x7000000 for 0x1000000 mapped at 0xd2380000
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: driver assigned major number 253
Mar 12 23:31:13 aida06 kernel: xaida: open:
Mar 12 23:31:14 aida06 kernel: AIDAMEM: aidamem_open:
Mar 12 23:33:06 aida06 kernel: xaida: open:
Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.
Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.
Mar 13 05:37:30 aida06 kernel: Using Xilinx Virtex440 machine description
Mar 13 05:37:30 aida06 kernel: Linux version 2.6.31 (nf@nnlxb.dl.ac.uk) (gcc version 4.2.2) #34 PREEMPT Tue Nov 15 15:57:04 GMT 2011
Mar 13 05:37:30 aida06 kernel: Zone PFN ranges: |