Entry time:
Sat Mar 29 11:28:54 2025
> 24:00 Joined in with TD to debug the dropout of AIDA05 - See https://elog.ph.ed.ac.uk/DESPEC/193
> After both powercycles and telnet reboots we were still seeing 0 in the statistics, but did note that the counter was greater than 0.
> The last check we did was a restart of the merger server. This solved the issue.
> It is likely that the link to the FEE was dropped but not re-established upon the resets.
> There was no error message to indicate this. From now on I recommend refreshing the statistics every 30 minutes.
>
> 01:29 System wide check
> N.B. May not have reset baselines after reset
> WR fault
>                 Base     Current  Difference
> aida01 fault    0x1ad  : 0x1af  : 2
> aida02 fault    0x4434 : 0x4436 : 2
> aida03 fault    0xcf95 : 0xcf99 : 4
> aida04 fault    0x25aa : 0x25ae : 4
> aida05 fault    0x4b4  : 0x4b7  : 3
> aida06 fault    0x834  : 0x837  : 3
> aida07 fault    0x146d : 0x1472 : 5
> aida08 fault    0xf4a8 : 0xf4ab : 3
> aida09 fault    0x4f6f : 0x4f73 : 4
> aida10 fault    0x7504 : 0x7506 : 2
> aida11 fault    0x26fa : 0x26fc : 2
> aida12 fault    0x2858 : 0x285b : 3
> White Rabbit error counter test result: Passed 0, Failed 12
>
> Understand the status reports as follows:-
> Status bit 3 : White Rabbit decoder detected an error in the received data
> Status bit 2 : Firmware registered WR error, no reload of Timestamp
> Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
>
> FPGA Check
>                 Base     Current  Difference
> aida12 fault    0x0    : 0x4    : 4
> FPGA Timestamp error counter test result: Passed 11, Failed 1
> If any of these counts are reported as in error
> The ASIC readout system has detected a timeslip.
> That is the timestamp read from the time FIFO is not younger than the last
>
> 01:42 Statistics - attachment 1
> Temperature - attachment 2
> Bias and leakage current - attachment 3
>
> 03:57 System wide checks
>
>                 Base     Current  Difference
> aida07 fault    0x1472 : 0x1473 : 1
> White Rabbit error counter test result: Passed 11, Failed 1
>
> Understand the status reports as follows:-
> Status bit 3 : White Rabbit decoder detected an error in the received data
> Status bit 2 : Firmware registered WR error, no reload of Timestamp
> Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
>
>                 Base     Current  Difference
> aida12 fault    0x4    : 0xc    : 8
> FPGA Timestamp error counter test result: Passed 11, Failed 1
> If any of these counts are reported as in error
> The ASIC readout system has detected a timeslip.
> That is the timestamp read from the time FIFO is not younger than the last
>
> Statistics - attachment 4
> Temp - attachment 5
> Bias - attachment 6
>
> 05:27 System wide checks
>
>                 Base     Current  Difference
> aida07 fault    0x1472 : 0x1474 : 2
> White Rabbit error counter test result: Passed 11, Failed 1
>
> Understand the status reports as follows:-
> Status bit 3 : White Rabbit decoder detected an error in the received data
> Status bit 2 : Firmware registered WR error, no reload of Timestamp
> Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
>
>                 Base     Current  Difference
> aida07 fault    0x0    : 0x1    : 1
> aida12 fault    0x4    : 0x16   : 18
> FPGA Timestamp error counter test result: Passed 10, Failed 2
> If any of these counts are reported as in error
> The ASIC readout system has detected a timeslip.
> That is the timestamp read from the time FIFO is not younger than the last
>
> Statistics - attachment 7
> Temp - attachment 8
> Bias - attachment 9
>
> 05:51 Looking into the bad timestamp messages in the merger e.g.
> MERGE Data Link (28348): bad timestamp 5 3 0xc14f8415 0x00eea280 0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80
>
> Looking in the merger source for the bad timestamp message:
>
>     if (op->Time < LastTimeStamp) {
>         // invalid time stamp
>         (*StatsMem[TSSEQERR])++;
>         (*StatsMem[TSSEQERR+((LinkNum+1)*MAXCOUNTERS)])++;
>         sprintf(message_buffer, "bad timestamp %d %d 0x%08lx 0x%08lx 0x%016llx 0x%016llx 0x%016llx",LinkNum, link_table[LinkNum]->link_state, op->Data, op->Timestamp, INFO4, op->Time, LastTimeStamp);
>         report_message(MSG_WARNING);            /***************************/
>         // LastTimeStamp = op->Time;
>
> So the message is generated when the merger detects a timewarp.
>
> Took the first warning in a block of errors (the first instance of that particular `LastTimeStamp`) and calculated the time difference between the new timestamp and the last timestamp.
> The time and LastTimeStamp were 0x166bccef3f424a50 and 0x166bccef4f4232e0 respectively.
> The difference between them is 268429456, i.e. about 268 ms, which seems quite large.
>
> 06:39 AIDA Crashed
> This time all FEEs were responsive but not showing any stats
> When stopping the DAQ all FEEs except aida06 stopped - attachment 10
> Did a reset of the DAQ and all recovered but no stats on aida06 - attachment 11
> Regained DAQ with a powercycle and a complete restart of the AIDA:8115 Merger and TapeServer
>
> It is worth noting that aida06 is connected to link 5, the data link which had been producing the bad merge messages overnight.
> We have now had it in aida05, aida06 and aida07. Could it be to do with the correlation scaler rate going into these FEEs?
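As a cross-check of the 05:51 arithmetic, here is a minimal Python sketch that splits a "bad timestamp" message into its fields and computes how far the new timestamp steps back from `LastTimeStamp`. The field order is taken from the `sprintf` in the quoted merger source. The names `parse_bad_timestamp` and `step_back` are hypothetical, and converting ticks to time assumes 1 ns per tick, as the 268 ms figure in the entry implies.

```python
import re

# Field order follows the sprintf in the quoted merger source:
# LinkNum, link_state, op->Data, op->Timestamp, INFO4, op->Time, LastTimeStamp
BAD_TS_RE = re.compile(
    r"bad timestamp (\d+) (\d+) 0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+) "
    r"0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+)"
)


def parse_bad_timestamp(line):
    """Return the message fields as a dict, or None if the line does not match."""
    m = BAD_TS_RE.search(line)
    if m is None:
        return None
    link, state = int(m.group(1)), int(m.group(2))
    data, timestamp, info4, t_new, t_last = (int(g, 16) for g in m.groups()[2:])
    return {
        "link": link,
        "state": state,
        "data": data,
        "timestamp": timestamp,
        "info4": info4,
        "time": t_new,
        "last": t_last,
        # Positive when the new timestamp is older than the previous one.
        "step_back": t_last - t_new,
    }


# Message quoted in the entry.
line = ("MERGE Data Link (28348): bad timestamp 5 3 0xc14f8415 0x00eea280 "
        "0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80")
rec = parse_bad_timestamp(line)
print(rec["link"], rec["step_back"])

# Pair of values worked through by hand in the entry.
diff = 0x166bccef4f4232e0 - 0x166bccef3f424a50
print(diff, diff / 1e6, "ms (assuming 1 ns ticks)")
print(2**28 - diff)
```

For the hand-worked pair this reproduces 268429456 ticks. That value is 6000 ticks short of 2^28; whether this is significant is not established here.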
>
> Going through /var/log/messages on aida-3, aida06 rebooted itself at 06:37:
> Mar 13 06:37:16 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:918 for /home/Embedded/XilinxLinux/ppc_4xx/rfs/aida06 (/home/Embedded/XilinxLinux/ppc_4xx/rfs)
> Mar 13 06:37:18 aidas-gsi xinetd[4578]: START: time-stream pid=0 from=::ffff:192.168.11.6
> Mar 13 06:37:32 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:862 for /home/npg/MIDAS_Releases/23Jan19/MIDAS_200119 (/home/npg/MIDAS_Releases/23Jan19/MIDAS_200119)
>
> Looking in /var/log/messages on aida06, there is no evidence of a reason why:
> Mar 12 23:30:56 aida06 kernel: Trying to free nonexistent resource <0000000007000000-0000000007ffffff>
> Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: mem region start 0x7000000 for 0x1000000 mapped at 0xd2380000
> Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: driver assigned major number 253
> Mar 12 23:31:13 aida06 kernel: xaida: open:
> Mar 12 23:31:14 aida06 kernel: AIDAMEM: aidamem_open:
> Mar 12 23:33:06 aida06 kernel: xaida: open:
> Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.
> Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.
> Mar 13 05:37:30 aida06 kernel: Using Xilinx Virtex440 machine description
> Mar 13 05:37:30 aida06 kernel: Linux version 2.6.31 (nf@nnlxb.dl.ac.uk) (gcc version 4.2.2) #34 PREEMPT Tue Nov 15 15:57:04 GMT 2011
> Mar 13 05:37:30 aida06 kernel: Zone PFN ranges:
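To find FEE reboots without reading the whole of /var/log/messages by eye, one option is to search for the syslogd start-up line seen in the aida06 excerpt. This is a minimal sketch: `find_restarts` and `RESTART_RE` are hypothetical names, and it assumes every reboot writes a `syslogd <version>: restart.` line in the same format as the one quoted above. The sample lines are copied from the excerpt.

```python
import re

# Matches the syslogd start-up line that appears after a (re)boot, e.g.
# "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart."
RESTART_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"syslogd\s+\S+:\s+restart\."
)


def find_restarts(lines):
    """Return a list of (timestamp, host) for every syslogd restart line."""
    hits = []
    for line in lines:
        m = RESTART_RE.match(line.strip())
        if m:
            hits.append((m.group("stamp"), m.group("host")))
    return hits


sample = [
    "Mar 12 23:33:06 aida06 kernel: xaida: open:",
    "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.",
    "Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.",
]
restarts = find_restarts(sample)
print(restarts)
```

The restart line on aida06 is stamped 05:37:30, while the server-side mount request is logged at 06:37:16. The sketch reports timestamps as written and does not attempt to reconcile the two clocks.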