Entry time:
Fri Mar 12 23:56:00 2021
24:00 Joined in with TD to debug the dropout of AIDA05 - see https://elog.ph.ed.ac.uk/DESPEC/193
After both powercycles and telnet reboots we were still seeing 0 in the statistics, but did note that the counter was > 0.
The last check we did was a restart of the merger server. This solved the issue.
It is likely that the link to the FEE was dropped but not re-established upon the resets. No error message was reported for this.
From now on I recommend refreshing the statistics every 30 minutes.

01:29 System wide check
N.B. May not have reset baselines after reset

WR fault
                Base       Current    Difference
aida01 fault    0x1ad    : 0x1af    : 2
aida02 fault    0x4434   : 0x4436   : 2
aida03 fault    0xcf95   : 0xcf99   : 4
aida04 fault    0x25aa   : 0x25ae   : 4
aida05 fault    0x4b4    : 0x4b7    : 3
aida06 fault    0x834    : 0x837    : 3
aida07 fault    0x146d   : 0x1472   : 5
aida08 fault    0xf4a8   : 0xf4ab   : 3
aida09 fault    0x4f6f   : 0x4f73   : 4
aida10 fault    0x7504   : 0x7506   : 2
aida11 fault    0x26fa   : 0x26fc   : 2
aida12 fault    0x2858   : 0x285b   : 3
White Rabbit error counter test result: Passed 0, Failed 12

Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR

FPGA Check
                Base       Current    Difference
aida12 fault    0x0      : 0x4      : 4
FPGA Timestamp error counter test result: Passed 11, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last

01:42
Statistics - attachment 1
Temperature - attachment 2
Bias and leakage current - attachment 3

03:57 System wide checks

                Base       Current    Difference
aida07 fault    0x1472   : 0x1473   : 1
White Rabbit error counter test result: Passed 11, Failed 1

Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR

                Base       Current    Difference
aida12 fault    0x4      : 0xc      : 8
FPGA Timestamp error counter test result: Passed 11, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last

Statistics - attachment 4
Temp - attachment 6
Bias - attachment 5

05:27 System wide checks

                Base       Current    Difference
aida07 fault    0x1472   : 0x1474   : 2
White Rabbit error counter test result: Passed 11, Failed 1

Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR

                Base       Current    Difference
aida07 fault    0x0      : 0x1      : 1
aida12 fault    0x4      : 0x16     : 18
FPGA Timestamp error counter test result: Passed 10, Failed 2
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last

Statistics - attachment 7
Temp - attachment 8
Bias - attachment 9

05:51 Looking into the bad timestamp messages in the merger, e.g.
MERGE Data Link (28348): bad timestamp 5 3 0xc14f8415 0x00eea280 0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80

Looking in the merger source for the bad timestamp message:

    if (op->Time < LastTimeStamp) { // invalid time stamp
        (*StatsMem[TSSEQERR])++;
        (*StatsMem[TSSEQERR+((LinkNum+1)*MAXCOUNTERS)])++;
        sprintf(message_buffer, "bad timestamp %d %d 0x%08lx 0x%08lx 0x%016llx 0x%016llx 0x%016llx",
                LinkNum, link_table[LinkNum]->link_state, op->Data, op->Timestamp, INFO4, op->Time, LastTimeStamp);
        report_message(MSG_WARNING);
        /***************************/
        // LastTimeStamp = op->Time;

So the message is generated when the merger detects a timewarp, i.e. an item whose time is earlier than the last timestamp seen.
Took the first warning in a block of errors (the first instance of that particular `LastTimeStamp`) and calculated the time difference between the new timestamp and the last timestamp.
The Time and LastTimeStamp were 0x166bccef3f424a50 and 0x166bccef4f4232e0 respectively.
The time difference between them is 268429456, which is 268 ms. That seems quite a large difference.

06:39 AIDA Crashed
This time all FEEs were responsive but not showing any stats.
When stopping the DAQ all FEEs except aida06 stopped - attachment 10
Did a reset of the DAQ and all recovered but no stats on aida06 - attachment 11
Regained DAQ with a powercycle and a complete restart of the AIDA:8115 Merger and TapeServer.
It is worth noting that aida06 is connected to link 5, the data link which had been producing the bad merge messages overnight.
We have now had it in aida05, aida06 and aida07. Could it be to do with the correlation scaler rate going into these FEEs?
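The 05:51 timestamp arithmetic can be re-checked with a short script. This is only a sketch: the function name `timewarp` is made up for illustration, and the conversion to milliseconds assumes one timestamp tick is 1 ns, which is what the 268 ms figure above implies but is not stated in the merger source excerpt. It also applies the same subtraction to the example message quoted above.

```python
def timewarp(time_hex: str, last_hex: str) -> int:
    """Ticks by which the new item lies behind the last timestamp seen."""
    return int(last_hex, 16) - int(time_hex, 16)


NS_PER_MS = 1_000_000  # assumption: one timestamp tick = 1 ns

# First warning in the block of errors (Time, LastTimeStamp from the entry).
first_warning = timewarp("0x166bccef3f424a50", "0x166bccef4f4232e0")

# Example message quoted in the entry (last two fields: Time, LastTimeStamp).
example_message = timewarp("0x166bcd2e70eea280", "0x166bcd2e716c1f80")

for label, ticks in (("first warning", first_warning),
                     ("example message", example_message)):
    print(f"{label}: {ticks} ticks = {ticks / NS_PER_MS:.3f} ms")

# How far the first difference is from 2**28 ticks.
print("2**28 - first warning =", (1 << 28) - first_warning)
```

The first difference comes out 6000 ticks short of 2^28 = 268435456. Whether that is significant (for example, related to a timestamp field width) is not established by anything in this entry and would need checking against the merger source.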
Going through /var/log/messages on aida-3: aida06 rebooted itself at 06:37

Mar 13 06:37:16 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:918 for /home/Embedded/XilinxLinux/ppc_4xx/rfs/aida06 (/home/Embedded/XilinxLinux/ppc_4xx/rfs)
Mar 13 06:37:18 aidas-gsi xinetd[4578]: START: time-stream pid=0 from=::ffff:192.168.11.6
Mar 13 06:37:32 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:862 for /home/npg/MIDAS_Releases/23Jan19/MIDAS_200119 (/home/npg/MIDAS_Releases/23Jan19/MIDAS_200119)

Looking in /var/log/messages on aida06, there is no evidence of a reason why:

Mar 12 23:30:56 aida06 kernel: Trying to free nonexistent resource <0000000007000000-0000000007ffffff>
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: mem region start 0x7000000 for 0x1000000 mapped at 0xd2380000
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: driver assigned major number 253
Mar 12 23:31:13 aida06 kernel: xaida: open:
Mar 12 23:31:14 aida06 kernel: AIDAMEM: aidamem_open:
Mar 12 23:33:06 aida06 kernel: xaida: open:
Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.
Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.
Mar 13 05:37:30 aida06 kernel: Using Xilinx Virtex440 machine description
Mar 13 05:37:30 aida06 kernel: Linux version 2.6.31 (nf@nnlxb.dl.ac.uk) (gcc version 4.2.2) #34 PREEMPT Tue Nov 15 15:57:04 GMT 2011
Mar 13 05:37:30 aida06 kernel: Zone PFN ranges:
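For future checks, the search for reboot markers in an FEE's /var/log/messages could be scripted roughly as below. This is a sketch under assumptions: the helper names `parse` and `find_restarts` are hypothetical, the year is taken as 2021 because syslog lines do not carry one, and a reboot is recognised only by the "syslogd ... restart." line seen in the aida06 excerpt above. Note that in the excerpts the aida06 restart is stamped 05:37 while the server saw the mount requests at 06:37, so the two clocks appear to differ by about an hour; the sketch does not correct for that.

```python
import re
from datetime import datetime

# "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart." -> stamp, host, message
LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def parse(line, year=2021):
    """Split one syslog line into (datetime, host, message), or None."""
    m = LINE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2), m.group(3)


def find_restarts(lines, year=2021):
    """Return (host, last stamp before restart, restart stamp) tuples."""
    restarts = []
    previous = None
    for line in lines:
        record = parse(line, year)
        if record is None:
            continue
        stamp, host, message = record
        if message.startswith("syslogd") and message.rstrip().endswith("restart."):
            restarts.append((host, previous, stamp))
        previous = stamp
    return restarts


# Three lines copied from the aida06 excerpt in this entry.
lines = [
    "Mar 12 23:33:06 aida06 kernel: xaida: open:",
    "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.",
    "Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.",
]
restarts = find_restarts(lines)
for host, before, after in restarts:
    print(f"{host}: restart at {after}, log silent since {before} ({after - before})")
```

The gap reported is only the time since the previous log line, not the time of the fault itself; as noted above, the log gives no indication of why the reboot happened.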
Attachment 1:
210313_0138_Stats.png
Original size: 1913x463
Attachment 2:
210313_0141_Temp.png
Original size: 1910x474
Attachment 3:
210313_0142_Bias.png
Original size: 503x329
Attachment 4:
210313_0355_Stats.png
Original size: 1906x438
Attachment 5:
210313_0356_Bias.png
Original size: 494x311
Attachment 6:
210313_0356_Temp.png
Original size: 1904x455
Attachment 7:
210313_0528_Stats.png
Original size: 1912x436
Attachment 8:
210313_0529_Temp.png
Original size: 1907x475
Attachment 9:
210313_0530_Bias.png
Original size: 485x320
Attachment 10:
2210313_0640_AIDAissue.png
Original size: 1907x702
Attachment 11:
210313_0647_aida6MMissing.png
Original size: 1907x437