Message ID: 194     Entry time: Fri Mar 12 23:56:00 2021
Author: OH 
Subject: Saturday March 13th 00:00-08:00 
24:00 Joined in with TD to debug the dropout of AIDA05 - See https://elog.ph.ed.ac.uk/DESPEC/193
      After both powercycles and telnet reboots we were still seeing 0 in the statistics, but did note that the counter was > 0.
      The last thing we tried was a restart of the merger server. This solved the issue.
      It is likely that the link to the FEE was dropped but not re-established upon the resets.
      No error message indicated that this had happened. From now on I recommend refreshing the statistics every 30 minutes.

01:29 System wide check
N.B. May not have reset baselines after reset
WR fault
		 Base 		Current 	Difference
aida01 fault 	 0x1ad : 	 0x1af : 	 2  
aida02 fault 	 0x4434 : 	 0x4436 : 	 2  
aida03 fault 	 0xcf95 : 	 0xcf99 : 	 4  
aida04 fault 	 0x25aa : 	 0x25ae : 	 4  
aida05 fault 	 0x4b4 : 	 0x4b7 : 	 3  
aida06 fault 	 0x834 : 	 0x837 : 	 3  
aida07 fault 	 0x146d : 	 0x1472 : 	 5  
aida08 fault 	 0xf4a8 : 	 0xf4ab : 	 3  
aida09 fault 	 0x4f6f : 	 0x4f73 : 	 4  
aida10 fault 	 0x7504 : 	 0x7506 : 	 2  
aida11 fault 	 0x26fa : 	 0x26fc : 	 2  
aida12 fault 	 0x2858 : 	 0x285b : 	 3  
White Rabbit error counter test result: Passed 0, Failed 12
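
The comparison behind the table above can be reproduced offline. The following is a minimal sketch, not the actual check script: the dictionary layout and the function name `compare_counters` are assumptions, and a module is taken to pass only when its counter is unchanged since the baseline (counter wrap-around is not handled).

```python
# Base and current WR fault counters copied from the 01:29 check above.
WR_COUNTERS = {
    "aida01": (0x1ad, 0x1af),
    "aida02": (0x4434, 0x4436),
    "aida03": (0xcf95, 0xcf99),
    "aida04": (0x25aa, 0x25ae),
    "aida05": (0x4b4, 0x4b7),
    "aida06": (0x834, 0x837),
    "aida07": (0x146d, 0x1472),
    "aida08": (0xf4a8, 0xf4ab),
    "aida09": (0x4f6f, 0x4f73),
    "aida10": (0x7504, 0x7506),
    "aida11": (0x26fa, 0x26fc),
    "aida12": (0x2858, 0x285b),
}


def compare_counters(counters):
    """Return ({module: difference}, passed, failed).

    Assumption: a module passes when its counter has not moved from the baseline.
    """
    diffs = {name: current - base for name, (base, current) in counters.items()}
    failed = sum(1 for d in diffs.values() if d != 0)
    return diffs, len(diffs) - failed, failed


diffs, passed, failed = compare_counters(WR_COUNTERS)
for name, d in diffs.items():
    print(f"{name} fault difference {d}")
print(f"White Rabbit error counter test result: Passed {passed}, Failed {failed}")
```
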

Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
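
As an illustration of how the bit descriptions above would apply to a status word, here is a small sketch. It assumes the status is read as an integer with bit 0 as the least significant bit; the function name `decode_wr_status` and the example value 0x9 are hypothetical and not taken from the AIDA software.

```python
# Bit meanings copied from the status report text above.
WR_STATUS_BITS = {
    3: "White Rabbit decoder detected an error in the received data",
    2: "Firmware registered WR error, no reload of Timestamp",
    0: "White Rabbit decoder reports uncertain of Timestamp information from WR",
}


def decode_wr_status(status):
    """Return the descriptions of the documented bits that are set in status."""
    return [text for bit, text in WR_STATUS_BITS.items() if status & (1 << bit)]


# Hypothetical example value with bits 3 and 0 set.
for line in decode_wr_status(0x9):
    print(line)
```
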

FPGA Check
			 Base 		Current 		Difference
aida12 fault 	 0x0 : 	 0x4 : 	 4  
FPGA Timestamp error counter test result: Passed 11, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last

01:42 Statistics - attachment 1
      Temperature - attachment 2
      Bias and leakage current - attachment 3

03:57 System wide checks

		 Base 		Current 	Difference
aida07 fault 	 0x1472 : 	 0x1473 : 	 1  
White Rabbit error counter test result: Passed 11, Failed 1

			 Base 		Current 		Difference
aida12 fault 	 0x4 : 	 0xc : 	 8  
FPGA Timestamp error counter test result: Passed 11, Failed 1

      Statistics - attachment 4
      Temp - attachment 6
      Bias - attachment 5

05:27 System wide checks

		 Base 		Current 	Difference
aida07 fault 	 0x1472 : 	 0x1474 : 	 2  
White Rabbit error counter test result: Passed 11, Failed 1

			 Base 		Current 		Difference
aida07 fault 	 0x0 : 	 0x1 : 	 1  
aida12 fault 	 0x4 : 	 0x16 : 	 18  
FPGA Timestamp error counter test result: Passed 10, Failed 2


     Statistics - attachment 7
     Temp - attachment 8
     Bias - attachment 9

05:51 Looking into the bad timestamp messages in the merger e.g.
MERGE Data Link (28348): bad timestamp  5 3 0xc14f8415 0x00eea280 0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80

Looking in the merger source for the bad timestamp message:

            if (op->Time < LastTimeStamp) {

//     invalid time stamp 

               (*StatsMem[TSSEQERR])++;

               (*StatsMem[TSSEQERR+((LinkNum+1)*MAXCOUNTERS)])++;

               sprintf(message_buffer, "bad timestamp  %d %d 0x%08lx 0x%08lx 0x%016llx 0x%016llx 0x%016llx",LinkNum, link_table[LinkNum]->link_state, op->Data, op->Timestamp, INFO4, op->Time, LastTimeStamp);
      
                report_message(MSG_WARNING);   /***************************/

//              LastTimeStamp = op->Time;


So the message is generated when the merger detects a timewarp.
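
To make these warnings easier to read in bulk, the fields can be pulled out following the order of the sprintf arguments quoted above (LinkNum, link_state, Data, Timestamp, INFO4, Time, LastTimeStamp). This is a rough sketch only: the regular expression, the function name `parse_bad_timestamp` and the derived `backstep` field are my own additions, not part of the merger.

```python
import re

# Field order follows the sprintf format string in the merger source excerpt.
BAD_TS = re.compile(
    r"bad timestamp\s+(\d+)\s+(\d+)"
    r"\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
    r"\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
)


def parse_bad_timestamp(line):
    """Parse one merger warning; return a dict of fields, or None if no match."""
    m = BAD_TS.search(line)
    if m is None:
        return None
    data, timestamp, info4, time, last = (int(g, 16) for g in m.groups()[2:])
    return {
        "link": int(m.group(1)),
        "link_state": int(m.group(2)),
        "data": data,
        "timestamp": timestamp,
        "info4": info4,
        "time": time,
        "last": last,
        # How far the new time lies behind the previous one (timestamp units).
        "backstep": last - time,
    }


EXAMPLE = ("MERGE Data Link (28348): bad timestamp  5 3 0xc14f8415 0x00eea280 "
           "0x0000cd2e70eea280 0x166bcd2e70eea280 0x166bcd2e716c1f80")
fields = parse_bad_timestamp(EXAMPLE)
print(fields["link"], fields["backstep"])
```

For the example message from 05:51 this gives link 5 and a backward step of 8224000 timestamp units.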

Took the first warning in a block of errors (the first instance of that particular LastTimeStamp) and calculated the time difference between the new timestamp and the last timestamp.
The Time and LastTimeStamp were 0x166bccef3f424a50 and 0x166bccef4f4232e0 respectively.
The time difference between them is 268429456 ns, i.e. about 268 ms, which seems quite a large difference.
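
The arithmetic can be checked with a few lines. This is only a worked check of the numbers quoted above, assuming 1 ns per timestamp unit; the variable names are mine. The last line is a purely arithmetic observation that the step is close to 2**28, offered without any claim about the cause.

```python
# Values quoted in the entry above.
time_stamp = 0x166bccef3f424a50
last_time_stamp = 0x166bccef4f4232e0

# The new timestamp is earlier than the last one, so subtract in this order.
diff = last_time_stamp - time_stamp
print(diff)            # timestamp units
print(diff / 1e6)      # milliseconds, assuming 1 ns per unit
print(2**28 - diff)    # distance of the step from 2**28
```
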


06:39 AIDA crashed
      This time all FEEs were responsive but not showing any stats.
      When stopping the DAQ, all FEEs except aida06 stopped - attachment 10
      Did a reset of the DAQ and all recovered, but there were still no stats on aida06 - attachment 11
      Regained DAQ with a powercycle and a complete restart of the AIDA:8115 Merger and TapeServer

      It is worth noting that aida06 is connected to link 5, the data link which had been producing the bad timestamp messages in the merger overnight.
      We have now had this in aida05, aida06 and aida07. Could it be to do with the correlation scaler rate going into these FEEs?

      Going through /var/log/messages on aida-3, aida06 rebooted itself at 06:37:
Mar 13 06:37:16 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:918 for /home/Embedded/XilinxLinux/ppc_4xx/rfs/aida06 (/home/Embedded/XilinxLinux/ppc_4xx/rfs)
Mar 13 06:37:18 aidas-gsi xinetd[4578]: START: time-stream pid=0 from=::ffff:192.168.11.6
Mar 13 06:37:32 aidas-gsi rpc.mountd[4497]: authenticated mount request from 192.168.11.6:862 for /home/npg/MIDAS_Releases/23Jan19/MIDAS_200119 (/home/npg/MIDAS_Releases/23Jan19/MIDAS_200119)

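Looking in /var/log/messages on aida06, there is no evidence of a reason why (the aida06 log timestamps appear to be one hour behind those on aida-3, so the restart logged at 05:37:30 corresponds to the 06:37 reboot):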
Mar 12 23:30:56 aida06 kernel: Trying to free nonexistent resource <0000000007000000-0000000007ffffff>
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: mem region start 0x7000000 for 0x1000000 mapped at 0xd2380000
Mar 12 23:30:56 aida06 kernel: AIDAMEM: aidamem: driver assigned major number 253
Mar 12 23:31:13 aida06 kernel: xaida: open:
Mar 12 23:31:14 aida06 kernel: AIDAMEM: aidamem_open:
Mar 12 23:33:06 aida06 kernel: xaida: open:
Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.
Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.
Mar 13 05:37:30 aida06 kernel: Using Xilinx Virtex440 machine description
Mar 13 05:37:30 aida06 kernel: Linux version 2.6.31 (nf@nnlxb.dl.ac.uk) (gcc version 4.2.2) #34 PREEMPT Tue Nov 15 15:57:04 GMT 2011
Mar 13 05:37:30 aida06 kernel: Zone PFN ranges:
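
For future shifts, reboots like this one could be spotted by scanning an FEE's /var/log/messages for the syslogd restart line seen above. The sketch below assumes the line format shown in this excerpt; the function name `find_restarts` and the regular expression are illustrative only.

```python
import re

# Matches lines such as "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart."
RESTART = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) syslogd [\d.]+: restart\."
)


def find_restarts(lines):
    """Return (host, timestamp) for every syslogd restart line found."""
    restarts = []
    for line in lines:
        m = RESTART.match(line.strip())
        if m:
            restarts.append((m.group(2), m.group(1)))
    return restarts


# Lines copied from the aida06 log excerpt above.
SAMPLE = [
    "Mar 12 23:33:06 aida06 kernel: xaida: open:",
    "Mar 13 05:37:30 aida06 syslogd 1.4.2: restart.",
    "Mar 13 05:37:30 aida06 kernel: klogd 1.4.2, log source = /proc/kmsg started.",
]
restarts = find_restarts(SAMPLE)
print(restarts)
```
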
Attachment 1: 210313_0138_Stats.png  68 kB
Attachment 2: 210313_0141_Temp.png  91 kB
Attachment 3: 210313_0142_Bias.png  9 kB
Attachment 4: 210313_0355_Stats.png  67 kB  Uploaded Sat Mar 13 03:00:27 2021
Attachment 5: 210313_0356_Bias.png  6 kB  Uploaded Sat Mar 13 03:00:27 2021
Attachment 6: 210313_0356_Temp.png  89 kB  Uploaded Sat Mar 13 03:00:27 2021
Attachment 7: 210313_0528_Stats.png  66 kB  Uploaded Sat Mar 13 04:30:38 2021
Attachment 8: 210313_0529_Temp.png  91 kB  Uploaded Sat Mar 13 04:30:38 2021
Attachment 9: 210313_0530_Bias.png  6 kB  Uploaded Sat Mar 13 04:30:38 2021
Attachment 10: 2210313_0640_AIDAissue.png  99 kB  Uploaded Sat Mar 13 06:15:17 2021
Attachment 11: 210313_0647_aida6MMissing.png  65 kB  Uploaded Sat Mar 13 06:15:25 2021