16th May 08:00 - 12:00 shift
Author: MS
08:00
FEE64 module aida09 global clocks failed, 6
Clock status test result: Passed 15, Failed 1
Understand status as follows
Status bit 3 : firmware PLL that creates clocks from external clock not locked
Status bit 2 : always logic '1'
Status bit 1 : LMK3200(2) PLL and clock distribution chip not locked to external clock
Status bit 0 : LMK3200(1) PLL and clock distribution chip not locked to external clock
If all these bits are not set then the operation of the firmware is unreliable
FEE64 module aida09 failed
Calibration test result: Passed 15, Failed 1
If any modules fail calibration , check the clock status and open the FADC Align and Control browser page to rerun calibration for that module
Base Current Difference
aida01 fault 0xf294 : 0xf296 : 2
aida02 fault 0xd8ec : 0xd8ee : 2
aida03 fault 0xf001 : 0xf003 : 2
aida04 fault 0xd992 : 0xd994 : 2
aida05 fault 0x714c : 0x7163 : 23
aida06 fault 0x5a49 : 0x5a4a : 1
aida07 fault 0x5aca : 0x5acb : 1
aida08 fault 0xb92e : 0xb92f : 1
White Rabbit error counter test result: Passed 8, Failed 8
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida05 fault 0x0 : 0xa : 10
aida12 fault 0x0 : 0x3 : 3
aida13 fault 0x0 : 0x4d : 77
FPGA Timestamp error counter test result: Passed 13, Failed 3
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
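The "Base Current Difference" rows above can be sanity-checked by parsing the two hex counters and confirming that the last column equals current minus base. This is a minimal sketch, not part of the AIDA check scripts: the helper name `parse_rows` and the regular expression are hypothetical, and it assumes the counters have not wrapped between the base and current readings.

```python
import re

# Rows copied from the 08:00 check output above.
ROWS = """\
aida01 fault 0xf294 : 0xf296 : 2
aida05 fault 0x714c : 0x7163 : 23
aida13 fault 0x0 : 0x4d : 77
"""

# module name, base counter (hex), current counter (hex), reported difference (decimal)
ROW_RE = re.compile(r"(aida\d+) fault (0x[0-9a-fA-F]+) : (0x[0-9a-fA-F]+) : (\d+)")


def parse_rows(text):
    """Return (module, base, current, reported_difference) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.fullmatch(line.strip())
        if match:
            module, base, current, diff = match.groups()
            rows.append((module, int(base, 16), int(current, 16), int(diff)))
    return rows


for module, base, current, diff in parse_rows(ROWS):
    # Assumes no counter wrap-around between the two readings.
    status = "consistent" if current - base == diff else "MISMATCH"
    print(f"{module}: base={base:#x} current={current:#x} diff={diff} ({status})")
```

Comparing the current value between successive checks (for example 0x7163 at 08:00 and 0x7166 at 10:00 for aida05) shows whether a counter is still incrementing.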
Returned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mem(KB) : 4 8 16 32 64 128 256 512 1k 2k 4k
aida01 : 27 7 2 3 2 3 2 4 2 3 7 : 40228
aida02 : 3 8 3 2 2 2 2 4 2 3 7 : 39996
aida03 : 23 9 6 1 0 3 2 4 2 3 7 : 40100
aida04 : 34 27 17 7 2 4 3 3 2 2 7 : 38608
aida05 : 19 9 5 2 2 2 3 3 2 3 7 : 39844
aida06 : 5 5 2 1 0 4 2 4 2 3 7 : 40060
aida07 : 28 2 10 3 2 4 1 4 2 3 7 : 40192
aida08 : 19 5 5 1 2 5 1 3 2 3 7 : 39652
aida09 : 20 8 3 3 1 3 2 3 2 3 7 : 39648
aida10 : 27 10 0 2 2 2 2 3 1 4 7 : 40572
aida11 : 3 3 2 1 2 4 2 3 2 3 7 : 39652
aida12 : 16 7 11 2 3 4 2 2 2 3 7 : 39464
aida13 : 14 10 4 3 1 5 2 2 2 3 7 : 39400
aida14 : 21 9 10 1 1 2 2 3 1 4 7 : 40604
aida15 : 19 8 2 3 0 4 3 2 2 3 7 : 39436
aida16 : 24 5 8 3 3 4 2 2 2 3 7 : 39464
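The last column of the memory table appears to be the total free memory in KB, obtained by multiplying each count by its block size from the `Mem(KB)` header and summing. Reading the columns as counts of free blocks per size is an assumption, although the arithmetic does reproduce the reported totals for the rows checked here. The function name `total_free_kb` is hypothetical.

```python
# Block sizes in KB, taken from the "Mem(KB)" header row (1k, 2k, 4k are in units of 1024 KB).
BLOCK_KB = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]


def total_free_kb(counts):
    """Sum count * block size over all columns (assumed meaning of the last column)."""
    if len(counts) != len(BLOCK_KB):
        raise ValueError("expected one count per block size")
    return sum(n * size for n, size in zip(counts, BLOCK_KB))


# Rows copied from the 08:00 table above: (counts, reported total).
TABLE = {
    "aida01": ([27, 7, 2, 3, 2, 3, 2, 4, 2, 3, 7], 40228),
    "aida02": ([3, 8, 3, 2, 2, 2, 2, 4, 2, 3, 7], 39996),
    "aida04": ([34, 27, 17, 7, 2, 4, 3, 3, 2, 2, 7], 38608),
}

for module, (counts, reported) in TABLE.items():
    print(module, total_free_kb(counts), "reported", reported)
```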
*** Timestamp elapsed time: 225.065 s
FEE elapsed dead time(s) elapsed idle time(s)
0 0.038 0.000
1 9.479 0.000
2 0.195 0.000
3 5.921 0.000
4 0.000 11.742
5 0.036 0.000
6 0.013 0.000
7 0.498 0.000
8 0.436 0.000
9 0.000 107.300
10 2.787 0.000
11 0.905 0.000
12 0.831 0.000
13 0.000 55.939
14 0.080 0.000
15 0.267 0.000
16 0.000 0.000
17 0.000 0.000
18 0.000 0.000
19 0.000 0.000
20 0.000 0.000
21 0.000 0.000
22 0.000 0.000
23 0.000 0.000
24 0.000 0.000
25 0.000 0.000
26 0.000 0.000
27 0.000 0.000
28 0.000 0.000
29 0.000 0.000
30 0.000 0.000
31 0.000 0.000
32 0.000 0.000
10:00
FEE64 module aida06 global clocks failed, 6
FEE64 module aida09 global clocks failed, 6
Clock status test result: Passed 14, Failed 2
Understand status as follows
Status bit 3 : firmware PLL that creates clocks from external clock not locked
Status bit 2 : always logic '1'
Status bit 1 : LMK3200(2) PLL and clock distribution chip not locked to external clock
Status bit 0 : LMK3200(1) PLL and clock distribution chip not locked to external clock
If all these bits are not set then the operation of the firmware is unreliable
FEE64 module aida09 failed
Calibration test result: Passed 15, Failed 1
If any modules fail calibration , check the clock status and open the FADC Align and Control browser page to rerun calibration for that module
Base Current Difference
aida01 fault 0xf294 : 0xf296 : 2
aida02 fault 0xd8ec : 0xd8ee : 2
aida03 fault 0xf001 : 0xf003 : 2
aida04 fault 0xd992 : 0xd994 : 2
aida05 fault 0x714c : 0x7166 : 26
aida06 fault 0x5a49 : 0x5a4a : 1
aida07 fault 0x5aca : 0x5acb : 1
aida08 fault 0xb92e : 0xb92f : 1
White Rabbit error counter test result: Passed 8, Failed 8
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida05 fault 0x0 : 0xa : 10
aida12 fault 0x0 : 0x3 : 3
aida13 fault 0x0 : 0x4d : 77
FPGA Timestamp error counter test result: Passed 13, Failed 3
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
Returned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mem(KB) : 4 8 16 32 64 128 256 512 1k 2k 4k
aida01 : 19 7 6 2 2 3 2 3 2 3 7 : 39716
aida02 : 8 4 3 2 2 3 2 3 2 3 7 : 39600
aida03 : 23 11 5 1 0 4 2 3 1 4 7 : 40740
aida04 : 42 25 16 7 2 4 4 3 2 2 7 : 38864
aida05 : 24 5 5 1 2 2 3 3 1 4 7 : 40824
aida06 : 11 4 4 1 1 4 2 4 2 3 7 : 40172
aida07 : 21 8 5 1 2 3 2 4 2 3 7 : 40196
aida08 : 15 5 6 1 1 4 2 4 2 3 7 : 40228
aida09 : 15 10 4 1 2 3 2 3 2 3 7 : 39660
aida10 : 23 8 2 2 2 2 2 3 1 4 7 : 40572
aida11 : 6 4 4 2 2 4 2 3 2 3 7 : 39736
aida12 : 18 8 8 1 4 4 3 2 2 3 7 : 39720
aida13 : 23 6 2 3 2 5 2 2 2 3 7 : 39436
aida14 : 29 3 11 1 1 2 2 3 1 4 7 : 40604
aida15 : 29 7 2 2 0 4 3 3 2 3 7 : 39948
aida16 : 2 3 6 2 3 4 2 2 2 3 7 : 39296
*** Timestamp elapsed time: 225.065 s
FEE elapsed dead time(s) elapsed idle time(s)
0 0.038 0.000
1 9.479 0.000
2 0.195 0.000
3 5.921 0.000
4 0.000 11.742
5 0.036 0.000
6 0.013 0.000
7 0.498 0.000
8 0.436 0.000
9 0.000 107.300
10 2.787 0.000
11 0.905 0.000
12 0.831 0.000
13 0.000 55.939
14 0.080 0.000
15 0.267 0.000
16 0.000 0.000
17 0.000 0.000
18 0.000 0.000
19 0.000 0.000
20 0.000 0.000
21 0.000 0.000
22 0.000 0.000
23 0.000 0.000
24 0.000 0.000
25 0.000 0.000
26 0.000 0.000
27 0.000 0.000
28 0.000 0.000
29 0.000 0.000
30 0.000 0.000
31 0.000 0.000
32 0.000 0.000
12:00-16:00
16th May 12:00 - 16:00 shift
Author: JS
11:57 Taking over from Magda. Running full checks.
ucesb ok. Max ~1700 Hz 1MHz on DSSD1, DSSD ~ 75%
Current ok 06.410 uA 006.835 uA
Stats good 1Statistics aidas-gsi(6).png
Temps ok 1Temperature and status scan aidas-gsi(6).png
Analysis ok R7_385. Dead time FEE1 a little high 6%
PAUSE: 166 RESUME: 166
*** Timestamp elapsed time: 196.305 s
FEE elapsed dead time(s) elapsed idle time(s)
0 0.044 0.000
1 12.405 0.000
2 0.807 0.000
3 7.260 0.000
4 0.014 0.000
5 0.458 0.000
6 0.027 0.000
7 0.497 0.000
8 0.723 0.000
9 0.000 88.857
10 6.189 0.000
11 1.443 0.000
12 0.474 0.000
13 0.000 35.565
14 0.000 0.000
15 0.147 0.000
16 0.000 0.000
17 0.000 0.000
18 0.000 0.000
19 0.000 0.000
20 0.000 0.000
21 0.000 0.000
22 0.000 0.000
23 0.000 0.000
24 0.000 0.000
25 0.000 0.000
26 0.000 0.000
27 0.000 0.000
28 0.000 0.000
29 0.000 0.000
30 0.000 0.000
31 0.000 0.000
32 0.000 0.000
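The dead-time percentages quoted in the notes can be reproduced from the table above by dividing each elapsed dead time by the timestamp elapsed time. A minimal sketch follows; it assumes that row 1 of the table corresponds to the "FEE1" remark in the notes, and the function name `dead_fraction_percent` is hypothetical.

```python
# Values copied from the table above (timestamp elapsed time 196.305 s).
ELAPSED_S = 196.305
DEAD_S = {1: 12.405, 3: 7.260, 10: 6.189}  # table row -> elapsed dead time in seconds


def dead_fraction_percent(dead_s, elapsed_s):
    """Dead time as a percentage of the elapsed time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return 100.0 * dead_s / elapsed_s


for row, dead in DEAD_S.items():
    print(f"row {row}: {dead_fraction_percent(dead, ELAPSED_S):.1f}% dead")
```

Row 1 comes out at about 6.3%, in line with the roughly 6% noted for FEE1.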
FEE64 module aida09 global clocks failed, 6
Clock status test result: Passed 15, Failed 1
FEE64 module aida09 failed
Calibration test result: Passed 15, Failed 1
Base Current Difference
aida01 fault 0xf294 : 0xf296 : 2
aida02 fault 0xd8ec : 0xd8ee : 2
aida03 fault 0xf001 : 0xf003 : 2
aida04 fault 0xd992 : 0xd994 : 2
aida05 fault 0x714c : 0x716e : 34
aida06 fault 0x5a49 : 0x5a4a : 1
aida07 fault 0x5aca : 0x5acb : 1
aida08 fault 0xb92e : 0xb92f : 1
White Rabbit error counter test result: Passed 8, Failed 8
Base Current Difference
aida05 fault 0x0 : 0xa : 10
aida12 fault 0x0 : 0x3 : 3
aida13 fault 0x0 : 0x4d : 77
FPGA Timestamp error counter test result: Passed 13, Failed 3
Returned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mem(KB) : 4 8 16 32 64 128 256 512 1k 2k 4k
aida01 : 20 7 4 2 2 3 2 3 1 4 7 : 40712
aida02 : 25 10 7 1 1 4 1 3 2 3 7 : 39556
aida03 : 22 10 5 1 0 4 2 4 2 3 7 : 40216
aida04 : 40 24 17 7 2 5 3 3 2 2 7 : 38736
aida05 : 4 8 3 0 1 3 3 3 1 4 7 : 40768
aida06 : 21 6 6 2 2 3 2 4 2 3 7 : 40228
aida07 : 25 4 6 3 2 3 2 4 2 3 7 : 40260
aida08 : 19 9 4 1 1 4 2 4 2 3 7 : 40244
aida09 : 16 9 4 2 2 3 2 3 2 3 7 : 39688
aida10 : 19 10 5 1 2 2 2 4 1 4 7 : 41100
aida11 : 21 10 2 1 2 3 3 3 2 3 7 : 39908
aida12 : 25 7 7 2 2 3 2 3 2 3 7 : 39756
aida13 : 13 7 3 4 2 5 2 3 2 3 7 : 39964
aida14 : 21 7 9 2 1 2 2 4 1 4 7 : 41116
aida15 : 23 6 2 3 0 4 3 3 2 3 7 : 39948
aida16 : 9 11 10 2 3 4 2 2 2 3 7 : 39452
Tom says the Aida09 clock fail is ok as its status value is 6.
The large White Rabbit difference for FEE5 is known and has been determined to be ok; a post-run investigation will be undertaken.
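For reference, the clock status value 6 reported for aida09 can be split into the individual status bits listed in the check output. The sketch below only shows which bits are set; it does not decide whether a set bit means "locked" or "not locked", since the log does not make the polarity explicit. The names `CLOCK_STATUS_BITS` and `set_bits` are hypothetical.

```python
# Bit labels shortened from the "Understand status as follows" text in the check output.
CLOCK_STATUS_BITS = {
    3: "firmware PLL that creates clocks from external clock",
    2: "always logic '1'",
    1: "LMK3200(2) PLL and clock distribution chip",
    0: "LMK3200(1) PLL and clock distribution chip",
}


def set_bits(status):
    """Return the bit numbers that are set in a 4-bit status value, highest first."""
    return [bit for bit in sorted(CLOCK_STATUS_BITS, reverse=True) if status & (1 << bit)]


status = 6  # value reported for aida09
print(f"status {status} = {status:#06b}")
for bit in sorted(CLOCK_STATUS_BITS, reverse=True):
    state = "set" if bit in set_bits(status) else "clear"
    print(f"  bit {bit} ({CLOCK_STATUS_BITS[bit]}): {state}")
```

For a status of 6 (binary 0110) bits 2 and 1 are set and bits 3 and 0 are clear.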
12:35 -
ucesb ok.
Current ok
Stats good
Temps ok
Analysis ok R7_395. Dead time FEE1 still high 6.6%
13:33 -
ucesb ok. Max implants ~ 1.8kHz
Current ok 006.850 uA 007.250 uA
Stats - Aida11 running low < 5k was ~20k overnight
Temps ok
Analysis R7_415. Dead time FEE1 & FEE10 high 10%
14:00
ucesb ok - ucesb1.png
Current ok - bias1.png
Stats - Aida11 low -
14:20 aida fee rebooted itself. A powercycle was performed. Upon reboot we are seeing extremely large amounts of noise in the FEEs. Looking at the waveforms we have very large 100kHz pick up in the FEEs. This has resulted in 50% deadtime in many FEEs including the p+n.
15:28 Because of extremely high rates across all FEEs, we have decided to do a powercycle. Before restarting the FEEs we will give them a couple of minutes to cool.
18:00 Since the start of the shift we have been trying to recover the system from a large increase in noise following the crash at ~14:00.
During this time NH has entered the area and inspected the system and also grounded the AIDA snout. This provided us with some improvement on the noise.
The rates are still slightly above where we were before the crash but now appear stable. To counteract the dead time in the n+n strips we have raised the threshold to 0x64 for ASIC4 in all FEEs.
We are now running with around 10% deadtime on FEE4 and less elsewhere for n+n. For p+n most have zero dead time apart from FEE11 which is still noisy.
During the time we were trying to recover the system, screenshots were taken of the waveforms. Here it could be seen that the 100kHz noise was very apparent, particularly in the n+n strips.
18:08 System wide checks all ok - bar some ADC but waveforms disabled
Statistics ok - 210516_1809_Stats
Temperatures ok - 210516_1809_Temp
Bias and leakage ok - 210516_1810_Bias
18:37 Performed an ASIC check and now the rates have dropped in all n+n strips. Currently very small amounts of dead time
18:40 Realised this was because it raised the threshold of all strips to 0x64 on the n+n side.
19:25 Removed S452 from 1e2.... drive. Before removing, checked with Nic that it was backed up to Lustre. Also verified for ourselves.
Now have around 4.2TB left which will provide around 80 hours of writing
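The 80 hour estimate implies an average write rate, which can be backed out from the two numbers above. This is a rough sketch only: it assumes decimal units (1 TB = 10^12 bytes) and a constant data rate, and the function name `implied_write_rate_mb_per_s` is hypothetical.

```python
FREE_TB = 4.2   # free space quoted in the log
HOURS = 80.0    # estimated writing time quoted in the log


def implied_write_rate_mb_per_s(free_tb, hours):
    """Average write rate in MB/s that would fill free_tb in the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    free_bytes = free_tb * 1e12          # assumes decimal terabytes
    return free_bytes / (hours * 3600.0) / 1e6


rate = implied_write_rate_mb_per_s(FREE_TB, HOURS)
print(f"{rate:.1f} MB/s, i.e. {rate * 3600 / 1000:.1f} GB per hour")
```

With these figures the implied rate is roughly 14.6 MB/s, or about 52.5 GB per hour; a higher data rate would shorten the available writing time proportionally.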
20:16 We noticed iptraf was using around 30% CPU. We investigated whether it had any effect on the dead time, but from what we have seen it has not.
20:57 System wide checks. Clock ok
Base Current Difference
aida05 fault 0x1552 : 0x1556 : 4
White Rabbit error counter test result: Passed 15, Failed 1
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida05 fault 0x0 : 0x1 : 1
FPGA Timestamp error counter test result: Passed 15, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
Statistics ok - 210516_2056_Stats
Temp ok - 210516_2058_Temp
Bias and leakage current ok - 210516_2058_Bias
23:16 System wide checks:
Clock still ok
Base Current Difference
aida05 fault 0x1552 : 0x155a : 8
White Rabbit error counter test result: Passed 15, Failed 1
Understand the status reports as follows:-
Status bit 3 : White Rabbit decoder detected an error in the received data
Status bit 2 : Firmware registered WR error, no reload of Timestamp
Status bit 0 : White Rabbit decoder reports uncertain of Timestamp information from WR
Base Current Difference
aida05 fault 0x0 : 0x2 : 2
FPGA Timestamp error counter test result: Passed 15, Failed 1
If any of these counts are reported as in error
The ASIC readout system has detected a timeslip.
That is the timestamp read from the time FIFO is not younger than the last
Statistics - 210516_2315_Stats
Temperature - 210516_2316_Temp
Bias and leakage current ok - 210516_231 |