ID   Date   Author   Subject
  237   Thu May 26 01:44:45 2016   Patrick Coleman-Smith   [HowTo] Use the Multiplicity Trigger Firmware, V8.18, and update the files
Done. Awaiting test. See attachments 1-2.

> The attached firmware file, FEE_GF_Feb16_18.bin, should be saved to /MIDAS/Aida and the FlashPgm.csh file edited
> to load "FEE_GF_Feb16_18.bin"
> 
> Use the Run Control Expert command to update the firmware on all the FEE64s.
> 
> Power cycle the FEE64s to load the new firmware.
> 
> Make a backup copy of the files in the directory /MIDAS/TclHttpd/Html/AIDA/DISC.
> 
> Unzip the files from DISC_25May16.zip and copy them into /MIDAS/TclHttpd/Html/AIDA/DISC to replace the existing
> ones.
> 
> Make a backup copy of the file /MIDAS/TclHttpd/Html/AIDA/LOCAL.tml
> 
> Unzip the file from LOCAL_tml_25May16.zip and copy into /MIDAS/TclHttpd/Html/AIDA/LOCAL to replace the existing
> file.
> 
> 'RESET' and browser refresh the two updated control windows to load the new software.
> -----------------------------------------------------
> 
> To use the Multiplicity Trigger:-
> In the Local Controls select option 5 for the Trigger.
> 
> In the Discriminator window.
> 
> Set the 'Upper Limit'
> Set the 'Lower Limit' ( always greater than 0 ..... or the Trigger will always be present )
> Set the 'Time window'. This is the number of 10 ns clocks, so 4 => 40 ns. The register range is 0 to 255.
> 
> -----------------------------------------------------
> 
> These registers are not in the Save/Restore set yet.
> 
> Tested on the T9 system using the pulser and setting/clearing Mask bits.
> 
> Please let me know how this works for you.
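
The 'Time window' arithmetic above can be sketched in a few lines of Python. This is only an illustration under the stated figures (10 ns per clock, register range 0 to 255); the function name `time_window_register` is hypothetical and is not part of the AIDA software.

```python
CLOCK_PERIOD_NS = 10   # one clock tick, as stated in the entry
REGISTER_MAX = 255     # register range is 0 to 255


def time_window_register(window_ns: int) -> int:
    """Return the register value (number of 10 ns clocks) for a window in ns."""
    if window_ns % CLOCK_PERIOD_NS != 0:
        raise ValueError("window must be a whole number of 10 ns clocks")
    value = window_ns // CLOCK_PERIOD_NS
    if not 0 <= value <= REGISTER_MAX:
        raise ValueError("window does not fit in the 0 to 255 register")
    return value


print(time_window_register(40))    # 4, matching "4 => 40 ns"
print(time_window_register(2550))  # 255, the longest window the register can hold
```

The longest window the register can express is therefore 2550 ns.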
  239   Thu May 26 10:03:11 2016   Patrick Coleman-Smith   [HowTo] Specification Document for MACB including suggested setup.
Attached is the specification for the MACB with the settings currently available.
  240   Thu May 26 14:19:13 2016   Patrick Coleman-Smith   [HowTo] Access AIDA documentation
The latest version of the FEE64 software Interface document ( May 2016 ), EDOC955, is available as a .pdf from
http://npg.dl.ac.uk/documents/edoc000/#AIDA

The version available via the pink "documentation" button in the Control browser window is now out of date. This
will be updated at some future time.

Other useful documents are also available there:
MACB specification.
GREAT Data Format.
  268   Wed Jun 1 12:14:08 2016   Patrick Coleman-Smith   [HowTo] React to "Check Clock Status => FEE64 Failure"
If there is more than one failure then check how they map to the MACB HDMI connections.
 Are they all connected to just one MACB ? => Check the HDMI cable at the MACB Layer Port.
 Are they connected to several MACBs which have a common "Parent" ? => Check the HDMI cable at the "Parent"
MACB Layer Port.

If there is just one failure then open the Local Controls browser window for the failing FEE64.
If the "Status Register" @1 is not 0x7 then the Waveform ADCs will not operate correctly and ASIC readout will
be compromised.
To remedy this, try:
A)    Enter 0x2B to "LMK03200 control register" @5 
          then Reload the Page 
          then Enter 0xB
          then Reload the Page and check the status, if 0x7 all ok. 

B)     Enter 0xA to "LMK03200 control register" @5 
          then Reload the Page
          then Enter 0xB
          then Reload the Page
          then enter 0x2B to "LMK03200 control register" @5 
          then Reload the Page 
          then Enter 0xB
          then Reload the Page and check the status, if 0x7 all ok.  
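
The two remedies can be written down as a short sketch. It assumes that B) is meant to be tried only if A) does not restore 0x7, which the entry does not state explicitly. The callbacks `write_ctrl` and `read_status` are hypothetical stand-ins for "Enter a value at @5 then Reload the Page" and "read the Status Register @1"; they are not real AIDA functions, and `FakeFee` is a toy model used only to exercise the logic.

```python
from typing import Callable

STATUS_OK = 0x7  # "Status Register" @1 value when the clocks are healthy

# Values entered into "LMK03200 control register" @5, in the order given above.
SEQUENCE_A = [0x2B, 0xB]
SEQUENCE_B = [0xA, 0xB, 0x2B, 0xB]


def try_recover(write_ctrl: Callable[[int], None],
                read_status: Callable[[], int]) -> str:
    """Return "ok" if already healthy, else "A" or "B" for the remedy that worked."""
    if read_status() == STATUS_OK:
        return "ok"
    for name, sequence in (("A", SEQUENCE_A), ("B", SEQUENCE_B)):
        for value in sequence:
            write_ctrl(value)  # stands in for: enter the value, then reload the page
        if read_status() == STATUS_OK:
            return name
    return "failed"


class FakeFee:
    """Toy stand-in for one FEE64: it only locks once 0xA has been written."""

    def __init__(self) -> None:
        self.writes: list[int] = []

    def write_ctrl(self, value: int) -> None:
        self.writes.append(value)

    def read_status(self) -> int:
        locked = 0xA in self.writes and self.writes[-1] == 0xB
        return 0x7 if locked else 0x5


fee = FakeFee()
print(try_recover(fee.write_ctrl, fee.read_status))  # B
```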
  325   Thu Jul 7 17:16:11 2016   Patrick Coleman-Smith   Install & Run new firmware with diagnostics for the SYNC
Connected to aidas from daresbury.
Started Firefox with windows for Rpi, Merge, Runcontrol.
Power on FEEs from Pi.
All power-up/RESET/SETUP ok.

Change Firmware from 0x17500C15 to 0x16600C1B which adds SYNC counters to the Master and Slave Timestamp logic.
Power cycle using Pi.
nnaida20 failed to start properly. Logged in as Root ok. TclHttpd Server not in place. dmesg showed no start of
XAIDA or XAIDAMEM. Instigated "reboot" by command line not power cycle. All fine after this completed.

Copied /MIDAS/TclHttpd/Html/AIDA to a backup copy.
Changed LOCAL.*, Aida.*, Check.*

System Wide Checks now includes a check of the SYNC counters.

Checked temperatures. Max is nnaida12 with 68.62

SYNC check :- All read 693248.
The Master counts how many SYNC pulses are issued. All Slaves count SYNC pulses received.
The Check uses a Re-Sync pulse generated without telling the Slaves a Re-Sync is due. The Master and Slaves copy
their SYNC counter to a register when the Re-SYNC occurs. These are read back and displayed.
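
The comparison behind this check can be illustrated with a small sketch. It treats the most common latched value as the expected one, which is an assumption (the real check may instead compare every Slave against the Master), and the function name `sync_outliers` is hypothetical. The sample values are the ones reported further down in this entry, where only nnaida22 disagrees.

```python
from collections import Counter


def sync_outliers(counts: dict[str, int]) -> dict[str, int]:
    """Return the FEE64s whose latched SYNC count differs from the most common value."""
    if not counts:
        return {}
    expected, _ = Counter(counts.values()).most_common(1)[0]
    return {name: value for name, value in counts.items() if value != expected}


readings = {"nnaida12": 8006655, "nnaida20": 8006655, "nnaida22": 8006388}
print(sync_outliers(readings))  # {'nnaida22': 8006388}
```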

Started the DAQ as setup with no transfer.
Good event rates range from 502Kitems/sec to 594K.

SYNC counter check : All 839680

Stop DAQ/ Enable Transfer/Merge in Pause/Go DAQ.
All the following times are DAQ local.

19:15 Good Events 414k to 484k across all 23 FEEs.

Stop DAQ/change all thresholds to 200/Go DAQ

Good Events 4.2K to 348K

19:38 Sync counter check :- 1653760 all
Temperatures Max nnaida12 69.12

20:01 Temperatures Max nnaida12 69.19
Sync counter check :- 2181120 all

20:19 Temperatures Max nnaida12 69.25
Sync counter check :- 2590720 all

21:09 Temperatures Max nnaida12 69.19
Sync counter check :- 3729408 all

21:23 Toggled Merge Pause. Doesn't SYNC.
Check statistics. SYNC rate ok, SYNC counts uneven. PAUSE range from 0 to 1158.
DAQ stop/zero statistics/restart Merge/setup and configure/Go Merge/Pause Merge/Go DAQ/Toggle Pause
21:43 Merging now at 4943691 items/sec.

Temperatures Max nnaida12 69.25

DL to RIKEN communication link failed.

23:02 Reconnect by firefox in ribfdaq.
Temperatures Max nnaida12 69.31
Merging still active.

23:20 Sync counter check :- 6722560 all
No sync errors in system check.

23:45 Merger running 5.4M items/sec
Temperatures max nnaida12 69.44

Stop DAQ/Toggle Pause on Merger All ok.
23:51 Sync counter check 7439360 all except nnaida22 which returns -1
Retry Sync counter check 7478272 all except nnaida22 which returns -1

Log in to nnaida22, Servers look ok. AIDAExec running ok. dmesg has nothing extra.
Runcontrol shows nnaida22 as stopped.
Access nnaida22 alone and shows undefined.

Using diagnostics checked able to read registers correctly and read the sync counter register as 7478272.

Retry Sync counter check all 8006655 except nnaida22 which shows 8006388.

Check SYNC error counters and all ok except nnaida22 which has 6363 errors.

Finishing work ..... power off FEE64s using Pi, check a couple with ping from aidas1 .... fine.

Write up ELOG.

Repeat tests on Monday.
  327   Thu Jul 14 16:55:32 2016   Patrick Coleman-Smith   Changes to AIDA software

An updated copy of the merger has been downloaded ( merger64.AD ) with a copy of /MIDAS/Merger/AIDA/Startup.

These two enable the output of all of the SYNC data items to the TapeServer.

 

The System Wide Checks browser window has been updated and includes a report of the number of SYNC pulses sent by the Master timestamp module and the number of SYNC pulses received by the Slave modules. These should all be the same value.

 

The ASIC4 browser window has been updated to add two actions to the Experts only menu. "Disable all ASICs For Readout"  and "Enable all ASICs for Readout". These two are useful when starting a noisy system. The sequence would be to start the Merger, toggle Pause. Then start the DAQ ( FEE64s ). "Disable all ASICs For Readout" then in the Merger toggle Pause. I found this made starting the merging much more successful. Then "Enable all ASICs for Readout" to run. I shall add similar actions to the Discriminators but will ensure the Mask Patterns set by the Restore are maintained and restored by the "Enable" action.

Before adding this capability it was quite hard to tune the system to guarantee the Merger a full set of SYNC data items at any one time. I have been running the FEE64s with low thresholds, detectors and no HV.

 

  328   Mon Jul 18 12:00:16 2016   Patrick Coleman-Smith   Report of remote operation of Aida at RIKEN 11/7 to 14/7

11th July ---------------------------------------------------------------------------------------------------------

09:32 UK time, logged in and power-up all FEEs OK. Setup fine, Temperature Max 12@64

System Checks all fine SYNC count 2243584.

Start Merger and Toggle Pause for no merging.

DAQ start with output enabled.

Toggle Pause in Merger -> “Want First SYNC” with no change.

Tried this several times with the same result.

Checked DAQ statistics => Buffers/sec = 50 to 65, SYNC/sec = 40 to 80  ( should be 384 )

Disabled all ASIC readout and disable all discriminators. DAQ still in GO state.

Fetched new version of ASIC4 with Enable/Disable ASIC readout commands in the “Expert” menu.

SYNC statistics: all FEEs are now at 384/sec.

Merger … Toggle Pause and its fine. 334 items/sec.

DAQ : ASIC4 : Set all slow comparator Thresholds to 250 : Enable ASICs for readout.

Merger now 532811 items/sec.

DAQ SYNC statistics range 307 to 390/sec

LT 22:36 System Checks => Sync counts 6518784 all ok, Sync errors all passed.

LT 23:30 Merger 539826 items/sec. Temperatures => 12@68.12 ; SYNC statistics 361 to 385 ; SYNC counts 8099840 all ok ; Sync errors all ok ; ASIC clock timestamp all ok.

STOP then Power-down all fine.

13th July ----------------------------------------------------------------------------------------------------------------------

Log in and power-up fine.

Backup Merger file ( merge.AD )  and copy latest version from DL. Won’t copy as file already open.

So instead …. Disable ASIC readout, disable Discriminators and run the system.

When Merger is operating , Enable ASIC readout => Merger rate is 8,058,412 items/sec.

LT 19:44 => Temperatures max 12@68.81 , SYNC received 845824 all ok; Sync errors all ok.

LT 20:11 => Merge rate 7,991,198 items/sec, SYNC received 1551359 all ok; Sync errors all ok.

LT 21:07 => Temperatures Max 12@69.00 ; SYNC received 2742272 all.

LT 23:20 => Stopped the whole system.

Kill the Merge process and the Startup process. Copy new Merge.AD successful. Edit Merger startup script to include netint OutputAll 1.

Start up Merger using /MIDAS/Merger/AIDA/Startup in terminal connected to aidas1.

Merger : Setup and configure ; Go ; Toggle Pause All Ok.

Go DAQ   - disable all ASIC readout.

Merger : Toggle Pause ; Merging 8832 items/sec ( = 384 x 23 …. OK )

Connect to TapeServer setup for R3 into /TapeData/July2016 , Go Tape ok.

Merger : Toggle Transfer.

LT 00:08 => enable ASICS for readout. Run for a short while. Merge spinner hesitates for 2 seconds at a time. Merge rate 5,951,471 items/sec. Tape server rate 3,5148 Kbytes/sec.

Stop system, SYNC received 8085504 all ok ; sync error all ok. Power down. Sftp tape data files to nndhcp052 in Daresbury.

14th July -------------------------------------------------------------------------------------------------------------------------

Startup and run the system to tape again R4, ASICs with no discriminators. All fine.

LT 20:19 Enable Discriminators. Merger rate 3.9Mi/sec ( Mega items/second)

LT 20:57 Temperature Max 12@69 , Merge rate 4.2Mi/sec

UK 14:49 Network failure between Daresbury and Edinburgh. Lost contact repeatedly 4 times in a row. Contacted DL network support. No local networking faults.

LT 23:05 Re-connected to Aida in RIKEN. Merger 3.3Mi/sec.

Decided to stop due to concerns about flaky network and leaving the system powered up over the weekend.

SYNC received 7222272 all ok; sync error ok.

Power down. R4_244 is the last file written.

 

  331   Thu Jul 28 11:35:58 2016   Patrick Coleman-Smith   [HowTo] Apply Thermal paste to FPGA
  333   Mon Aug 1 11:30:07 2016   Patrick Coleman-Smith   First time remotely power up HV in RIKEN from Daresbury

Logged into aidas1 and opened the nnrpi1 relay control window ... powered up the FEE64s.

Separately connected to ribfdaq and opened a Firefox window to aidas1:8015.

By the time the firefox had appeared on my desktop the FEE64s had started up.

Ran through the standard startup. Set thresholds to 255 for all 3 sets of comparators. Disabled all discriminators.

Enable histograms.

Started DAQ.

Using ELOG 74 as a guide logged into nnrpi1 and ran two putty sessions ( USB2 and USB3 ).

Based on the instructions in the HV unit manual attached to ELOG75, powered on all 6 detectors ( current when settled at 100 V = 3 to 5 µA )

Using the spectrum browser at nnaida6 and ASIC4 to control the slow comparator threshold ..... needed to reach 32 before the pulser became visible.

Using nnaida6:1.3.L found the pulser peak at channel 32136 with a peak width of 67.73 channels.

Then poweroff the HV on all 6 detectors.

Power off the FEE64s.

Logout of all RIKEN machines and consider the best way to take advantage of the situation which allows the system to be operated completely remotely.

 

 

  334   Mon Aug 15 09:55:08 2016   Patrick Coleman-Smith   Found a candidate for DAQ failure.

The problem :-

The DAQ stops working and can only be fixed by a reboot of the FEE64s.

The Candidate:-

The DAQ program running in the FEE64s which handles the transfer of the data items from the FPGA to the Merger, AidaExecV8, uses a device driver, aidamem, to copy the data from DMA memory into Linux memory.

Sometimes AidaExecV8 can be killed by the kernel due to a "page allocation failure", which occurs when the kernel memory space has become fragmented. When a block of contiguous memory is requested to receive the copy of the data from the FPGA DMA memory, the kernel can't allocate the memory and kills the process.

The effect of this would be seen at the Merger where it would be waiting for data from the FEE.

Since the AidaExecV8 has been killed then there will be no response to status requests.

A recent change to the operation of the Aida system, flushing buffers regularly, is the most likely culprit. The flush of the buffers on a slow FEE will use small memory blocks, while faster FEE data will always require full-size buffers, which means the kernel memory will have large contiguous areas in constant use. This would explain why this failure type is relatively recent.

A solution :- Change the request for memory copy to always use the maximum size. This way there shouldn't be fragmentation of the memory. The fragmentation of memory will also be investigated to see if there is a way to monitor it.

 

Some corroboration:-

The error messages from the kernel that indicate this fault has occurred can be read from the FEE64 root file system, in the text file /var/log/messages. A grep of these in RIKEN using the phrase "page allocation failure" shows they have occurred. It remains to be seen whether the dates and times of the failures align with the dates and times of the system failures.
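
A minimal sketch of such a search, written in Python rather than grep so that the hits are grouped per FEE64. It assumes that each FEE64 root file system is a directory on the server containing var/log/messages; that layout and the function name `find_failures` are assumptions. The demonstration writes a placeholder line into a temporary directory; it is not real kernel output.

```python
import tempfile
from pathlib import Path

PHRASE = "page allocation failure"


def find_failures(rfs_root: Path) -> dict[str, list[str]]:
    """Map each root file system directory name to its matching log lines."""
    hits: dict[str, list[str]] = {}
    for log in sorted(rfs_root.glob("*/var/log/messages")):
        lines = [line
                 for line in log.read_text(errors="replace").splitlines()
                 if PHRASE in line]
        if lines:
            # <rfs_root>/<name>/var/log/messages -> parents[2] is <rfs_root>/<name>
            hits[log.parents[2].name] = lines
    return hits


# Offline demonstration with a made-up log file (placeholder text only).
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "nnaida17" / "var" / "log" / "messages"
    log.parent.mkdir(parents=True)
    log.write_text("unrelated line\nplaceholder: page allocation failure\n")
    print(find_failures(Path(tmp)))
```

Keeping the full matching lines means the timestamps in them can later be compared by hand with the times of the system failures.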

 

All comments and ideas gratefully received.

Tested this solution on the T9 system and it doesn't work ! The failure still occurs when the DAQ is set to write all input to disc.

  335   Wed Aug 17 14:10:05 2016   Patrick Coleman-Smith   Linux Memory problem solutions for FEE64

The problem indicated by "page allocation failure", and the kernel subsequently "killing" the AidaExecV8 process, is due ( it appears ) to memory availability.

Solution 1:- Reduce the amount of memory used for histogramming. That would be by disabling the .V histograms or changing the size of the .H and/or .L histograms from 65536 to 32768 and using the shift attribute in the Options. This will reduce the memory required.

Solution 2:- Change a kernel setting in /etc/sysctl.conf to request that there is always a minimum amount of free memory. Insert the line "vm.min_free_kbytes=4096" in the file.

Solution 2 has been tried on the system in T9 while the flag enabling raw data to be written to disc is set. Previous to the solution, with the same operating conditions,  the fault would occur after about 15 minutes. Subsequent to the solution the system has run for 46 hours.

I propose that the .V histograms are disabled and the kernel setting is used. ( both solutions :-) The .V histograms are a late addition and not required at present.

 

  336   Wed Aug 17 14:44:52 2016   Patrick Coleman-Smith   [HowTo] start acquisition using the new "Start ASIC Readout" button

Due to sometimes operating with very noisy systems, both at Daresbury and RIKEN, I have installed a change to the startup of the system.

The Runcontrol GO button now only starts the output of SYNC pulses. A further button labelled "Start ASIC Readout" ( see attached photo ) needs to be pressed to start the ASIC ADC, Discriminator and Correlation readout.

After carrying out the normal operations to start the system and pressing the GO button the Merger can be checked to be certain it is merging the events. Since there are only SYNC pulses it should soon be apparent if this is not happening and appropriate actions can be taken. ( Kill/Reload the Link and Merger processes using the "Merger for AIDA" icon at the top of the screen is one way. )

Once it is certain the merging is in progress then the new button labelled "Start ASIC Readout" ( see attached photo ) needs to be pressed to start the ASIC ADC, Discriminator and Correlation readout.

This must be done after every GO.

  343   Mon Aug 22 11:45:07 2016   Patrick Coleman-Smith   Software updated in Aida

Logged into aidas1, installed AidaExecV8 with a test of the time difference between SYNC data items. If the difference is not 0x40000 clocks, a statistic, "SYNC Time Warp", is incremented.
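
The idea of the test can be sketched as follows. This is not the AidaExecV8 code, only an illustration of the comparison described above; the function name `count_time_warps` is hypothetical, the timestamps are synthetic, and timestamp wrap-around is ignored.

```python
EXPECTED_SYNC_INTERVAL = 0x40000  # clocks between consecutive SYNC data items


def count_time_warps(sync_timestamps: list[int]) -> int:
    """Count consecutive SYNC pairs whose spacing is not 0x40000 clocks."""
    return sum(
        1
        for earlier, later in zip(sync_timestamps, sync_timestamps[1:])
        if later - earlier != EXPECTED_SYNC_INTERVAL
    )


good = [n * 0x40000 for n in range(5)]
bad = good[:3] + [good[3] + 7] + good[4:]  # one SYNC arrives 7 clocks late
print(count_time_warps(good))  # 0
print(count_time_warps(bad))   # 2
```

A single displaced SYNC shows up twice in this sketch, because both the interval before it and the interval after it are wrong.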

Installed updated versions of /MIDAS/TclHttpd/Html/AIDA/RunControl/sys.tcl and /MIDAS/TclHttpd/Html/AIDA/RunControl/implementation.tcl to allow latest System Wide Check to operate.

Installed updated versions of Check.tcl/.js/.tml into /MIDAS/TclHttpd/Html/AIDA/Check/

Checked all the /MIDAS/linux-ppc_4xx/startup files including aidacommon for entries relating to the flush and push rates. Only aidacommon has an entry as required. Currently both are set to 30.

Changed the /etc/sysctl.conf file in each of the 32 root file systems (rfs) in /MIDAS/XilinxLinux/ppc_4xx/rfs/nnaida##/ to add the line "vm.min_free_kbytes=4096".
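
For repeating this edit across many root file systems, a hedged sketch of an idempotent helper is shown below. It works on the text of a sysctl.conf file rather than on the file itself, so it can be tried without touching the server. The function name `ensure_min_free_kbytes` is hypothetical, and the choice to drop any earlier active value of the same key is an assumption, not something the entry describes.

```python
SETTING = "vm.min_free_kbytes"
VALUE = "4096"


def ensure_min_free_kbytes(conf_text: str) -> str:
    """Return conf_text with exactly one active vm.min_free_kbytes=4096 line."""
    kept = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#") or stripped.startswith(";")
        key = stripped.split("=", 1)[0].strip()
        if not is_comment and key == SETTING:
            continue  # drop any earlier value so the setting is not duplicated
        kept.append(line)
    kept.append(f"{SETTING}={VALUE}")
    return "\n".join(kept) + "\n"


original = "# example file\nkernel.panic = 10\n"
print(ensure_min_free_kbytes(original), end="")
```

Applying the function twice gives the same text as applying it once, so it is safe to re-run over all the nnaida## directories.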

The changes will take effect from the next power-cycle of the FEEs and "reset" of the browser windows on the server.

  359   Wed Sep 7 16:11:25 2016   Patrick Coleman-Smith   Updated layout.txt file in aidas1

Used the layout reported in ELOG357 and updated the layout.txt file used in the "Layout and Status" browser window.

The file is in /MIDAS/config/TclHttpd/aidas1

The browser window "Layout and Status" is accessed from the Control browser window.

It is intended to be used to determine if the Embedded Linux is operating in the FEE64. Each FEE64 in the layout.txt file is pinged with a timeout of 1 second. The layout then shows red for no response and green for a response.
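
The ping-and-colour logic can be sketched as below. This is not the code behind the "Layout and Status" window. The format of layout.txt is not described here, so the sketch simply takes a list of host names. The `ping` flags are the Linux iputils ones (`-c 1` for one packet, `-W 1` for a one second timeout), which is an assumption about the server. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo with the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def layout_status(hosts: Iterable[str],
                  ping: Callable[[str], bool] = ping_once) -> dict[str, str]:
    """Map each host to "green" (replied) or "red" (no reply)."""
    return {host: "green" if ping(host) else "red" for host in hosts}


# Offline demonstration with a stand-in ping; pretend only nnaida22 is down.
def fake_ping(host: str) -> bool:
    return host != "nnaida22"


print(layout_status(["nnaida12", "nnaida22"], ping=fake_ping))
# {'nnaida12': 'green', 'nnaida22': 'red'}
```

A green entry only shows that the Embedded Linux answers a ping; it says nothing about whether the DAQ software on that FEE64 is running.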

  465   Wed Nov 23 14:50:59 2016   Patrick Coleman-Smith   [ Info ] What does the System Wide Clock check mean when it fails.

The System Wide check called "Check Clock Status"  reads the status register from the FEE64 units in the system and compares the value of each bit against a template depending if the FEE64 is Master or not.

The bit fields have the following meaning :-

  1. clkd_ld1_pin. This is the "locked" status of the PLL in clock distribution chip LMK03200 #1. The clocks are used for the Waveform ADCs. ADCs 1 to 4 and FPGA Waveform decode logic.
  2. clkd_ld2_pin. This is the "locked" status of the PLL in clock distribution chip LMK03200 #2. The clocks are used for the Waveform ADCs. ADCs 5 to 8 and the Master SYNC PLL in the FPGA.
  3. aq_clock_locked. This is the "locked" status of the PLL in the FPGA. The clock is used for all data acquisition functions. If this isn't true ('1') then the module is not going to work as part of the system.
  4. sync_locked. This is the "locked" status of the PLL in the FPGA which provides a 200MHz clock to the SYNC pulse alignment logic. This is only used in the Master.
  5. to 31 read as '0' 

The System Wide Check will only report the state of bits 0 to 2 but by opening the "Local Controls" browser window the status value can be seen ( after a reload ) at offset 1.

 

  530   Wed Jan 18 14:16:47 2017   Patrick Coleman-Smith   Repaired Modules returning to RIKEN

Five modules are being returned to RIKEN after repair.

The MAC addresses are :-

00:04:a3:2a:ED:8f 

00:04:a3:2a:f6:d4

00:04:a3:2b:22:6e

00:04:a3:2b:33:15

00:04:a3:2a:d0:1a

00:04:a3:2b:33:0c

00:04:a3:2a:b6:45

00:04:a3:2a:b2:b2

00:04:a3:2b:09:da

00:04:a3:2b:11:c5

 

 

  559   Mon May 8 10:00:02 2017   Patrick Coleman-Smith   [HowTo] Operating document for the Pi_Monitor FEE64 console logging equipment

The operating procedure for the Raspberry Pi FEE64 console monitor is attached.

Please let me know if more detail is required.

 

  575   Thu May 18 15:20:44 2017   Patrick Coleman-Smith   [HowTo] Connecting a USB console interface to the FEE64

The three attachments show the orientation of the console cable pcb relative to the FEE64 connectors.

The first with the other cables and the other two with no cables for clarity of position.

  595   Wed May 24 13:39:12 2017   Patrick Coleman-Smith   Examination of Console Logs -- and a couple of reminders

I have tarred and zipped all 24 logs and shipped them back to the UK. Now I can browse through them.

Studying the "panics" :=

       nnaida17 has one which is an NFS mount failure.

       nnaida20 and 22 have one each which are "Starting midas:  Page fault in user mode with in_atomic() = 1 mm = c61bf200" but with different values for the mm=.

I am studying the web to understand why this may occur in a particular program. Both occur at similar places  after "Starting midas".

There are reports of "Clock not locked. status = 0xc" during Setup Electronics which is a bit unusual. These occur in nnaida4, 17 and 22.

Also   "Clock not locked. status = 0xd"  in nnaida23,17,3,19,6,5,14,7,9. Further investigation to understand all these.

The bit allocation of the Clock Status word is:- 

Bit 0 : Lock Detect bit from LMK03200 #1

Bit 1 : Lock Detect bit from LMK03200 #2

Bit 2 : Lock detect from the internal DCM for the mux clock.

Bit 3: iDelay ready signal

The LMK03200s are the two PLLs which are connected to the 50MHz input clock. The clock setup in the log directly before the error report for nnaida17 had reported correctly setup and the ADCs had calibrated; these rely on a stable 50MHz from the LMK03200s.
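
To make the two reported values easier to read, here is a small decoding sketch based on the bit allocation listed above. It assumes that a set bit means "locked" or "ready", so a clear bit is the one to worry about; the function name `clear_bits` is hypothetical.

```python
CLOCK_STATUS_BITS = {
    0: "LMK03200 #1 lock detect",
    1: "LMK03200 #2 lock detect",
    2: "internal DCM lock (mux clock)",
    3: "iDelay ready",
}


def clear_bits(status: int) -> list[str]:
    """Return the names of the status bits that are 0 (assumed not locked / not ready)."""
    return [name for bit, name in CLOCK_STATUS_BITS.items()
            if not (status >> bit) & 1]


print(clear_bits(0xC))  # ['LMK03200 #1 lock detect', 'LMK03200 #2 lock detect']
print(clear_bits(0xD))  # ['LMK03200 #2 lock detect']
print(clear_bits(0xF))  # []
```

Under that assumption, status 0xc means neither LMK03200 reported lock, and status 0xd means only LMK03200 #2 did not.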

 

I have also spotted some diagnostic information about the waveform readout that is useful.

 

Just a reminder..... when power cycling the equipment please leave the power off for at least 10 seconds to allow all the capacitors in the supply chain to discharge.

Also note that the console monitor window in the Pi can be used to check that all FEE64s have "finished" the power-on sequence by using the "parse" button.

Since it doesn't use the DAQ software it won't lock-up and it will give an indication of progress.

  597   Wed May 24 14:51:06 2017   Patrick Coleman-Smith   Examination of Console Logs -- and a couple of reminders

 

Quote:

> I have tarred and zipped all 24 logs and shipped them back to the UK. Now I can browse through them.
> 
> Studying the "panics" :=
> 
>        nnaida17 has one which is an NFS mount failure.

Do we understand why? Are there different/additional NFS mount options to be considered?

>        nnaida20 and 22 have one each which are "Starting midas:  Page fault in user mode with in_atomic() = 1 mm = c61bf200" but with different values for the mm=.
> 
> I am studying the web to understand why this may occur in a particular program. Both occur at similar places  after "Starting midas".
> 
> There are reports of "Clock not locked. status = 0xc" during Setup Electronics which is a bit unusual. These occur in nnaida4, 17 and 22.
> 
> Also   "Clock not locked. status = 0xd"  in nnaida23,17,3,19,6,5,14,7,9. Further investigation to understand all these.
> 
> The bit allocation of the Clock Status word is:- 
> 
> Bit 0 : Lock Detect bit from LMK03200 #1
> 
> Bit 1 : Lock Detect bit from LMK03200 #2
> 
> Bit 2 : Lock detect from the internal DCM for the mux clock.
> 
> Bit 3: iDelay ready signal
> 
> The LMK03200s are the two PLLs which are connected to the 50MHz input clock. The clock setup in the log directly before the error report for nnaida17 had reported correctly setup and the ADCs had calibrated; these rely on a stable 50MHz from the LMK03200s.
> 
> I have also spotted some diagnostic information about the waveform readout that is useful.
> 
> Just a reminder..... when power cycling the equipment please leave the power off for at least 10 seconds to allow all the capacitors in the supply chain to discharge.

We do - standard procedure is to wait 20s.

> Also note that the console monitor window in the Pi can be used to check that all FEE64s have "finished" the power-on sequence by using the "parse" button.
> 
> Since it doesn't use the DAQ software it won't lock-up and it will give an indication of progress.

We do check for further panics. I assume we would have to read each system log to check that the boot sequence and app load had completed?

 
