Message ID: 334     Entry time: Mon Aug 15 09:55:08 2016
Author: Patrick Coleman-Smith 
Subject: Found a candidate for DAQ failure. 

The problem :-

The DAQ stops working and can only be fixed by a reboot of the FEE64s.

The Candidate:-

The DAQ program running in the FEE64s which handles the transfer of the data items from the FPGA to the Merger, AidaExecV8, uses a device driver, aidamem, to copy the data from DMA memory into Linux memory.

Sometimes AidaExecV8 can be killed by the kernel due to a "page allocation failure", which occurs when the kernel memory space has become fragmented. When a block of contiguous memory is requested to receive the copy of the data from the FPGA DMA memory, the kernel cannot allocate it and kills the process.

The effect of this would be seen at the Merger where it would be waiting for data from the FEE.

Since AidaExecV8 has been killed, there will be no response to status requests.

A recent change to the operation of the Aida system, flushing buffers regularly, is the most likely culprit. The flush of the buffers on a slow FEE will use small memory blocks, while faster FEE data will always require full-size buffers, which means the kernel memory will have large contiguous areas in constant use. This would explain why this failure type is relatively recent.

A solution :- Change the request for memory copy to always use the maximum size. This way there shouldn't be fragmentation of the memory. The fragmentation of memory will also be investigated to see if there is a way to monitor it.
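One possible way to monitor the fragmentation, sketched here in Python: Linux reports, in /proc/buddyinfo, the number of free blocks of each order (a block of order n is 2**n contiguous pages) for every memory zone. This assumes the FEE64 kernel provides /proc/buddyinfo and that Python is available on the FEE64, neither of which has been checked. The function names are made up for illustration, and the allocation order that aidamem actually requests is not known here.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from /proc/buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Count free blocks big enough for a request of 2**order contiguous pages."""
    return sum(counts[order:])


def read_buddyinfo(path="/proc/buddyinfo"):
    """Read and parse the live file (path is the standard Linux location)."""
    with open(path) as f:
        return parse_buddyinfo(f.read())
```

If free_blocks_at_or_above() for the order used by the copy request drops to zero while plenty of low-order blocks remain, the memory is fragmented rather than exhausted. Logging this value periodically on each FEE64 would show whether it falls before a DAQ failure.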

 

Some corroboration:-

The error messages from the kernel that indicate this fault has occurred can be read from the FEE64 root file system, in the text file /var/log/messages. A grep of these at RIKEN using the phrase "page allocation failure" shows they have occurred. It remains to be seen whether the dates and times of these failures align with the dates and times of the system failures.
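The comparison of dates and times could be done with a short script such as the Python sketch below. It assumes the lines in /var/log/messages start with the traditional syslog time stamp ("Mon DD HH:MM:SS", no year), which has not been confirmed for the FEE64 logs. The function names and the 10 minute window are arbitrary choices for illustration, and the DAQ failure times would have to be entered by hand from the logbook.

```python
from datetime import datetime, timedelta

PHRASE = "page allocation failure"


def failure_times(log_text, year):
    """Return the time stamps of all log lines containing PHRASE."""
    times = []
    for line in log_text.splitlines():
        if PHRASE not in line:
            continue
        # First three fields are assumed to be month, day, time.
        stamp = " ".join(line.split()[:3])
        try:
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # line does not start with a syslog-style time stamp
    return times


def near(failures, daq_stops, window=timedelta(minutes=10)):
    """Pair each DAQ stop with any kernel failure within the window."""
    return [(s, f) for s in daq_stops for f in failures if abs(s - f) <= window]
```

A DAQ stop with no kernel failure nearby would suggest that not all of the stops are explained by this candidate.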

 

All comments and ideas gratefully received.

Tested this solution on the T9 system and it doesn't work! The failure still occurs when the DAQ is set to write all input to disc.
