AIDA
Entry time:
Mon Aug 15 09:55:08 2016
<p>The problem:-</p>
<p>The DAQ stops working and can only be fixed by a reboot of the FEE64s.</p>
<p>The candidate:-</p>
<p>AidaExecV8, the DAQ program running in the FEE64s which handles the transfer of data items from the FPGA to the Merger, uses a device driver, aidamem, to copy the data from DMA memory into Linux memory.</p>
<p>Sometimes AidaExecV8 can be killed by the kernel due to a "page allocation failure", which occurs when the kernel memory space has become fragmented. When a block of contiguous memory is requested to receive the copy of the data from the FPGA DMA memory, the kernel can't allocate it and kills the process.</p>
<p>The effect of this would be seen at the Merger, which would be left waiting for data from the FEE.</p>
<p>Since AidaExecV8 has been killed, there will be no response to status requests.</p>
<p>A recent change to the operation of the AIDA system, flushing buffers regularly, is the most likely culprit. A flush of the buffers on a slow FEE will use small memory blocks, while faster FEE data will always require full-size buffers, so the kernel memory will have large contiguous areas in constant use. This would explain why this failure type is relatively recent.</p>
<p>A solution:- Change the request for the memory copy to always use the maximum size. This way the memory shouldn't become fragmented. The fragmentation of memory will also be investigated to see if there is a way to monitor it.</p>
<p> </p>
<p>Some corroboration:-</p>
<p>The error messages from the kernel that indicate this fault has occurred can be read from the FEE64 root file system, in the text file /var/log/messages. A grep of these files at RIKEN for the phrase "page allocation failure" shows that such failures have occurred. It remains to be seen whether the dates and times of these failures align with the dates and times of the system failures.</p>
<p> </p>
<p>All comments and ideas gratefully received.</p>
<p>Update:- Tested this solution on the T9 system and it doesn't work! The failure still occurs when the DAQ is set to write all input to disc.</p>
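<p>On monitoring the fragmentation: one possible starting point, not yet tried on the FEE64s, is the standard Linux file /proc/buddyinfo, which lists for each memory zone the number of free blocks of each order (a block of order n is 2^n contiguous pages). The sketch below assumes the FEE64 kernel exposes that file and that Python is available wherever its contents are read. The function names and the sample line are made up for illustration and are not part of AidaExecV8 or aidamem.</p>

```python
# Minimal sketch: summarise free contiguous memory from /proc/buddyinfo.
# Assumption: the FEE64 kernel provides /proc/buddyinfo in the usual format,
#   Node 0, zone   Normal   <free blocks of order 0> <order 1> ... <order 10>
# All names below are hypothetical helpers, not part of the AIDA software.


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for each order]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue  # skip blank or unexpected lines
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        zones[(node, zone)] = [int(n) for n in fields[4:]]
    return zones


def largest_free_order(counts):
    """Highest order that still has a free block, or -1 if none is free."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order] > 0:
            return order
    return -1


def free_pages_at_or_above(counts, min_order):
    """Pages held in free blocks of at least min_order (2**order pages each)."""
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)


# Invented sample line describing a badly fragmented zone (not real FEE64 data).
SAMPLE = (
    "Node 0, zone   Normal    120     40      6      1"
    "      0      0      0      0      0      0      0\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the invented sample off-target
    for (node, zone), counts in sorted(parse_buddyinfo(text).items()):
        print(
            f"node {node} zone {zone}: "
            f"largest free order {largest_free_order(counts)}, "
            f"pages in blocks of order >= 4: {free_pages_at_or_above(counts, 4)}"
        )
```

<p>If the largest free order reported for a FEE64 drifts below the order needed for a full-size buffer shortly before a "page allocation failure" appears in /var/log/messages, that would support the fragmentation hypothesis. If it does not, the cause probably lies elsewhere.</p>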