[SNMP4J] TableUtils - GetBulk responses - Possible race condition

Frank Fock fock at agentpp.com
Sat Mar 15 13:39:41 CET 2008


Hi Damian,

You can workaround the problem by:

(a) increase the timeout of the requests.
(b) set the number of retries to 0.

You can also try out the nightly build
(must be SNMP4J_20080316_*_snapshot.zip)
from http://www.oosnmp.com/~katz/snapshot/

The bug is a bit different than you thought:
The agent just responds while SNMP4J sends
out a retry (therefore the same request ID).
Unfortunately, the response of the second
request is also being processed because the
first request has not yet been fully removed
from the outstanding request table. I have
improved the synchronization to avoid such
race condition in the future.

Best regards,
Frank

Damian wrote:
> Hi Frank,
> 
> I tried with version 1.9 and I am still able to replicate it.
> 
> I added the following instrumentation to TableUtils.java:
> 
> ***************
> *** 305,306 ****
> --- 305,307 ----
>      private LinkedList rowCache = new LinkedList();
> +     private Vector requestIds = new Vector();
>      protected Vector lastReceived;
> ***************
> *** 399,400 ****
> --- 400,402 ----
>          PDU response = event.getResponse();
> +         requestIds.add(response.getRequestID());
>          int cols = request.size();
> ***************
> *** 466,467 ****
> --- 468,470 ----
>                                                   
> TableEvent.STATUS_WRONG_ORDER));
> +                 logger.error("OIDs in wrong order. Request IDs 
> processed by TableUtils.TableRequest.onResponse: " + requestIds);
>                  return;
> 
> This is the log message I got that makes me believe that the stack is 
> bubbling up the same request id twice:
> 
> 2008-03-14 17:10:24,931 ERROR [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.error - OIDs in wrong order. Request IDs processed by 
> TableUtils.TableRequest.onResponse: [690986669, 690986671, 690986675, 
> 690986679, 690986683, 690986687, 690986690, 690986694, 690986698, 
> 690986702, 690986706, 690986710, 690986714, 690986718, 690986722, 
> 690986726, 690986730, 690986734, 690986738, 690986742, 690986746, 
> 690986749, 690986754, 690986757, 690986760, 690986764, 690986764]
> 
> In this message, we can see request-id 690986764 (29:2f:9f:0c) being 
> processed twice and I think that is the probable cause of the 
> STATUS_WRONG_ORDER.
> 
> I am attaching all the logs generated by snmp4j 1.9 in my test. I am 
> also attaching the modified version of TableUtils.java that I used.
> 
> These are the log messages that I believe pertain to the problem:
> 
> 2008-03-14 17:10:24,192 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Running pending async request with handle 
> PduHandle[690986764] and retry count left 3
> 2008-03-14 17:10:24,193 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Sending message to 172.17.1.45/161 with length 
> 101: 30:63:02:01:01:04:06:70:75:62:6c:69:63:a5:56:02:04:29:2f:9f:0c 
> (cont) ...
> 2008-03-14 17:10:24,393 DEBUG [Timer-0] Log4jLogAdapter.debug - Running 
> pending async request with handle PduHandle[690986764] and retry count 
> left 2
> 2008-03-14 17:10:24,394 DEBUG [Timer-0] Log4jLogAdapter.debug - Sending 
> message to 172.17.1.45/161 with length 101: 
> 30:63:02:01:01:04:06:70:75:62:6c:69:63:a5:56:02:04:29:2f:9f:0c (cont) ...
> 2008-03-14 17:10:24,791 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Received message from /172.17.1.45/161 with 
> length 1250: 
> 30:82:04:de:02:01:01:04:06:70:75:62:6c:69:63:a2:82:04:cf:02:04:29:2f:9f:0c 
> (cont) ...
> 2008-03-14 17:10:24,791 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Looking up pending request with handle 
> PduHandle[690986764]
> 2008-03-14 17:10:24,793 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Cancelling pending request with handle 
> PduHandle[690986764]
> 2008-03-14 17:10:24,794 DEBUG [Timer-0] Log4jLogAdapter.debug - Running 
> pending async request with handle PduHandle[690986764] and retry count 
> left 1
> 2008-03-14 17:10:24,794 DEBUG [Timer-0] Log4jLogAdapter.debug - Sending 
> message to 172.17.1.45/161 with length 101: 
> 30:63:02:01:01:04:06:70:75:62:6c:69:63:a5:56:02:04:29:2f:9f:0c (cont) ...
> 2008-03-14 17:10:24,928 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Received message from /172.17.1.45/161 with 
> length 1250: 
> 30:82:04:de:02:01:01:04:06:70:75:62:6c:69:63:a2:82:04:cf:02:04:29:2f:9f:0c 
> (cont) ...
> 2008-03-14 17:10:24,930 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Looking up pending request with handle 
> PduHandle[690986764]
> 2008-03-14 17:10:24,930 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.debug - Cancelling pending request with handle null
> 2008-03-14 17:10:24,931 ERROR [DefaultUDPTransportMapping_0.0.0.0/0] 
> Log4jLogAdapter.error - OIDs in wrong order. Request IDs processed by 
> TableUtils.TableRequest.onResponse: [690986669, 690986671, 690986675, 
> 690986679, 690986683, 690986687, 690986690, 690986694, 690986698, 
> 690986702, 690986706, 690986710, 690986714, 690986718, 690986722, 
> 690986726, 690986730, 690986734, 690986738, 690986742, 690986746, 
> 690986749, 690986754, 690986757, 690986760, 690986764, 690986764]
> 
> Any help with this issue is truly appreciated.
> 
> Cheers,
> Damian
> 
> --------------------------------------------------
> From: "Frank Fock" <fock at agentpp.com>
> Sent: Friday, March 14, 2008 2:52 PM
> To: "Damian" <damianstaro at hotmail.com>
> Cc: <snmp4j at agentpp.org>
> Subject: Re: [SNMP4J] TableUtils - GetBulk responses - Possible race 
> condition
> 
>> Hi Damian,
>>
>> Please try to reproduce the issue with the latest version 1.9.
>> There were a couple of changes since 1.8.1 which may
>> have fixed the issue already.
>>
>> Best regards,
>> Frank
>>
>> Damian wrote:
>>> First of all, congrats for a great suite of products. I do mostly 
>>> Java SNMP managers and snmp4j has been great so far, I have been 
>>> using it for almost 4 years now. I also bought AgentPro and used it 
>>> to generate simulator code that helps us accelerate our testing 
>>> process and also easily reproduce conditions we experience with SNMP 
>>> agents in the field.
>>>
>>> I suspect I might have found a race condition in the processing of 
>>> GetBulk responses within TableUtils. I am using version 1.8.1.
>>>
>>> I am using timeout values of 200ms, 400ms, 800ms and 1600ms (3 
>>> retries). It seems that for this particular table the Switch is 
>>> taking around 600ms to reply.
>>>
>>> In this scenario, most of the time I get good results but once in a 
>>> while I get a TableEvent.STATUS_WRONG_ORDER error.
>>>
>>> I instrumented the TableUtils.onResponse method and I saw it process 
>>> what seemed to be the same PDU (same request ID) twice, and that 
>>> seemed to trigger the STATUS_WRONG_ORDER.
>>>
>>> I am attaching the SNMP4J logs I obtained as well as a Wireshark 
>>> network capture. The request ID that was processed twice was 1314028685.
>>>
>>> One of the things that puzzled me was that SNMP4J kept sending 
>>> requests with that request ID even after the response was received:
>>>
>>> 2008-03-10 15:28:49,433 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Running pending async request with handle 
>>> PduHandle[1314028685] and retry count left 3
>>> 2008-03-10 15:28:49,634 DEBUG [Timer-0] Log4jLogAdapter.debug - 
>>> Running pending async request with handle PduHandle[1314028685] and 
>>> retry count left 2
>>> 2008-03-10 15:28:50,032 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Looking up pending request with handle 
>>> PduHandle[1314028685]
>>> 2008-03-10 15:28:50,032 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Cancelling pending request with handle 
>>> PduHandle[1314028685]
>>> 2008-03-10 15:28:50,034 DEBUG [Timer-0] Log4jLogAdapter.debug - 
>>> Running pending async request with handle PduHandle[1314028685] and 
>>> retry count left 1
>>> 2008-03-10 15:28:50,163 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Looking up pending request with handle 
>>> PduHandle[1314028685]
>>> 2008-03-10 15:28:50,632 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Looking up pending request with handle 
>>> PduHandle[1314028685]
>>> 2008-03-10 15:28:50,834 DEBUG [Timer-0] Log4jLogAdapter.debug - 
>>> Running pending async request with handle PduHandle[1314028685] and 
>>> retry count left 0
>>> 2008-03-10 15:28:51,623 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Looking up pending request with handle 
>>> PduHandle[1314028685]
>>> 2008-03-10 15:28:52,435 DEBUG [Timer-0] Log4jLogAdapter.debug - 
>>> Request timed out: 1314028685
>>>
>>> The following messages might be related:
>>> 2008-03-10 15:28:50,163 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Cancelling pending request with handle null
>>> 2008-03-10 15:28:50,633 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Cancelling pending request with handle null
>>> 2008-03-10 15:28:51,624 DEBUG [DefaultUDPTransportMapping_0.0.0.0/0] 
>>> Log4jLogAdapter.debug - Cancelling pending request with handle null
>>> 2008-03-10 15:28:52,436 DEBUG [Timer-0] Log4jLogAdapter.debug - 
>>> Cancelling pending request with handle null
>>>
>>> Thanks and please let me know if there's any additional information 
>>> that would help in this case.
>>>
>>> Damian
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> SNMP4J mailing list
>>> SNMP4J at agentpp.org
>>> http://lists.agentpp.org/mailman/listinfo/snmp4j
>>
>> -- 
>> AGENT++
>> http://www.agentpp.com
>> http://www.mibexplorer.com
>> http://www.mibdesigner.com
>>

-- 
AGENT++
http://www.agentpp.com
http://www.mibexplorer.com
http://www.mibdesigner.com



More information about the SNMP4J mailing list