[SNMP4J] RaceCondition in AgentX Communication
Frank Fock
fock at agentpp.com
Tue Mar 15 21:52:10 CET 2011
Hi,
I will try to improve the reconnection behavior
of SNMP4J-AgentX in this regard. However,
the root cause is the bug in the NET-SNMP
master agent. The AgentX RFC of course, does
not define any order of requests and
responses. Since the AgentX master agent
has a couple of bugs for years now,
I always recommend using SNMP4J-AgentX
as master agent and use the NET-SNMP
agent as subagent.
Best regards,
Frank
On 15.03.2011 12:14, ZAUBZER Stephan wrote:
> Hi,
>
> I have a Problem with the stability of the connection from an AgentXSubAgent
> to the MasterAgent. The SubAgent is used in a System running 24/7 (RHEL
> kernel 2.6.18, NET-SNMP version: 5.3.2.2). Upon disconnection (detected via a
> TransportStateListener) a reconnection is Performed. However due to a race
> condition within snmp4j the connection is not properly re-established leaving
> the system with 2 connections (one with an established AgentX session and one
> "plain TCP connection"). However the connection with the established AgentX
> Session is not used, because the DefaultTcpTransportMapping only memorizes
> the "last" connection to a certain address.
>
> Details:
>
> The following sequence of events can be observed from the log files (snmp4j
> logging has been activated and set to DEBUG level for analyzing this issue):
>
> At some point in time the "request=>response" pattern gets "out of sync" and
> we experience a pattern in the form "request1, request2 => response1,
> response2". Right after an occurrence of such a pattern the connection seems
> to be dropped by the peer. While re-establishing the connection, snmp4j is
> running into race-condition preventing a proper re-connection.
> For the sake of readability the following log output is manually filtered to
> contain only the relevant pieces. Lines beginning with "///" describe the
> relevant pieces from the removed log output.
>
> 02:51:04,807 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
> /// Reading message containing GetPDU with transactionID=13485
> 02:51:04,807 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
> /// Reading message containing GetPDU with transactionID=13487
> 02:51:04,811 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
> /// Sending message containing response with transactionID=13485
> 02:51:04,811 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
> /// Reading message containing GetPDU with transactionID=13489
> 02:51:04,812 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
> /// Obviously trying to send response for transactionID=13487
> 02:51:04,812 WARN [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: java.io.IOException: Broken
> pipe
> /// The broken pipe oviously happened when trying to send data.
>
> /// Now the reconnector is kicking in
> 02:51:04,813 DEBUG [SubAgentXImpl periodic reconnector]
> org.snmp4j.transport.DefaultTcpTransportMapping: Socket for address
> '127.0.0.1/705' is closed, opening it...
> /// Obviously at the same time snmp4j tries to send the last pending
> outgoing message (response for transactionID=13489)
> 02:51:04,813 DEBUG [AgentXSubAgent.0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Socket for address
> '127.0.0.1/705' is closed, opening it...
> ///Then the connection for the first opened socket succeeds (local port
> 56565)
> 02:51:04,813 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is connectable
> ///Requesting to write AgentX-Open-PDU
> 02:51:04,813 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
> ///Write AgentX-Open-PDU
> 02:51:04,814 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
> ///Reading response => AgentX Session opened successfully.
> /// Then the connection for the second opened socket succeeds (local port
> 56566)
> 02:51:04,814 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
> org.snmp4j.transport.DefaultTcpTransportMapping: Key is connectable
> ///This socket now overwrites the first socket (on which the successfully
> opened session is relying) within the socket map in
> DefaultTcpTransportMapping
> ///Now the last pending outgoing message from the previous session is sent
> on the second opened socket
> ///As well as the first AgentX-Register-PDU for the new session (again via
> the second opened socket).
>
> ///The following log entry clearly shows that snmp4j is using the second
> opened socket (port 56566) for registering instead of the one on which the
> session was opened (port 56565)
> 02:51:04,815 DEBUG [SubAgentXImpl periodic reconnector]
> org.snmp4j.transport.DefaultTcpTransportMapping: Looking up connection for
> destination '127.0.0.1/705' returned:
> SocketEntry[peerAddress=127.0.0.1/705,socket=Socket[addr=/127.0.0.1,port=705,
> localport=56566],lastUse=Tue Mar 15 02:51:04 UTC 2011]
>
> From there on all outgoing messages (for registering MOs) are sent via the
> second socket where NO Session has been established. Therefore the
> MasterAgent only responds with 257 (no session open).
>
> So we have several problems here:
> - A response PDU for a "broken session" opens a new socket, that will
> "overwrite" the socket of a newly created session.
> - A session does not have an associated socket but only an associated
> address.
> - The TransportMapping does not know anything about the session to which a
> packet belongs and therefore should NOT be allowed to choose the socket on
> which to send.
>
> I clearly see, that it will definitely not be a minor patch to fix these
> problems resulting in the race condition described above. However I am now
> trying to defer the reconnection process by 5 seconds. This is however only a
> workaround. This will allow any pending outgoing messages to be sent
> (possibly opening a new socket). After 5 seconds all pending tasks should be
> finished and a new socket connection may already be established which will
> then in turn be used to open the new session.
>
> Regarding the loss of the connection: EACH connection loss follows the same
> pattern. The "request1 => response1, request2 => response2" pattern gets out
> of sync ("request1, request2 => response1, response2") and then obviously the
> MasterAgent closes the connection. This however (without having read the
> AgentX RFC) appears to me as a valid scenario that is not correctly handled
> by the MasterAgent. Does anybody of you happen to know if such a sequence of
> requests and responses is permissible in the AgentX Protocol?
> - If yes, then the connection loss is obviously triggered by a bug in
> NET-SNMP.
> - If not, then snmp4j MUST not read a message from the socket before the
> response for the last read message is still pending.
>
> Best regards
> Stephan
>
> _______________________________________________
> SNMP4J mailing list
> SNMP4J at agentpp.org
> http://lists.agentpp.org/mailman/listinfo/snmp4j
--
AGENT++
http://www.agentpp.com
http://www.snmp4j.com
http://www.mibexplorer.com
http://www.mibdesigner.com
More information about the SNMP4J
mailing list