[SNMP4J] SNMP4J, UDP, network buffers and EPERM

Tue Oct 14 12:50:19 CEST 2014

TL;DR: Does SNMP4J provide "transmit pacing" for UDP?  Does it handle
Linux's EPERM error when network buffers overflow? (By "handle" I mean
something beyond throwing an exception and failing.)

I'm going to start with some background.  We have a large, complex (overly
so) system that monitors some "stuff" using SNMP4J.  It generally works.  I
have an integration test suite that drives our system.  We use a dedicated
machine with dozens of IP addresses that is programmed to respond to SNMP
requests in the way the integration test suite expects.  For the
purposes of this conversation, there's one specific test class with 11
tests, and when I run it against our production code all of its tests
pass very consistently.  Our complex system depends on Mule and JMS.  Most
if not all SNMP requests are sent through JMS to an
SnmpExecutor service.  That Mule service in turn calls a Java class
(SnmpQueryExecutor) to synchronously resolve the request for SNMP data (the
SNMP request is asynchronous, but the Java code has its own wait for the
answer or timeout) before proceeding.  The JMS client blocks waiting for
the SNMP request to complete (or time out) before continuing.
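
For anyone unfamiliar with the pattern, "asynchronous request plus its own
wait" looks roughly like the sketch below.  This is an illustration with
made-up names (BlockingQuery, await, asyncSend), not the actual
SnmpQueryExecutor code and not an SNMP4J API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical helper: turns a callback-style request into a blocking call. */
final class BlockingQuery {
    private BlockingQuery() {}

    /**
     * asyncSend stands in for any API that accepts a response callback.
     * Returns the response, or null if none arrives within timeoutMillis.
     */
    static <T> T await(Consumer<Consumer<T>> asyncSend, long timeoutMillis)
            throws InterruptedException {
        CompletableFuture<T> result = new CompletableFuture<>();
        asyncSend.accept(result::complete);   // fire the request; the callback completes the future
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;                      // caller treats null as "timed out"
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

The point is only that the caller blocks for at most the timeout, so anything
that slows the caller down (such as JMS and Mule) also spaces out the requests.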

In an attempt to simplify our complex system, I refactored some
execution paths to call the Java class directly, bypassing the JMS and
Mule part.  The integration tests now fail intermittently.  There are about
4 tests that sometimes fail.  The nature of the failures is also not
consistent.  Initially, one of them would fail on most runs (4 out of 5).
After I reduced the code paths that use the new code to exactly one,
this dropped to about 1 failure every 3 or 4 runs.  I've learned that I
can make the test pass reliably by adding a 1-second delay to the new
code path.  This change was for investigative purposes only.  It suggests
that there is a race condition or some type of failure related to
doing too much SNMP too fast, and since Mule and JMS add a fair amount of
overhead, it has probably been masked for some time.

After a lot of work, I was able to discover that some of the failures are
caused by this exception:
java.io.IOException: Operation not permitted
    at java.net.PlainDatagramSocketImpl.send(Native Method)
    at java.net.DatagramSocket.send(DatagramSocket.java:676)
    at org.snmp4j.transport.DefaultUdpTransportMapping.sendMessage(DefaultUdpTransportMapping.java:117)
    at org.snmp4j.transport.DefaultUdpTransportMapping.sendMessage(DefaultUdpTransportMapping.java:42)
    at org.snmp4j.MessageDispatcherImpl.sendMessage(MessageDispatcherImpl.java:198)
    at org.snmp4j.MessageDispatcherImpl.sendPdu(MessageDispatcherImpl.java:498)
    at org.snmp4j.util.MultiThreadedMessageDispatcher.sendPdu(MultiThreadedMessageDispatcher.java:127)
    at org.snmp4j.Snmp.sendMessage(Snmp.java:1004)
    at org.snmp4j.Snmp.send(Snmp.java:974)
    at org.snmp4j.Snmp.send(Snmp.java:958)
    ....

While digging for information about this, I found this thread
https://github.com/typesafehub/play-plugins/issues/64 which suggests at the
end that this error happens on Linux systems when the network buffers get
full. (Yes, I'm developing on a Linux system).  Digging for more
information about that, I found this thread
http://compgroups.net/comp.protocols.tcp-ip/udp-socket-sendto-eperm/2624182,
where they talk about UDP and "transmit pacing".  Essentially they say it's
the programmer's responsibility not to send UDP packets too fast.  Seems
reasonable.
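
To make "transmit pacing" concrete, here is a minimal sketch of what I
understand by it: enforce a minimum gap between consecutive sends on a
socket.  SendPacer and its methods are hypothetical names of my own, not
SNMP4J classes, and the gap value is an assumption that would need tuning.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not an SNMP4J class): spaces out sends by a minimum gap. */
final class SendPacer {
    private final long minGapNanos;
    private long nextSendNanos = System.nanoTime(); // earliest moment the next send may start

    SendPacer(long minGapMillis) {
        this.minGapNanos = TimeUnit.MILLISECONDS.toNanos(minGapMillis);
    }

    /** Blocks until the gap since the previous send has elapsed.
     *  Synchronized so that parallel sender threads share one pace. */
    synchronized void awaitTurn() throws InterruptedException {
        long waitNanos = nextSendNanos - System.nanoTime();
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
        nextSendNanos = System.nanoTime() + minGapNanos;
    }

    /** Waits for the pacing gap, then hands the packet to the socket. */
    void send(DatagramSocket socket, DatagramPacket packet)
            throws IOException, InterruptedException {
        awaitTurn();
        socket.send(packet);
    }
}
```

Something like new SendPacer(5).send(socket, packet) would then cap the rate
at roughly 200 packets per second, regardless of how many threads are sending.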

So, my assumption is that the UDP transport in SNMP4J should be doing this.
I did look through some of the code, and while I did not see anything that
would do this, that doesn't mean it's not there.  Is it?  Is the IOException
(EPERM) preventing "transmit pacing", or even normal UDP retries, from working?
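
By "handling" the EPERM case I mean something along these lines: catch the
IOException from the send, back off briefly, and try again a bounded number
of times.  This is only a sketch with hypothetical names (IoAction,
RetryingSender); I'm not claiming SNMP4J does or should do exactly this.

```java
import java.io.IOException;

/** Hypothetical interface for one send attempt, e.g. () -> socket.send(packet). */
interface IoAction {
    void run() throws IOException;
}

/** Hypothetical helper (not an SNMP4J class). */
final class RetryingSender {
    private RetryingSender() {}

    /**
     * Runs the action, retrying on IOException with a doubling delay.
     * Returns the number of attempts used; rethrows the last IOException
     * once maxAttempts is exhausted.
     */
    static int sendWithRetry(IoAction action, int maxAttempts, long initialDelayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMillis = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;              // give up and surface the original error
                }
                Thread.sleep(delayMillis); // give the kernel buffers time to drain
                delayMillis *= 2;
            }
        }
    }
}
```

A real implementation would probably want to retry only on the specific
buffer-full failure instead of every IOException, but Java does not expose
the errno directly, so the sketch retries on all of them.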

More information:
One test that fails (and the most SNMP-active one, I think) makes 5 SNMP
requests to 3 different IP addresses.  The requests are all simple GET
operations, totalling 8 OIDs.  Some of these requests are made in
parallel by different threads.

I'm very open to any other suggestions people want to make as to why this
change would cause this behavior.  All help appreciated.

David Corbin