[SNMP4J] max-bindings with big tables

Steffen Brüntjen Steffen.Bruentjen at macmon.eu
Tue Jul 24 20:53:27 CEST 2018


Hi Frank


> 1. You configured max-rep-count*max-bindings = 6 > max columns (=5). The opposite is recommended.

Does SNMP support restricting the total number of bindings? By "max-bindings" I meant max-num-columns-per-PDU, as I wrote in my first mail:
>>>>> - max-bindings is set to 4 - TableUtils.setMaxNumColumnsPerPDU(int)

Oh, and I didn't configure these values here; I just made up an example that shows the problem. The original results come from max-repetitions=30 and maxNumColumnsPerPDU=4.



> [...] should most likely return at least one column of the first part

The agent doesn't actually cut off variable bindings at the end of the previous row. Again, that was just the example I gave to point out the problem. Still, this WILL happen from time to time by pure coincidence. And it did.


Finally, I'm sure this is /not/ a dense-table-only issue. The same problem, namely that the returned List<TableEvent> contains multiple rows with the same index, can happen with sparse tables.


Anyway, I tried replacing 2.6 with 3.0, but I had to apply some more changes. I was using

    public List<TableEvent> getTable(Target target,
                                     OID[] columnOIDs,
                                     OID lowerBoundIndex,
                                     OID upperBoundIndex)

But I can't set the SparseTableMode there. So I wrote my own TableListener, and with that I can finally confirm that the dense table modes work properly: the returned table is correct. But as mentioned, the problem is not specific to dense or sparse tables. Now that I've written my own TableListener, I think I can solve the issue like so (no external libs required):


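import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import org.snmp4j.smi.OID;
import org.snmp4j.smi.VariableBinding;
import org.snmp4j.util.TableEvent;
import org.snmp4j.util.TableListener;

CountDownLatch latch = new CountDownLatch(1);
// Synchronized because the listener may be called from a response thread
List<TableEvent> table = Collections.synchronizedList(new ArrayList<>());
Map<OID, TableEvent> rows = new ConcurrentHashMap<>();
TableListener myListener = new TableListener() {

	@Override
	public boolean next(TableEvent event) {
		OID index = event.getIndex();
		rows.compute(index, (idx, prevEvent) -> {
			if (prevEvent == null) {
				// First event for this index: keep it as the row
				table.add(event);
				return event;
			}
			// Merge values from the new event into the previous event's
			// variable bindings (assumes getColumns() returns the event's own array)
			VariableBinding[] prevColumns = prevEvent.getColumns();
			VariableBinding[] newColumns = event.getColumns();
			for (int i = 0; i < prevColumns.length && i < newColumns.length; i++) {
				if (prevColumns[i] == null) {
					prevColumns[i] = newColumns[i];
				}
			}
			return prevEvent;
		});

		// Continue the table retrieval
		return true;
	}

	@Override
	public boolean isFinished() {
		// Finished once finished(...) has released the latch
		return latch.getCount() == 0;
	}

	@Override
	public void finished(TableEvent event) {
		latch.countDown();
	}

};

tu.getTable(..., oids, myListener, ...);
latch.await();
return table;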
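
The merge-by-index idea can also be tried without SNMP4J on the classpath. The following is only a minimal sketch under stated assumptions: RowMerger and PartialRow are hypothetical names made up for this illustration, plain strings stand in for OIDs and variable bindings, and the input is the small 5-column example from my previous mail (row 1 arriving in two halves).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowMerger {

    /** Hypothetical stand-in for a TableEvent: a row index plus cells, null where nothing was received. */
    public static final class PartialRow {
        public final String index;
        public final String[] cells;

        public PartialRow(String index, String... cells) {
            this.index = index;
            this.cells = cells;
        }
    }

    /**
     * Merges partial rows that share an index into one row each.
     * Rows are returned in the order in which their index was first seen.
     */
    public static List<String[]> merge(List<PartialRow> events) {
        Map<String, String[]> rows = new LinkedHashMap<>();
        for (PartialRow event : events) {
            String[] merged = rows.get(event.index);
            if (merged == null) {
                // First event for this index: copy it so the input stays untouched
                rows.put(event.index, event.cells.clone());
                continue;
            }
            // Fill only the cells that are still missing in the existing row
            int n = Math.min(merged.length, event.cells.length);
            for (int i = 0; i < n; i++) {
                if (merged[i] == null) {
                    merged[i] = event.cells[i];
                }
            }
        }
        return new ArrayList<>(rows.values());
    }

    public static void main(String[] args) {
        List<PartialRow> events = Arrays.asList(
                new PartialRow("0", "1", "2", "3", "4", "5"),
                new PartialRow("1", null, null, null, "9", "10"),
                new PartialRow("1", "6", "7", "8", null, null));
        for (String[] row : merge(events)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [1, 2, 3, 4, 5]
        // [6, 7, 8, 9, 10]
    }
}
```

Note that the merged row stays at the position where its index first appeared, so the result may need sorting by index if the order matters.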


Best regards and thanks a lot for your help
Steffen Brüntjen


-----Original Message-----
From: Frank Fock [mailto:fock at agentpp.com] 
Sent: Monday, 23 July 2018 19:57
To: Steffen Brüntjen <Steffen.Bruentjen at macmon.eu>
Cc: snmp4j at agentpp.org
Subject: Re: [SNMP4J] max-bindings with big tables

Hi Steffen,

OK, I understand the difference. Nevertheless, the current snapshot already fixes this issue too. 
Although SNMP4J TableUtils could probably handle this kind of scenario smarter, the scenario you describe is very rare:
1. You configured max-rep-count*max-bindings = 6 > max columns (=5). The opposite is recommended. 
2. The agent seems to cut off a whole row (a whole repetition) to return a PDU below maxResponsePDUSize (or MTU). According to the SNMPv2c/v3 standard, only those VBs that actually break the limit should be removed from the response. Thus, in your case the agent should most likely return at least one column of the first part of row “1” instead of returning none.

Have you tried the latest 3.0 SNAPSHOT already?
Both dense table modes: 
* denseTableDoubleCheckIncompleteRows
* denseTableDropIncompleteRows
should return your row “1” in one TableEvent.

Best regards,
Frank


> On 23. Jul 2018, at 19:23, Steffen Brüntjen <Steffen.Bruentjen at macmon.eu> wrote:
> 
> Hi!
> 
>> However the problem you described should not happen with a static (unchanged) table, because of the inner logic of TableUtils.
> 
> I'm sorry, but I believe I still haven't made the problem clear. You wrote that this problem should not appear in tables that don't change, OR that it may appear when the agent doesn't return rows in lexicographic order. The latter case looks just like a row being created or deleted while the table is retrieved. I understand that, and I can't rule out an error in the agent, although I have analyzed all the packets in Wireshark. I was also debugging TableUtils and I still think the bug is there. So let me try to explain it one last time.
> 
> Let's say we have this configuration:
> 
> max-repetition-count = 2
> max-bindings = 3
> requested table columns = 5
> 
> IDX |  A  |  B  |  C  |  D  |  E  |
> ----+-----+-----+-----+-----+-----+
> 0   |  1  |  2  |  3  |  4  |  5  |
> 1   |  6  |  7  |  8  |  9  | 10  |
> 2   | 11  | 12  | 13  | 14  | 15  |
> 3   | 16  | 17  | 18  | 19  | 20  |
> 
> SNMP4J will ask for A, B, C (max-bindings=3)
> DEVICE will return A.0=1, B.0=2, C.0=3 (DEVICE decides not to send a second row because of MTU size)
> SNMP4J will ask for D, E
> DEVICE will return D.0=4, E.0=5, D.1=9, E.1=10 (max-repetition-count = 2)
> 
> And here we're running into the problem. TableUtils "creates" an inner table in this state:
> 
> IDX |  A  |  B  |  C  |  D  |  E  |
> ----+-----+-----+-----+-----+-----+
> 0   |  1  |  2  |  3  |  4  |  5  |
> 1   |null |null |null |  9  | 10  |
> 
> 
> Now we'll continue:
> 
> SNMP4J will ask for A.0, B.0, C.0 (GETNEXT)
> DEVICE will return A.1=6, B.1=7, C.1=8
> 
> What TableUtils now does is:
> 
> IDX |  A  |  B  |  C  |  D  |  E  |
> ----+-----+-----+-----+-----+-----+
> 0   |  1  |  2  |  3  |  4  |  5  |
> 1   |null |null |null |  9  | 10  |
> 1   |  6  |  7  |  8  |null |null |
> ...
> 
> 
> This is the phenomenon I'm actually observing and trying to describe. So once again: the table doesn't change and the problem is 100% reproducible. In my case:
> 
> - 100% of table retrievals with max-bindings != 4 are OK
> - 100% of table retrievals with max-bindings == 4 are broken
> 
> 
> This problem will never appear with max-bindings=1 or max-bindings=infinite, and it will never appear when the agent always sends the exact requested repetitions.
> 
> Best regards
> Steffen Brüntjen
> 
> 
> -----Original Message-----
> From: Frank Fock [mailto:fock at agentpp.com] 
> Sent: Thursday, 19 July 2018 19:35
> To: Steffen Brüntjen <Steffen.Bruentjen at macmon.eu>
> Cc: snmp4j at agentpp.org
> Subject: Re: [SNMP4J] max-bindings with big tables
> 
> Hi Steffen 
> I think I understood your description correctly from the beginning. However the problem you described should not happen with a static (unchanged) table, because of the inner logic of TableUtils. 
> I assume, that the agent does not return the rows in lexicographic order. That would have the same effect as if a row is dynamically appearing during retrieval. 
> 
> I cannot rule out an off-by-one error in TableUtils, but none of the unit tests I have run so far indicates one.
> 
> What agent are you using?
> 
> Nevertheless, the new version will not show the issue you observed with the mode denseTableDoubleCheckIncompleteRows
> 
> Best regards 
> Frank
> 
>> On 19.07.2018 at 17:20, Steffen Brüntjen <Steffen.Bruentjen at macmon.eu> wrote:
>> 
>> Hi Frank
>> 
>> 
>> I'm not sure whether we're talking about the same thing. The problem I described is *not* a timing problem with rows being added to or removed from the table while retrieving rows. The table I am querying doesn't change at all and the problem is highly reproducible. Let's look at the example again:
>> 
>> 
>> This is what the List<TableEvent> result should look like, and what it actually does look like - always - when max-bindings is set to 1 or 32 or some other value.
>> 
>> [ ... 75 normal rows ... ]
>> [1.3.6.1.2.1.31.1.1.1.1.278 = VLAN105, [...], 1.3.6.1.2.1.31.1.1.1.18.278 = service]
>> [1.3.6.1.2.1.31.1.1.1.1.279 = VLAN106, [...], 1.3.6.1.2.1.31.1.1.1.18.279 = reception]
>> [1.3.6.1.2.1.31.1.1.1.1.283 = VLAN110, [...], 1.3.6.1.2.1.31.1.1.1.18.283 = voice]
>> [1.3.6.1.2.1.31.1.1.1.1.373 = VLAN200, [...], 1.3.6.1.2.1.31.1.1.1.18.373 = clients]
>> [1.3.6.1.2.1.31.1.1.1.1.774 = VLAN601, [...], 1.3.6.1.2.1.31.1.1.1.18.774 = VLAN601]
>> [1.3.6.1.2.1.31.1.1.1.1.783 = VLAN610, [...], 1.3.6.1.2.1.31.1.1.1.18.783 = lab6]
>> [ ... everything normal ... ]
>> 
>> 
>> When setting the max-bindings to 4 (I'm requesting 7 columns), I - always - get these TableEvents:
>> 
>> [ ... 75 normal rows ... ]
>> [1.3.6.1.2.1.31.1.1.1.1.278 = VLAN105, [...], 1.3.6.1.2.1.31.1.1.1.18.278 = service] 
>> [1.3.6.1.2.1.31.1.1.1.1.279 = VLAN106, [...], 1.3.6.1.2.1.31.1.1.1.18.279 = reception]
>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.283 = 2, 1.3.6.1.2.1.31.1.1.1.15.283 = 0, 1.3.6.1.2.1.31.1.1.1.18.283 = voice]
>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.373 = 2, 1.3.6.1.2.1.31.1.1.1.15.373 = 0, 1.3.6.1.2.1.31.1.1.1.18.373 = clients]
>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.774 = 2, 1.3.6.1.2.1.31.1.1.1.15.774 = 0, 1.3.6.1.2.1.31.1.1.1.18.774 = VLAN601]
>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.783 = 2, 1.3.6.1.2.1.31.1.1.1.15.783 = 0, 1.3.6.1.2.1.31.1.1.1.18.783 = lab6]
>> [1.3.6.1.2.1.31.1.1.1.1.283 = VLAN110, 1.3.6.1.2.1.31.1.1.1.17.283 = 2, 1.3.6.1.2.1.31.1.1.1.6.283 = 0, 1.3.6.1.2.1.31.1.1.1.10.283 = 0, null, null, null]
>> [1.3.6.1.2.1.31.1.1.1.1.373 = VLAN200, 1.3.6.1.2.1.31.1.1.1.17.373 = 2, 1.3.6.1.2.1.31.1.1.1.6.373 = 0, 1.3.6.1.2.1.31.1.1.1.10.373 = 0, null, null, null]
>> [1.3.6.1.2.1.31.1.1.1.1.774 = VLAN601, 1.3.6.1.2.1.31.1.1.1.17.774 = 2, 1.3.6.1.2.1.31.1.1.1.6.774 = 0, 1.3.6.1.2.1.31.1.1.1.10.774 = 0, null, null, null]
>> [1.3.6.1.2.1.31.1.1.1.1.783 = VLAN610, 1.3.6.1.2.1.31.1.1.1.17.783 = 2, 1.3.6.1.2.1.31.1.1.1.6.783 = 0, 1.3.6.1.2.1.31.1.1.1.10.783 = 0, null, null, null]
>> [ ... everything normal ... ]
>> 
>> 
>> The returned List<TableEvent> contains 4 more results, because 4 table rows are split into two TableEvents. We can see that these indexes seem to have two rows:
>> index=283
>> index=373
>> index=774
>> index=783
>> 
>> 
>> It's like this table
>> 
>> 
>> IDX |  A  |  B  |  C  |  D
>> ----+-----+-----+-----+-----
>> 0   |  1  |  2  |  3  |  4
>> 1   |  5  |  6  |  7  |  8
>> 2   |  9  | 10  | 11  | 12
>> 3   | 13  | 14  | 15  | 16
>> 
>> 
>> becomes something like this when obtained by TableUtils:
>> 
>> IDX |  A  |  B  |  C  |  D
>> ----+-----+-----+-----+-----
>> 0   |  1  |  2  |  3  |  4
>> 1   | null| null|  7  |  8        <-- index=1
>> 2   | null| null| 11  | 12        <-- index=2
>> 1   |  5  |  6  | null| null      <-- index=1
>> 2   |  9  | 10  | null| null      <-- index=2
>> 3   | 13  | 14  | 15  | 16
>> 
>> 
>> I tried to describe the reason for this, but I admit it's a bit complicated. Of course it's also possible that I didn't understand your answer correctly. Sorry for the confusion in that case. Then I'd like to understand how sparse and dense tables cause this problem.
>> 
>> Thanks for the clarification on tooBig errors with GETBULK requests!
>> 
>> 
>> Best regards
>> Steffen Brüntjen
>> 
>> 
>> 
>> -----Original Message-----
>> From: Frank Fock [mailto:fock at agentpp.com] 
>> Sent: Thursday, 12 July 2018 08:41
>> To: Steffen Brüntjen <Steffen.Bruentjen at macmon.eu>
>> Cc: snmp4j at agentpp.org
>> Subject: Re: [SNMP4J] max-bindings with big tables
>> 
>> Hi Steffen,
>> 
>> If the agent sends a tooBig error on a GETBULK request, then this is an error in the agent. See RFC3416 4.2.3:
>> 
>>   If the size of the message encapsulating the Response-PDU
>>        containing the requested number of variable bindings would be
>>        greater than either a local constraint or the maximum message
>>        size of the originator, then the response is generated with a
>>        lesser number of variable bindings.  This lesser number is the
>>        ordered set of variable bindings with some of the variable
>>        bindings at the end of the set removed, such that the size of
>>        the message encapsulating the Response-PDU is approximately
>>        equal to but no greater than either a local constraint or the
>>        maximum message size of the originator.  Note that the number
>>        of variable bindings removed has no relationship to the values
>>        of N, M, or R.
>> 
>> For the issue you reported, there is no general solution, because it interferes with sparse tables. 
>> A solution would either decrease the performance for sparse tables or will filter out sparse rows. 
>> The latter is not acceptable for intentionally sparse tables. 
>> For dense tables, the filtering could be the best option, even though it would hide new rows that the command generator has already detected.
>> 
>> I am currently about to add an option for getDenseTable to activate a filtering for new rows that appear during the table retrieval and are therefore incompletely received. Would that help you?
>> 
>> Best regards,
>> Frank 
>> 
>>> On 9. Jul 2018, at 19:45, Steffen Brüntjen <Steffen.Bruentjen at macmon.eu> wrote:
>>> 
>>> Hi Frank
>>> 
>>> Thank you for having a look at it. I agree, the performance with many bindings is indeed *much* higher and yes, values should be retrieved row-by-row in order to avoid data inconsistencies. But there are also problems with many bindings:
>>> 
>>> 1. Since the agent cannot decide how many values to send per row (in contrast to max-repetition-count), the packet size might get too big if you have a table with many (big) columns.
>>> 
>>> 2. There are agents that get into trouble when many columns are requested. This often results in timeouts (no tooBig error), and then there's no other option than requesting fewer bindings.
>>> 
>>> Maybe the proposed change is the way to go, it's decent, but effective (I believe).
>>> 
>>> Best regards
>>> Steffen 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Frank Fock [mailto:fock at agentpp.com] 
>>> Sent: Friday, 6 July 2018 18:55
>>> To: Steffen Brüntjen <Steffen.Bruentjen at macmon.eu>
>>> Cc: snmp4j at agentpp.org
>>> Subject: Re: [SNMP4J] max-bindings with big tables
>>> 
>>> Hi Steffen,
>>> I will try to reproduce this issue. 
>>> Regardless of the result, the parameters for TableUtils are not suitable for your setup. The maxNumColumnsPerPDU has to be as large as possible. Otherwise the overall performance will be bad and the likelihood of incomplete table rows increases significantly (through changes in the agent while TableUtils operates).
>>> Best regards 
>>> Frank
>>> 
>>>> On 06.07.2018 at 10:20, Steffen Brüntjen <Steffen.Bruentjen at macmon.eu> wrote:
>>>> 
>>>> Hi!
>>>> 
>>>> I'm using SNMP4J version 2.6.2.
>>>> 
>>>> Best regards
>>>> Steffen
>>>> 
>>>> -----Original Message-----
>>>> From: Frank Fock [mailto:fock at agentpp.com] 
>>>> Sent: Thursday, 5 July 2018 19:37
>>>> To: Steffen Brüntjen <Steffen.Bruentjen at macmon.eu>
>>>> Cc: snmp4j at agentpp.org
>>>> Subject: Re: [SNMP4J] max-bindings with big tables
>>>> 
>>>> Hi Steffen 
>>>> What SNMP4J version are you using?
>>>> Best regards 
>>>> Frank
>>>> 
>>>>> On 05.07.2018 at 17:04, Steffen Brüntjen <Steffen.Bruentjen at macmon.eu> wrote:
>>>>> 
>>>>> Hi Frank
>>>>> 
>>>>> I believe I found an issue in the TableUtils class. In certain scenarios, the returned List<TableEvent> from getTable(Target target, OID[] columnOIDs, OID lowerBoundIndex, OID upperBoundIndex) will contain incomplete and duplicate rows.
>>>>> 
>>>>> 
>>>>> Here's an extract of an exemplary List<TableEvent> for a "good" result:
>>>>> 
>>>>> [1.3.6.1.2.1.31.1.1.1.1.278 = VLAN105, [...], 1.3.6.1.2.1.31.1.1.1.18.278 = service]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.279 = VLAN106, [...], 1.3.6.1.2.1.31.1.1.1.18.279 = reception]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.283 = VLAN110, [...], 1.3.6.1.2.1.31.1.1.1.18.283 = voice]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.373 = VLAN200, [...], 1.3.6.1.2.1.31.1.1.1.18.373 = clients]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.774 = VLAN601, [...], 1.3.6.1.2.1.31.1.1.1.18.774 = VLAN601]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.783 = VLAN610, [...], 1.3.6.1.2.1.31.1.1.1.18.783 = lab6]
>>>>> 
>>>>> 
>>>>> But in some specific circumstances, I get results like these:
>>>>> 
>>>>> [ ... 75 normal rows ... ]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.278 = VLAN105, [...], 1.3.6.1.2.1.31.1.1.1.18.278 = service] 
>>>>> [1.3.6.1.2.1.31.1.1.1.1.279 = VLAN106, [...], 1.3.6.1.2.1.31.1.1.1.18.279 = reception]
>>>>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.283 = 2, 1.3.6.1.2.1.31.1.1.1.15.283 = 0, 1.3.6.1.2.1.31.1.1.1.18.283 = voice]
>>>>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.373 = 2, 1.3.6.1.2.1.31.1.1.1.15.373 = 0, 1.3.6.1.2.1.31.1.1.1.18.373 = clients]
>>>>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.774 = 2, 1.3.6.1.2.1.31.1.1.1.15.774 = 0, 1.3.6.1.2.1.31.1.1.1.18.774 = VLAN601]
>>>>> [null, null, null, null, 1.3.6.1.2.1.31.1.1.1.14.783 = 2, 1.3.6.1.2.1.31.1.1.1.15.783 = 0, 1.3.6.1.2.1.31.1.1.1.18.783 = lab6]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.283 = VLAN110, 1.3.6.1.2.1.31.1.1.1.17.283 = 2, 1.3.6.1.2.1.31.1.1.1.6.283 = 0, 1.3.6.1.2.1.31.1.1.1.10.283 = 0, null, null, null]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.373 = VLAN200, 1.3.6.1.2.1.31.1.1.1.17.373 = 2, 1.3.6.1.2.1.31.1.1.1.6.373 = 0, 1.3.6.1.2.1.31.1.1.1.10.373 = 0, null, null, null]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.774 = VLAN601, 1.3.6.1.2.1.31.1.1.1.17.774 = 2, 1.3.6.1.2.1.31.1.1.1.6.774 = 0, 1.3.6.1.2.1.31.1.1.1.10.774 = 0, null, null, null]
>>>>> [1.3.6.1.2.1.31.1.1.1.1.783 = VLAN610, 1.3.6.1.2.1.31.1.1.1.17.783 = 2, 1.3.6.1.2.1.31.1.1.1.6.783 = 0, 1.3.6.1.2.1.31.1.1.1.10.783 = 0, null, null, null]
>>>>> [ ... everything normal ... ]
>>>>> 
>>>>> 
>>>>> Here we find some rows split into two: One block with the first 4 columns set null, and another block with the last 3 columns set null.
>>>>> 
>>>>> 
>>>>> Here's the setting which produces the second result:
>>>>> 
>>>>> - max-bindings is set to 4 - TableUtils.setMaxNumColumnsPerPDU(int)
>>>>> - max-repetitions is set to 30 - TableUtils.setMaxNumRowsPerPDU(int)
>>>>> - the device returns many rows (like 120)
>>>>> - the table request contains more columns than max-bindings
>>>>> - the number of requested columns is not a multiple of max-bindings
>>>>> - the problem will also depend on MTU size, but that's not important here
>>>>> 
>>>>> 
>>>>> This is what happens:
>>>>> 
>>>>> 1. TableUtils will request the first 4 columns
>>>>> 2. device returns 60 variable bindings, that's 15 cells per column
>>>>> 3. TableUtils will request the latter 3 columns
>>>>> 4. device returns 60 variable bindings, that's 20 cells per column
>>>>> 
>>>>> This repeats until all bindings are retrieved. So far, so good. The problem is that every second request (step 3) receives more rows, so these requests reach index 283 (as in the example above) earlier. I did some debugging and I think I found the reason: When the first results with index 283 are received (step 3), TableUtils creates a row for this index. That row is padded with null values for the first 4 columns so that its size equals 7 (and not 3). Having size=7, the row is considered finished too soon. TableUtils then prunes these incomplete but finished rows from rowCache. When TableUtils receives the other 4 columns for row 283, it creates a new row with the same index.
>>>>> 
>>>>> 
>>>>> How to fix?
>>>>> 
>>>>> I believe a moderately easy, but not very good, way to fix this is to have the smaller chunk contain the first 3 columns instead of the last 3 columns:
>>>>> 
>>>>> max-bindings = 4
>>>>> columns: .1, .2, .3, .4, .5, .6, .7
>>>>> 1. packet should contain: .1, .2, and .3
>>>>> 2. packet should contain: .4, .5, .6, and .7
>>>>> 
>>>>> Number of columns for the first packet is NumColumnsTotal % maxBindings.
>>>>> Number of columns for the other packets is maxBindings.
>>>>> 
>>>>> 
>>>>> Please tell me if you need more information or if my method invocation is wrong.
>>>>> 
>>>>> 
>>>>> Best regards
>>>>> Steffen Brüntjen
>>>>> _______________________________________________
>>>>> SNMP4J mailing list
>>>>> SNMP4J at agentpp.org
>>>>> https://oosnmp.net/mailman/listinfo/snmp4j
>> 
> 
