Re: [code] [for discussion] map-trie from Robert Collins on 2014-06-14 (squid-dev)

From: Robert Collins <robertc_at_robertcollins.net>
Date: Sat, 14 Jun 2014 19:38:03 +1200

FWIW my intent with libTrie was always to head for an LPC Trie
implementation as the only implementation. This should be faster still
than the current naive trie, and faster than the std::map compression
technique you've implemented.

[ IIRC the LPC paper was

Implementing a Dynamic Compressed Trie from Stefan Nilsson, Matti
Tikkanen ]- I have the pdf floating around here somewhere, but a quick
google just hit paywalls these days :(.

-Rob

On 12 June 2014 10:43, Kinkie <gkinkie_at_gmail.com> wrote:
> Hi,
> I've done some benchmarking, here are the results so far:
> The proposal I'm suggesting for dstdomain acl is at
> lp:~kinkie/squid/flexitrie . It uses the level-compact trie approach
> I've described in this thread (NOT a Patricia trie). As a comparison,
> lp:~kinkie/squid/domaindata-benchmark implements the same benchmark
> using the current splay-based implementation.
>
> I've implemented a quick-n-dirty benchmarking tool
> (src/acl/testDomainDataPerf); it takes as input an acl definition -
> one dstdomain per line, as if it was included in squid.conf, and a
> hostname list file (one destination hostname per line).
>
> I've run both variants of the code against the same dataset: a 4k
> entries domain list, containing both hostnames and domain names, and a
> 18M entries list of destination hostnames, both matching and not
> matching entries in the domain list (about 7.5M hits, 10.5M misses).
>
> Tested 10 times on a Core 2 PC with plenty of RAM - source datasets
> are in the fs cache.
> level-compact-trie: the mean time is 11 sec; all runs take between
> 10.782 and 11.354 secs; 18 Mb of core used
> full-trie: mean is 7.5 secs +- 0.2secs; 85 Mb of core.
> splay-based: mean time is 16.3sec; all runs take between 16.193 and
> 16.427 secs; 14 Mb of core
>
> I expect compact-trie to scale better as the number of entries in the
> list grows and with the number of clients and requests per second;
> furthermore using it removes 50-100 LOC, and makes code more readable.
>
> IMO it is the best compromise in terms of performance, resources
> useage and expected scalability; before pursuing this further however
> I'd like to have some feedback.
>
> Thanks
>
>
> On Wed, Jun 4, 2014 at 4:11 PM, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
>> On 06/04/2014 02:06 AM, Kinkie wrote:
>>
>>> there are use cases
>>> for using a Trie (e.g. prefix matching for dstdomain ACL); these may
>>> be served by other data strcutures, but none we could come up with on
>>> the spot.
>>
>> std::map and StringIdentifier are the ones we came up with on the spot.
>> Both may require some adjustments to address the use case you are after.
>>
>> Alex.
>>
>>
>>> A standard trie uses quite a lot of RAM for those use cases.
>>> There are quite a lot of areas where we can improve so this one is not
>>> urgent. Still I'd like to explore it as it's synchronous code (thus
>>> easier for me to follow) and it's a nice area to tinker with.
>>>
>>> On Tue, Jun 3, 2014 at 10:12 PM, Alex Rousskov
>>> <rousskov_at_measurement-factory.com> wrote:
>>>> On 06/03/2014 08:40 AM, Kinkie wrote:
>>>>> Hi all,
>>>>> as an experiment and to encourage some discussion I prepared an
>>>>> alternate implementation of TrieNode which uses a std::map instead of
>>>>> an array to store a node's children.
>>>>>
>>>>> The expected result is a worst case performance degradation on insert
>>>>> and delete from O(N) to O(N log R) where N is the length of the
>>>>> c-string being looked up, and R is the size of the alphabet (as R =
>>>>> 256, we're talking about 8x worse).
>>>>>
>>>>> The expected benefit is a noticeable reduction in terms of memory use,
>>>>> especially for sparse key-spaces; it'd be useful e.g. in some lookup
>>>>> cases.
>>>>>
>>>>> Comments?
>>>>
>>>>
>>>> To evaluate these optimizations, we need to know the targeted use cases.
>>>> Amos mentioned ESI as the primary Trie user. I am not familiar with ESI
>>>> specifics (and would be surprised to learn you want to optimize ESI!),
>>>> but would recommend investigating a different approach if your goal is
>>>> to optimize search/identification of strings from a known-in-advance set.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Alex.
>>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Francesco

-- 
Robert Collins <rbtcollins_at_hp.com>
Distinguished Technologist
HP Converged Cloud

Received on Sat Jun 14 2014 - 07:38:11 MDT

This archive was generated by hypermail 2.2.0 : Sat Jun 14 2014 - 12:00:11 MDT