Andres Kroonmaa wrote:
> In fact, to mimic LRU should be easy. Whenever we fetch an object from
> disk to service a client hit, we can append that object to the write
> queue, that way moving all accessed objects to the front of the LRU.
> This would make the FIFO into a real LRU. With small objects it would
> add only a little overhead.
Sounds like you have read my first message on cyclic filesystems, sent
some weeks ago ;-)
Appending a HIT object to the write queue is almost free with respect
to service time. The object is already read in, and we probably need to
write out other objects as well. To avoid having frequently requested
objects take up a disproportionately large portion of the cache, my
idea is to rewrite hits only if they are located more than a certain
(configurable) distance from the head.
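
A minimal sketch of that distance check, assuming a hypothetical layout
where head_offset is the current sequential write position and
obj_offset is where the hit object lives. None of these names exist in
Squid; this is only to illustrate the policy:

/* Hypothetical sketch: decide whether a HIT should be re-queued for
 * writing at the head of the cyclic store.  All names are illustrative,
 * nothing here is existing Squid code. */

typedef unsigned long long cyc_off_t;

struct cyclic_store {
    cyc_off_t head_offset;      /* current sequential write position */
    cyc_off_t store_size;       /* total size of the cyclic area */
    cyc_off_t rewrite_distance; /* the configurable threshold */
};

static int
should_rewrite_hit(const struct cyclic_store *cs, cyc_off_t obj_offset)
{
    /* distance from the object back up to the write head, taken
     * modulo the store size since the store wraps around */
    cyc_off_t distance =
        (cs->head_offset + cs->store_size - obj_offset) % cs->store_size;
    return distance > cs->rewrite_distance;
}

On a HIT the object is already in memory, so when the check succeeds it
is simply appended to the normal write queue; otherwise nothing extra
is written.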
> How are you planning to implement the metadata?
Ok, you haven't read the first message ;-)
I'll try to summarise all current thoughts so far:
* To minimize latency caused by writes, writes should be done
sequentially in larger units (1MB at a time or something similar).
* To be able to guarantee that writes can be done sequentially, a FIFO
type storage policy should be used, possibly with extensions to avoid
throwing away objects we want to keep (probably) and to defragment
unused space (less likely; early estimates put this in the range of
2-5%, and it is non-trivial to implement).
* Also, with a FIFO style replacement policy it is very hard to justify
the overhead incurred by a standard filesystem. It is probably better
to use raw partitions, both for I/O performance and for how much memory
the OS needs for caches/buffers.
* Storage space is maintained in larger chunks (let's say in the range
of 10 minutes of data).
* The disk store index is the physical location. How closely objects
can be packed inside each chunk is determined by the index key size,
and the key size is in turn determined by the amount of memory we can
spend on it. An estimate is that 64 bits will be used, of which 16 are
"cache_dir", but we may get by with 48 bits (40 bits index, 8 bits
"cache_dir"); see the bit-packing sketch after this list. This is an
increase of 4 bytes on 32-bit machines, but as no LRU list is
maintained another 12 bytes (24 on 64-bit machines) are saved by not
having a dlink_node, so memory requirements are actually lower per GB
of disk. On second thought, the dlink_node may be needed to cluster
objects to their storage chunk, so it is perhaps not possible to
eliminate it; the worst case is then 4 bytes more per store entry on
32-bit machines and the best case is 10 bytes less.
* On-disk metadata is kept separately for each chunk, together with the
data. The exception is when an object has to be released without being
replaced by another one; when this happens, a release entry is kept in
a later chunk (the chunk where Squid found that it needed to remove the
object from the cache for some reason). Another exception is when a
(large) object crosses chunk boundaries; here the metadata log entry
points to an earlier chunk. A sketch of a possible log entry format
appears after this list.
* On-disk metadata for a storage chunk is static for the lifetime of
the chunk. Once written it is not touched.
* Storage space is reused one chunk at a time.
* To mimic LRU, reused objects are rewritten at the head if old enough.
A bonus of this is that validated/refreshed objects can get their
headers updated correctly even if the object itself isn't changed.
* Checkpoints are kept as two storage chunk pointers: last fully
completed storage chunk, and the oldest non-reused chunk. These
checkpoints may actually be implemented as timestamps in the chunks
themselves.
* The in-memory index is rebuilt by reading the metadata indexes from
the storage chunks, beginning with the oldest; see the rebuild sketch
after this list.
* One-way elevator optimization is used for I/O, possibly with a
periodic direction change to get even wear on the drive heads. The
reasons why one-way elevator optimization is used instead of two-way
are: 1. it gives a more even service time; 2. pending blocks are likely
very sparse at the turn point, and if we are doing a multi-track seek
anyway then we may just as well seek over the whole drive. A small
sketch of the scheduling decision appears after this list.
* Chaining of object fragments is needed to store larger objects where
we cannot afford to wait for the whole object before storing it on
disk.
* A write queue buffer should be used to order the written objects
somewhat. This avoids storing aborted objects, limits object
fragmenting/interleaving and perhaps clusters related objects; see the
write-queue sketch after this list.
* The current store object format already contains a certain level of
metadata redundancy. There is probably no need to extend this further to
reliably detect when things go wrong.
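
To make the memory arithmetic in the index-key item concrete, here is a
hypothetical packing of the 64-bit variant (16 bits of "cache_dir", 48
bits of physical location). None of these names exist in Squid today;
it is only a sketch of the idea:

#include <stdint.h>

/* Hypothetical 64-bit disk store key: 16 bits select the cache_dir,
 * the remaining 48 bits are the physical location within it. */
typedef uint64_t store_key;

#define KEY_DIR_BITS 16
#define KEY_OFS_BITS 48
#define KEY_OFS_MASK ((UINT64_C(1) << KEY_OFS_BITS) - 1)

static inline store_key
make_key(unsigned dir, uint64_t ofs)
{
    return ((uint64_t)dir << KEY_OFS_BITS) | (ofs & KEY_OFS_MASK);
}

static inline unsigned
key_dir(store_key k)
{
    return (unsigned)(k >> KEY_OFS_BITS);
}

static inline uint64_t
key_ofs(store_key k)
{
    return k & KEY_OFS_MASK;
}

The 40+8-bit variant works the same way with narrower fields; how
densely objects can be packed then follows directly from how many bytes
each index step represents.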
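
The per-chunk on-disk metadata could be a simple append-only log
written together with the chunk's data. A hypothetical entry format
(again just a sketch, not an existing on-disk layout):

#include <stdint.h>

/* Hypothetical per-chunk metadata log entry. */
enum chunk_log_op {
    CHUNK_LOG_ADD = 1,      /* object stored in this chunk */
    CHUNK_LOG_RELEASE = 2,  /* object released without replacement; the
                             * object itself lives in an earlier chunk */
    CHUNK_LOG_CONTINUED = 3 /* fragment of a large object that started
                             * in an earlier chunk */
};

struct chunk_log_entry {
    uint8_t  op;            /* one of chunk_log_op */
    uint8_t  key[16];       /* object cache key */
    uint64_t location;      /* physical location within the cache_dir */
    uint32_t size;          /* object or fragment size in bytes */
    uint32_t timestamp;     /* for age checks during rebuild */
};

Because the log for a chunk is written once with the data and never
modified afterwards (per the item above), rebuilding only needs a
sequential scan of the chunks.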
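
A sketch of how the rebuild could walk the chunks between the two
checkpoint pointers, oldest first. It reuses struct chunk_log_entry
from the previous sketch, and all helper functions are placeholders,
not existing code:

#include <stdint.h>

struct cache_dir;

int      chunk_log_open(struct cache_dir *cd, unsigned chunk);
int      chunk_log_read(struct cache_dir *cd, struct chunk_log_entry *e);
unsigned next_chunk(const struct cache_dir *cd, unsigned chunk);
unsigned oldest_chunk(const struct cache_dir *cd);         /* checkpoint 1 */
unsigned last_completed_chunk(const struct cache_dir *cd); /* checkpoint 2 */
void     index_add(struct cache_dir *cd, const uint8_t *key,
                   uint64_t location, uint32_t size);
void     index_remove(struct cache_dir *cd, const uint8_t *key);

void
rebuild_index(struct cache_dir *cd)
{
    unsigned chunk = oldest_chunk(cd);
    unsigned last = last_completed_chunk(cd);
    struct chunk_log_entry e;

    for (;;) {
        chunk_log_open(cd, chunk);
        while (chunk_log_read(cd, &e)) {
            if (e.op == CHUNK_LOG_ADD)
                index_add(cd, e.key, e.location, e.size);
            else if (e.op == CHUNK_LOG_RELEASE)
                index_remove(cd, e.key);
            /* CHUNK_LOG_CONTINUED: data belongs to an object already
             * indexed from an earlier chunk, nothing more to do */
        }
        if (chunk == last)
            break;
        chunk = next_chunk(cd, chunk); /* wraps around the cyclic store */
    }
}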
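
The one-way elevator can be expressed as a pending queue kept sorted by
block number that is always serviced in ascending order and wraps back
to the lowest pending block when it runs off the end. A hypothetical
sketch:

#include <stdint.h>

/* Hypothetical one-way elevator: pick the next pending request at or
 * above the current head position; if none, wrap to the lowest one. */
struct io_req {
    uint64_t block;          /* starting block of the request */
    struct io_req *next;
};

/* 'pending' is kept sorted by block number (ascending). */
static struct io_req *
next_request(struct io_req *pending, uint64_t head_pos)
{
    struct io_req *r;

    for (r = pending; r; r = r->next)
        if (r->block >= head_pos)
            return r;        /* keep sweeping in one direction */

    return pending;          /* nothing ahead: wrap to the start */
}

Servicing in one direction only bounds each request's wait by a single
full sweep, which is the "more even service time" argument above.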
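
Finally, the write queue could look something like the following:
completed objects are packed into one large sequential write unit, and
aborted ones are dropped before they ever reach the disk. Purely a
sketch with made-up names; objects larger than a unit would go through
the fragment chaining described above:

#include <stddef.h>
#include <string.h>

#define WRITE_UNIT (1024 * 1024)  /* roughly 1 MB per physical write */

struct queued_obj {
    const char *data;
    size_t size;
    int aborted;                  /* set if the client/server aborted */
    struct queued_obj *next;
};

/* Pack as many completed objects as fit into 'buf'; returns the number
 * of bytes packed.  Aborted objects are skipped (a real implementation
 * would also free or recycle the dequeued entries). */
static size_t
pack_write_unit(struct queued_obj **queue, char *buf)
{
    size_t used = 0;

    while (*queue) {
        struct queued_obj *o = *queue;

        if (o->aborted) {
            *queue = o->next;    /* never reaches the disk */
            continue;
        }
        if (used + o->size > WRITE_UNIT)
            break;               /* next object goes into the next unit */
        memcpy(buf + used, o->data, o->size);
        used += o->size;
        *queue = o->next;
    }
    return used;
}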
/Henrik