Hello,
The attached patch contains some of the Store changes we made while
working on adding SMP-shared support to cache_dirs. These changes were
made while chasing and fixing a particular set of bugs introduced and
exposed by shared Rock Store cache_dirs.
I wish the changes were better isolated, but many are inter-dependent and
most were discovered/developed at the same time while trying to make the
code work. Fully isolating all of them will not be possible and
isolating many will take a long time.
If you care about Store and have a chance, please review the change log
below and the attached patch. I expect the final set of changes to be
significantly larger so it would be nice to get feedback about this
smaller and relatively compact set first, before it gets lost in the noise.
I hand-removed most changes specific to merging SMP Rock Store support
into trunk. If you want to look at that work-in-progress, please see my
squid/3p2-rock branch on Launchpad. I hope to submit the whole thing for
your review in a few weeks.
--------- change log ---------------
Changes revolving around making Store work with SMP-shared max-size
cache_dirs:
* Added MemObject::expectedReplySize() and used it instead of object_sz.
When deciding whether an object with a known content length can be
swapped out, do not wait until the object is completely received and its
size (mem_obj->object_sz) becomes known (while asking the store to
recheck in vain with every incoming chunk). Instead, use the known
content length, if any, to make the decision.
This optimizes the common case where the complete object is eventually
received and swapped out, preventing the accumulation of potentially large
objects in RAM while waiting for the end of the response. Should not
affect objects with unknown content length.
Side-effect 1: probably fixes several cases of unknowingly using a negative
(unknown) mem_obj->object_sz in calculations. I added a few assertions
to double-check some of the remaining object_sz/objectLen() uses.
Side-effect 2: When expectedReplySize() is stored on disk as StoreEntry
metadata, it may help to detect truncated entries when the writer
process dies before completing the swapout.
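To illustrate the decision logic (a minimal sketch, not the patch code:
the types and the standalone mayStartSwapOut() helper are made up for the
example, and reply headers are ignored for brevity):

    #include <cstdint>

    static const int64_t UnknownSize = -1;

    struct MemObjectSketch {
        int64_t object_sz;      // final object size; -1 until fully received
        int64_t contentLength;  // advertised Content-Length, or -1

        // Roughly what expectedReplySize() provides: the final size if we
        // already have it, otherwise the advertised content length.
        int64_t expectedReplySize() const {
            return object_sz >= 0 ? object_sz : contentLength;
        }
    };

    // Decide whether the object can go to a cache_dir with the given
    // max-size limit without waiting for the last body byte to arrive.
    bool mayStartSwapOut(const MemObjectSketch &mem, int64_t dirMaxObjSize) {
        const int64_t expected = mem.expectedReplySize();
        if (expected == UnknownSize)
            return false;       // size still unknown; keep deferring
        return expected <= dirMaxObjSize;
    }
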
* Removed mem->swapout.memnode in favor of mem->swapout.queue_offset.
The code used the swapout.memnode pointer to keep track of the last page
that was swapped out. The code was semi-buggy because it could reset the
pointer to NULL if no new data came in before the call to doPages().
Perhaps the code relied on the assumption that the caller will never call
doPages() if there is no new data, but I am not sure that assumption held
in all cases (it could be that I broke the calling code, of course).
Moreover, the page pointer was kept without any protection against the page
disappearing during asynchronous swapout. There were "Evil hack time"
comments discussing how the page might disappear.
Fortunately, we already have mem->swapout.queue_offset that can be fed
to getBlockContainingLocation to find the page that needs to be swapped
out. There is no need to keep the page pointer around. The
queue_offset-based math is the same, so we are not adding any overhead
by using that offset (in fact, we are removing some minor computations).
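For illustration, a simplified sketch of the offset-driven loop (not the
actual mem_hdr/doPages() code; the page container below is a stand-in and
assumes page-aligned pages):

    #include <cstdint>
    #include <map>

    static const int64_t PageSize = 4096;

    struct PageSketch {
        int64_t start;              // absolute offset of the first byte
        char data[PageSize];
    };

    struct PagedBufferSketch {
        std::map<int64_t, PageSketch> pages;  // keyed by page-aligned offset

        // Roughly what getBlockContainingLocation() does: find the page
        // that covers the given absolute offset, if it is still in RAM.
        const PageSketch *getBlockContainingLocation(int64_t offset) const {
            const auto it = pages.find((offset / PageSize) * PageSize);
            return it == pages.end() ? nullptr : &it->second;
        }
    };

    // A doPages()-style loop driven purely by the queue offset: there is no
    // stored page pointer left to dangle if a page is freed asynchronously.
    void doPagesSketch(const PagedBufferSketch &buf, int64_t &queueOffset,
                       const int64_t endOffset) {
        while (queueOffset < endOffset) {
            const PageSketch *page = buf.getBlockContainingLocation(queueOffset);
            if (!page)
                break;              // nothing swappable in RAM right now
            // ... write bytes starting at (queueOffset - page->start) ...
            queueOffset = page->start + PageSize;  // advance past this page
        }
    }
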
* Added "close how?" parameter to storeClose() and friends.
The old code would follow the same path when closing swapout activity
for an aborted entry and when completing a perfectly healthy swapout. In
the non-shared case, that could have been OK because the abort code would
then release the entry, removing any half-written entry from the index
and the disk (but I am not sure that release happened fast enough in
100% of cases).
When the index and disk storage are shared among workers, such
"temporary" inconsistencies result in truncated responses being
delivered by other workers to the user because once the swapout activity
is closed, other workers can start using the entry.
By adding the "close how?" parameter to closing methods we allow the
core and SwapDir-specific code to handle aborted swapouts appropriately.
Since swapin code is "read only", we do not currently distinguish
between aborted and fully satisfied readers: The readerGone enum value
applies to both cases. If needed, the SwapDir reading code can make that
distinction by analyzing how much was actually swapped in.
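Schematically (a sketch only: readerGone matches the description above,
but the other names and the shared-entry flags are invented for the example):

    enum CloseHow {
        writerCompleted,    // swapout finished successfully
        writerGone,         // writer aborted; contents cannot be trusted
        readerGone          // reader finished or gave up (swapin side)
    };

    struct SharedEntrySketch {
        bool readableByOthers = false;  // may other workers use this entry?
        bool doomed = false;            // marked for removal
    };

    // SwapDir-specific close handler: only a completed write publishes the
    // entry; an aborted write marks the half-written entry for removal so
    // other workers never serve a truncated response.
    void closeSwapout(SharedEntrySketch &e, const CloseHow how) {
        if (how == writerCompleted)
            e.readableByOthers = true;
        else
            e.doomed = true;
    }
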
* Moved "can you store this entry?" code to virtual SwapDir::canStore().
The old code had some of the tests in SwapDir-specific canStore()
methods and some in storeDirSelect*() methods. This resulted in
inconsistencies, code duplication, and extra calculation overheads.
Making this call virtual allows individual cache_dir types to do custom
access controls.
The same method is used for cache_dir load reporting (if it returns
true). Load management needs more work, but the current code is no worse
than the old one in this respect, and further improvements are outside
the scope of this change.
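The shape of the interface, as a rough sketch (not the real SwapDir API;
the signature and the load convention are illustrative assumptions):

    #include <cstdint>

    struct EntrySketch {
        int64_t expectedSize;   // e.g., from expectedReplySize(); -1 if unknown
    };

    class SwapDirSketch {
    public:
        virtual ~SwapDirSketch() {}

        // Can this cache_dir accept the entry? On success, also report the
        // current load (0 = idle, 1000 = saturated in this sketch) so the
        // selection code can pick the least busy dir.
        virtual bool canStore(const EntrySketch &e, int &load) const {
            if (e.expectedSize < 0 || e.expectedSize > maxObjSize)
                return false;   // checks common to all cache_dir types
            load = 0;
            return true;
        }

        int64_t maxObjSize = 0;
    };

    // A cache_dir type with custom admission rules just overrides canStore().
    class SharedDirSketch : public SwapDirSketch {
    public:
        bool canStore(const EntrySketch &e, int &load) const override {
            if (!SwapDirSketch::canStore(e, load))
                return false;
            if (slotMapFull)
                return false;   // e.g., shared slot map has no free slots
            load = 500;         // placeholder load estimate
            return true;
        }

        bool slotMapFull = false;
    };
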
* Minimized from-disk StoreEntry loading/unpacking code duplication.
Moved common (and often rather complex!) code from store modules into
storeRebuildLoadEntry, storeRebuildParseEntry, and storeRebuildKeepEntry.
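To illustrate the intended division of labor (the helper names follow the
patch, but the signatures and stubs below are placeholders rather than the
real declarations):

    #include <string>

    struct RebuildCountsSketch { int scanned = 0; int kept = 0; };
    struct EntrySketch { std::string key; long expectedSize = -1; };

    // Placeholder stubs standing in for storeRebuildLoadEntry(),
    // storeRebuildParseEntry(), and storeRebuildKeepEntry():
    bool loadEntrySketch(int, std::string &, RebuildCountsSketch &) { return true; }
    bool parseEntrySketch(const std::string &, EntrySketch &, RebuildCountsSketch &) { return true; }
    bool keepEntrySketch(const EntrySketch &, RebuildCountsSketch &) { return true; }

    // What an individual cache_dir's rebuild step reduces to once the common
    // load/parse/keep logic lives in shared helpers:
    bool rebuildOneEntrySketch(int fd, RebuildCountsSketch &counts) {
        ++counts.scanned;
        std::string rawBuf;
        EntrySketch e;
        if (!loadEntrySketch(fd, rawBuf, counts))
            return false;       // could not read the on-disk record
        if (!parseEntrySketch(rawBuf, e, counts))
            return false;       // swap metadata did not parse
        if (!keepEntrySketch(e, counts))
            return false;       // e.g., expired or duplicate entry
        // ... index the entry in this cache_dir ...
        ++counts.kept;
        return true;
    }
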
* Do not set object_sz when the entry is aborted because the true object
size (HTTP reply headers + body) is not known in this case. Setting
object_sz may fool client-side code into believing that the object is
complete.
This addresses an old complaint from RBC.
* When swapout initiation fails, release the StoreEntry. This prevents the
caller code from trying to swap out again and again (which it would
otherwise keep doing because swap_status becomes SWAPOUT_NONE).
TODO: Consider adding SWAPOUT_ERROR, STORE_ERROR, and similar states. It
may solve several problems where the code sees _NONE or _OK and thinks
everything is peachy when in fact there was an error.
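A sketch of the release-on-failure idea (simplified; SWAPOUT_NONE is the
real status value mentioned above, while the types and the ioCreated flag
are invented for the illustration):

    enum SwapStatusSketch { SWAPOUT_NONE, SWAPOUT_WRITING, SWAPOUT_DONE };

    struct StoreEntrySketch {
        SwapStatusSketch swap_status = SWAPOUT_NONE;
        bool released = false;
        void release() { released = true; }  // stand-in for StoreEntry::release()
    };

    // If swapout cannot be started, release the entry; otherwise the caller
    // keeps seeing SWAPOUT_NONE and keeps retrying the swapout.
    void startSwapOutSketch(StoreEntrySketch &e, const bool ioCreated) {
        if (!ioCreated) {
            // swap_status stays SWAPOUT_NONE, but a released entry
            // will not be offered for swapout again
            e.release();
            return;
        }
        e.swap_status = SWAPOUT_WRITING;
    }
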
* Always call StoreEntry::abort() instead of setting ENTRY_ABORTED manually.
* Rely on entry->abort() side-effects if ENTRY_ABORTED was set.
* Added or updated comments to better document current code.
* Added operator << for dumping a StoreEntry summary into the debugging
log. Needs more work to report more info (and to avoid reporting
yet-unknown info).
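Something along these lines, as a toy example (the real operator prints
different fields and is meant to plug into debugs() output):

    #include <iostream>

    struct StoreEntrySketch {   // stand-in for StoreEntry
        int swap_status = 0;
        int store_status = 0;
        long swap_file_sz = -1;
    };

    std::ostream &operator<<(std::ostream &os, const StoreEntrySketch &e) {
        return os << "e:swap=" << e.swap_status
                  << "/store=" << e.store_status
                  << "/sz=" << e.swap_file_sz;
    }

    int main() {
        StoreEntrySketch e;
        std::cout << "entry: " << e << std::endl;  // usage in a debug line
    }
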
------------------------------------
Thank you,
Alex.