Andres Kroonmaa wrote:
> Not sure what you are talking about, but yes, I agree that there are ways
> to reduce needed io to 1-2 per objects. But compared to mmap this might
> be another thread.
I am speaking of io ops much less than one per object for storing, this
is by sacrifying some disk space and memory, and doing some things that
seems utterly stupid at first but which can be cleverly done to avoid
most of the stupidity ;-)
The basic idea is a cyclical file system (similar to a huge FIFO) but
with some active recoupling to maintain a good hit rate for a cache.
> Why? As well we could insist on cutting it by 3 or 4. The calc was as I
> said rude and wrong ;) And if we take that avg disk io is 10msec and we
> are doing about 88 random ios/sec, I can't see how we could do that in
> 440ms/sec...
There is no need to use random I/O for storing objects. This is what
makes the difference. All writes can be done in large sequencial chunks
which keeps drives and I/O buses happy.
> By async reads you mean threads? Are we ready talking about threads
> as of portable solution? That would be cool of course.
The current async-io implementation can't be regarded as very portable,
but I beleive threads are if used in the way they are intended to be
used, without excessive filedescriptor sharing and other strange things
which are bad for both portability and stability.
> I'm afraid that if data blocks are not sequential, readv does not give
> much benefit, sorted or not. You just avoid OS/app context switches.
> To arrange for readv you perhaps waste cpu comparably. And besides,
> you can't start writing to network until _all_ blocks in readv are
> serviced.
Yes, the main idea of readv is to minimise context switches, and as a
side effect a kind of disk hartbeat.
> I'd imagine this be done as linked-list where you can always find a
> right place to insert new request into the chain, and then use regular
> read/write. btw, I guess OS does also sort multiple requests somehow.
Most OS only does multiple requests when multiple devices are involved,
and I was speaking about one I/O thread per device to fuel the OS with
parallell I/O load for the devices involved.
Some OS:es may support multiple outstanding I/O requests on the same
device, and on those it could be beneficial from having multiple reader
threads for the same device.
> > For writing sequential I/O should be used, either using write threads or
> > chunked mmap() combined with periodic (once per 1/2 second)
>
> I'm afraid, sequential writes can be only possible when there is ample of
> free disk space, ie. cache is not full and fragmentation low. You can't hope
> on that. Perhaps there are ways to ensure sequential free space, but it
> wouldn't be trivial.
I think you are stuck in the thinking of a normal filesystem here. It is
not that hard to rewrite the recycling policy in such a way that there
always is a couple of minutes contigous free space for storing fresh
objects.
> In addition, I think that when accessing disk via cached interface, you
> have no guarantees that disk ops are scheduled in the same order as you
> dictate from application. It makes it harder to implement elevator
> optimisations. But then again, cached interface to raw partition might
> have other benefits.
Above you effectively said that elevator optimisation won't give a
performance gain..
> What has to be solved is the problem that no single block size fits all.
About the only reason to use a block size at all is to keep the file
block index a 32 bit integer, and to avoid having small objects cross
more physical block boundaries than needed.
> Note that mmap seems very attractive for that. But it has many objections.
mmap is very attractive for writing sequencially in the main thread
without blocking. For random I/O or reading it is not very attractive.
> ( btw, msync(ASYNC) is not supported everywhere, eg. FreeBSD-30)
Then use a separate sync thread when not supported. No big deal.
> Please describe multilevel cyclical FS, or point me to relevant papers.
I am planning on this. Need some time (which I don't have) to formulate
the ideas and to collect relevant references.
Basic idea is that writing is always done sequencially in a cyclical
fashion (when the end is reached, restart at the beginning).
Unfortunately we most likely needs something extra besides this to avoid
throwing away objects that we are interested in keeping in the cache,
and for this I have tree ideas which are emerging and I am trying to
find a proper name for this. The full working name is something like
chunked cyclical multilevel feedback file system which should give an
idea of what is involved. multilevel is probably the wrong word here but
I havent figured out another word to use for that part yet, maybe
garbage collecting is more appropriate for describing that part but the
truth is somewhere in between.
/Henrik
Received on Tue Jul 29 2003 - 13:15:56 MDT
This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:12:02 MST