> I agree that the directory structure adds overhead, but it also adds an abstraction
> layer that allows you to use virtually any future FS type provided
> by a vendor. By bypassing this abstraction, you limit yourself to your FS only.
And why is that a problem? The FS would be an #ifdef'ed thing controlling its
use at compile time. If you wanted to try it out, just do it. If you don't
trust it, that's your choice.
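To make that concrete, something along these lines is what I have in mind --
the macro and function names below are purely illustrative, not anything that
exists in the Squid source today:

    /* Hypothetical compile-time switch for the store backend.
     * USE_SQUIDFS and squidfs_open() are made-up names. */
    #ifdef USE_SQUIDFS
    #define STORE_OPEN(path, flags, mode)  squidfs_open((path), (flags), (mode))
    #else
    #define STORE_OPEN(path, flags, mode)  open((path), (flags), (mode))
    #endif

If you don't define USE_SQUIDFS, you get exactly the UFS-based behaviour you
have now.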
> You may be right eventually, but I'm really worried about the complexity
> of the task. On Solaris, for example, the VM system and FS cache are very tightly
> coupled, and their efficiency is really very hard to beat. If you want to
> implement all of it yourself, you'd have to deal with all those caches,
> and at the same time conserve precious RAM. Then, you have to lock that
> cache into physical RAM, as there is no point in a cache that could get
> paged out to swap.
Yes, that's true. My design does call for the buffered pages to be
mlock()ed into RAM, as any filesystem implementation would. The trick here
is that, sure, you reduce your available RAM by mlock()ing, but the
advantage is that you get back control over the size of the system
buffer cache and can keep it at the minimum needed for performance, rather
than losing that control to the OS. Sure, the OS still needs a buffer cache
for other IO, but what are we talking about here? The odd append to access.log
and cache.log. Nothing that requires more than about 10MB of system buffer
cache.
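Roughly, the idea is just this (a sketch only -- the allocator and the size
parameter name are my own, not from the design document):

    #include <stdlib.h>
    #include <sys/mman.h>

    /* Allocate a fixed, user-configured buffer cache and pin it so
     * the OS can never page it out to swap. */
    static void *
    alloc_pinned_cache(size_t cache_bytes)
    {
        void *buf = valloc(cache_bytes);        /* page-aligned */
        if (buf == NULL)
            return NULL;
        if (mlock(buf, cache_bytes) != 0) {     /* may need privilege */
            free(buf);
            return NULL;
        }
        return buf;
    }

The point is that cache_bytes is ours to choose, instead of the OS growing
its buffer cache however it likes.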
> > > > > frag/ 8K chunk filesystem. Basically we want to write a stripped
> > > > > down version of UFS that allows indexing by inode alone. I have a
> > > > > design here at Connect that works out to an average of 2.5 disk
> > > > > accesses per new object pulled in, and 1.7 disk accesses per cache
> > > > > hit, based on a sample taken from our live caches.
> > > > >
> > > > > Compare this to an average of approx. 7 disk accesses per new object write
> > > > > with UFS, and an average of 3.5 disk accesses per cache hit.
> > >
> > > How do you measure that?
The same way you did below.
> > I plugged disk access traces into some simulations to arrive at the 2.5/1.7
> > values. For UFS I just watched the number of disk accesses with OS tools
> > as opposed to what Squid was doing over the same period. It would seem that
> > UFS does a great deal more disk accesses than it needs to for Squid.
>
> Umm, what OS tools? If I look at how my cache is running, I see that for
> about 30 URLs/sec my disks are doing about 16-20 reads/sec and about 25-35
> writes/sec, giving an average of about 1-1.5 disk accesses _per URL_.
>
> 98/08/26 11:30:01 - Total time: 30007 ms (0.01 hours)
> acl: 155 ms 0.52% 959 calls 0.162 ms/call 31.96 calls/sec
> connect: 164 ms 0.55% 703 calls 0.233 ms/call 23.43 calls/sec
> diskr: 2319 ms 7.73% 1665 calls 1.393 ms/call 55.49 calls/sec
> diskw: 823 ms 2.74% 2582 calls 0.319 ms/call 86.05 calls/sec
> openr: 1959 ms 6.53% 524 calls 3.739 ms/call 17.46 calls/sec
> openw: 2647 ms 8.82% 352 calls 7.520 ms/call 11.73 calls/sec
> unlink: 50 ms 0.17% 3 calls 16.667 ms/call 0.10 calls/sec
>
> (output of iostat -x 30 for the same timeframe as squid stats above)
> extended device statistics
> device r/s w/s kr/s kw/s wait actv svc_t %w %b
> sd8 16.0 26.8 110.0 198.7 1.3 0.3 37.9 2 30
>
> 98/08/26 11:30:31 - Total time: 30018 ms (0.01 hours)
> acl: 207 ms 0.69% 1161 calls 0.178 ms/call 38.68 calls/sec
> connect: 200 ms 0.67% 864 calls 0.231 ms/call 28.78 calls/sec
> diskr: 2817 ms 9.38% 1732 calls 1.626 ms/call 57.70 calls/sec
> diskw: 873 ms 2.91% 2787 calls 0.313 ms/call 92.84 calls/sec
> openr: 2369 ms 7.89% 605 calls 3.916 ms/call 20.15 calls/sec
> openw: 2036 ms 6.78% 433 calls 4.702 ms/call 14.42 calls/sec
>
> extended device statistics
> device r/s w/s kr/s kw/s wait actv svc_t %w %b
> sd8 18.1 26.1 120.5 196.1 1.4 0.3 39.3 1 30
>
> I can see that squid did 20+14=34 opens/sec and the disks have done 18 reads and 44
> ops in total. That's fairly efficient. If you sum together all the squid disk ops,
> we have to face that for about 185 disk ops issued by squid, the system is doing
> about 44 disk accesses, and that sounds way better than what you claimed for UFS.
>
> For me, it means that my OS's FS cache is working pretty efficiently: it has most
> hot dir files in RAM, it has enough RAM for read and write buffering, and it is
> able to optimise disk accesses a lot.
>
> (Note that the above samples are taken from a cache that is 97% full (13.6GB of 14GB).
> You see so few unlinks because I use a modified squid that truncates and overwrites
> files rather than first unlinking and then creating; for 2 million hits in 24h we
> usually have only about 700 unlinks.)
Is that 14GB spread across how many disks? Are you using a single 14GB
spindle? Is sd8 that spindle? Are there more spindles used for caching?
Taking another sample from our caches, I see:
38 URLs/sec over a 30-second period.
12 x 4GB disks
extended disk statistics
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
sd30 4.7 2.9 31.3 22.3 0.0 0.2 31.6 0 8
sd31 6.1 11.0 59.4 78.7 0.0 0.9 55.2 0 16
sd32 5.2 15.1 35.4 113.4 0.0 1.7 84.7 0 20
sd45 5.7 10.1 39.9 78.5 0.0 1.3 81.5 0 16
sd46 4.3 8.5 27.8 59.5 0.0 0.6 48.1 0 12
sd47 3.8 4.1 24.1 31.2 0.0 0.6 75.2 0 8
sd60 6.1 11.4 42.2 80.0 0.0 1.1 64.0 0 17
sd61 3.5 6.5 23.4 56.3 0.0 0.5 55.0 0 10
sd62 5.5 10.1 34.5 79.1 0.0 0.8 52.1 0 16
sd75 3.8 9.8 24.0 72.4 0.0 0.8 56.2 0 13
sd76 4.4 10.1 32.7 73.7 0.0 0.9 64.5 0 15
sd77 4.7 11.3 32.0 84.8 0.0 1.0 59.6 0 16
We see a total of 57.8 reads/sec and 110.9 writes/sec.
This is at a medium-load period where the Unified Buffer Cache is not being
overly strained with excessive disk IO.
The breakdown for those 30 seconds was a 52% object miss rate (624 object writes),
with 6 memory hits and 560 disk-based hits (object reads), plus a few errors.
That comes out to an average of:
560/30 = 18.66 object reads/sec, for 57.8/18.66 = 3.1 disk ops per object read
624/30 = 20.8 object writes/sec, for 110.9/20.8 = 5.3 disk ops per object write
This gets worse as load increases, up to the peaks of about 3.5/7.0 I quoted
before. I imagine it's getting worse because the namei cache starts
thrashing. Also, I've turned `fastfs' on for these filesystems in the
meantime, which batches directory updates at the expense of possible corruption
in the event of a system failure, and that's why the disk writes have
dropped off a bit too. `fastfs' is not a desirable long-term solution.
If your system is performing better, I'm happy for you. We, however, have
a genuine need for a filesystem that performs better than what we are seeing.
You can't dismiss a need just because you can't see one for yourself.
> By writing your own FS, you'd have to use direct I/O (unbuffered & uncached by the OS),
> as you would be writing to raw partitions, and to achieve the same numbers you'd
> also need to implement very efficient caching. You'd need to allocate large amounts
> of RAM for that task and make sure that this RAM is not paged out to swap;
> you'd need to implement clustering, read-ahead, free-behind,
> avoid fragmentation, etc., etc.
The design I've written already covers the above principles. Read-ahead is
already implied by pulling in 8K chunks at a time; 90% of all objects are
less than 8K, so we don't need much read-ahead in those circumstances.
Clustering also becomes relatively moot in that situation. I haven't
designed for clustering, but it would be VERY easy to extend the design to
grab clusters of sequential blocks and do read-ahead if necessary.
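For what it's worth, the extension would look something like the sketch below
(the names and the pread() approach are illustrative, not lifted from the
design document):

    #include <unistd.h>
    #include <sys/types.h>

    #define CHUNK_SIZE 8192

    /* Read 'nchunks' physically sequential chunks in one system call.
     * With nchunks == 1 this is the normal case; larger values give
     * clustering/read-ahead almost for free. */
    static ssize_t
    read_cluster(int fd, off_t first_chunk, void *buf, int nchunks)
    {
        return pread(fd, buf, (size_t)nchunks * CHUNK_SIZE,
                     first_chunk * (off_t)CHUNK_SIZE);
    }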
> In my view, there is too much work to overcome something that can be and
> should be fixed by efficient caching. And what needs to be done around squid
> is to minimise the chances that it busts any possible OS optimisations.
> If that needs more knowledge of the OS, that's fine, as you can't tune your
> box until you understand your OS's internals anyway...
OSes do general-purpose optimisations. For squid at very high loads, OSes
break. I'd rather have a specially designed FS that's consistent for everyone
than one I can only get at by poking OS-dependent variables.
> > I believe we're starting to cap out in this area. I understand very well
> > about disk optimisation, and Squid is getting close to the end of things
> > it can do with regard to UFS without making big assumptions about what
> > an OS is doing and the disk layout. As for disk usage patterns, what
> > did you have in mind with regard to a system that is fairly random? We
> > already use the temporal and spatial locality inherent in cache operation
> > to make things work better. I'm open to any suggestions as to what more
> > can be done. Remember I'm after quantum leaps in performance, not just
> > 5% here and 10% there from minor fiddling.
>
> OK, I'll give it a try ;)
I see Kevin has just replied to the below stuff. I'm satisfied with his
explanation.
<snip lots of good analysis about the sort of work a UFS filesystem needs to
do>
> What do you think?
<snip brief description of Connect's SquidFS block sizings.>
Note that the revised design allows for 8192 bytes (1 chunk) to be used with
the inode (and a minimum of 1 frag, or 4096 bytes). This now means that
87% of all objects will fit into 1 disk access.
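To make the arithmetic concrete, the on-disk picture is roughly the one below.
The field names are mine and only illustrative; the sizes are the ones from the
design (an 8192-byte chunk with 512 bytes reserved for the inode, leaving 7680
data bytes -- enough for ~87% of objects in a single access).

    #include <stdint.h>

    #define CHUNK_SIZE 8192
    #define INODE_SIZE 512

    struct sfs_chunk {
        struct sfs_inode {
            uint32_t ino;                       /* object index */
            uint32_t size;                      /* object length in bytes */
            uint32_t next_chunk;                /* 0 if the object fits here */
            uint8_t  pad[INODE_SIZE - 12];
        } inode;
        uint8_t data[CHUNK_SIZE - INODE_SIZE];  /* first 7680 data bytes */
    };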
> This sounds good. But
> - would you implement the cache in RAM? How would you keep it from being paged out to swap?
Disk buffer cache RAM would be mlock()ed. The amount of this is user-definable.
> - would you implement read-ahead?
The design allows for it. It's not implemented yet, as it's of dubious value given
that 87% of all objects are accessed in one minimal 8K data access chunk anyway,
after we reduce the effective size of that 8K chunk by 512 bytes for the inode.
> - would you implement free-behind?
Keh? Do you mean write-behind? Meaning that to modify a block mid-way through,
you have to pull it off disk first, then modify it and write it back.
If you mean that, then yes.
If you mean freeing blocks in the buffer cache as we go, then of course.
They get placed onto a free list and tossed as necessary. The free list
becomes the data/inode cache, whose size is configurable by the user. Given
that Squid maintains its own object cache in memory, the user would set
the free list cache size fairly small to prevent RAM wastage.
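As a rough sketch of the buffer reuse I mean (names hypothetical, list handling
simplified):

    #define CHUNK_SIZE 8192

    /* Buffers sit on a doubly linked LRU list that doubles as the
     * data/inode cache; its length is capped by a user-set limit. */
    struct sfs_buf {
        struct sfs_buf *prev, *next;
        unsigned int    chunkno;    /* which on-disk chunk this holds */
        int             dirty;
        char            data[CHUNK_SIZE];
    };

    static struct sfs_buf *lru_head, *lru_tail;  /* head = most recent */

    static void
    lru_touch(struct sfs_buf *b)
    {
        /* unlink (assumes b is already on the list) */
        if (b->prev) b->prev->next = b->next; else lru_head = b->next;
        if (b->next) b->next->prev = b->prev; else lru_tail = b->prev;
        /* push onto the hot end */
        b->prev = NULL;
        b->next = lru_head;
        if (lru_head) lru_head->prev = b;
        lru_head = b;
        if (lru_tail == NULL) lru_tail = b;
    }

Eviction would take buffers from lru_tail, writing them back first if dirty.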
> - would you implement delayed-writes?
Yes - as a default. Synchronous writes are supported too.
> - clustering?
The design supports it. Not implemented, as it's of dubious value.
> - data caching, is it LRU?
Yes - the amount is user-configurable.
> - fsck?
Yes.
> - work over network (LAN)?
Not relevant. If someone wanted to write user-level NFS-style access to
the filesystem, they could. Do you foresee a need for this?
> - spanning over multiple spindles?
If required, that would be done by the OS. OS-level disk striping sits
below the filesystem level on most OSes I'm aware of. I believe this is not
important anyway, as Squid already handles multiple separate spindles well.
Stew.
--
Stewart Forster (Snr. Development Engineer)
connect.com.au pty ltd, Level 9, 114 Albert Rd, Sth Melbourne, VIC 3205, Aust.
Email: slf@connect.com.au   Phone: +61 3 9251-3684   Fax: +61 3 9251-3666