select() and a thread pool
Mar. 7th, 2007 07:34 pm
Here's the challenge: a select()-based event dispatcher that can scale from single-threaded to multi-threaded efficiently, preferably as simple as possible and able to be effectively edge-triggered (even if select() itself is level-triggered).
Note that the point here is to run the event handlers concurrently. I've used threads in the past to work around deficiencies in select(), putting "quiet" file descriptors on one thread that would sleep most of the time, and "busy" file descriptors on another (since select() scales based on the number of file descriptors it watches, this made the active set more efficient), but it would still only use one thread for event handlers. It was just about sleeping more efficiently.
My first idea was that all threads would be symmetrical and would thus all sleep in select(). But that doesn't work, obviously, because as soon as one file descriptor is ready, all the threads wake up, which is rather wasteful. Splitting the set of file descriptors between the threads isn't so good either, because a scheduler would then be needed to balance the load between threads, and I prefer to leave scheduling to the kernel as much as possible. On the other hand, splitting could allow better cache locality, handling the same file descriptor on the same thread (possibly bound to a processor) as often as possible, but on the third hand, let's not get ahead of ourselves.
It seems that the only way is to keep all the file descriptors together, with just one thread calling select(). When it returns, the fd sets are examined and all the ready events put on a list. Then the thread goes into a loop: take an event from the head of the list, post the semaphore of the next thread (which could be itself, in the single-threaded case), call the event's handler, then wait on its own semaphore. If no event is left, it goes back into select() instead.
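A minimal sketch of that baton-passing, using POSIX semaphores. The queue, pop_event and run_round are hypothetical names of mine, and the select() part is stubbed out by pre-filling the queue; only the handoff between threads is the point:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define MAX_EVENTS 64

/* Hypothetical event list; whoever holds the "baton" semaphore owns it. */
static int queue[MAX_EVENTS];
static int q_head, q_tail;
static sem_t baton;
static atomic_int handled;

static int pop_event(int *ev) {  /* caller must hold the baton */
    if (q_head == q_tail) return 0;
    *ev = queue[q_head++];
    return 1;
}

static void handle_event(int ev) { (void)ev; atomic_fetch_add(&handled, 1); }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        sem_wait(&baton);        /* sleep until handed the baton */
        int ev;
        if (!pop_event(&ev)) {   /* list drained: pass the baton on and stop */
            sem_post(&baton);    /* (a real loop would go back into select() here) */
            return NULL;
        }
        sem_post(&baton);        /* wake the next thread before running the handler */
        handle_event(ev);        /* so handlers run concurrently */
    }
}

/* Simulates one round: nevents ready descriptors, nthreads workers (max 8). */
int run_round(int nevents, int nthreads) {
    pthread_t tids[8];
    q_head = q_tail = 0;
    atomic_store(&handled, 0);
    for (int i = 0; i < nevents; i++) queue[q_tail++] = i;
    sem_init(&baton, 0, 0);
    for (int i = 0; i < nthreads; i++) pthread_create(&tids[i], NULL, worker, NULL);
    sem_post(&baton);            /* the select() thread hands off the first baton */
    for (int i = 0; i < nthreads; i++) pthread_join(tids[i], NULL);
    sem_destroy(&baton);
    return atomic_load(&handled);
}
```

Posting the baton only after dequeuing is what lets the semaphore double as the list's lock.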
There are a few good bits and a few bad bits about this design. A good bit is that the semaphores that keep the other threads sleeping also protect the list at the same time (that's why a thread posts only after taking the next event). A bad bit is that at the end, we would be going back into select() before all the handlers are done, and they might want to add some file descriptors. This could be fixed by having a counter of threads busy running handlers, with the last thread to run out of events being the one that goes back into select(), but this would also make the load more "bursty" on a busy server, where there's really work all the time, but at every round of select(), all the threads but one would go to sleep, only to be reawakened.
I think, in the end, that the usual Unix solution comes to the rescue: add a file descriptor! Keep the reading end of a pipe in the select() set, and have any handler that adds a file descriptor write a byte to the writing end, waking up select() as needed. A bit of a shame, since it would be unnecessary in the single-threaded case, but oh well, that's how it is sometimes...
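The self-pipe trick can be sketched like this (wakeup_select and wait_for_events are my names; a real loop would add its watched descriptors to the set alongside the pipe):

```c
#include <unistd.h>
#include <sys/select.h>

static int wake_fds[2];          /* [0] = read end watched by select(), [1] = write end */

int wakeup_init(void) { return pipe(wake_fds); }

/* Called from a handler after it registers a new file descriptor. */
void wakeup_select(void) { (void)write(wake_fds[1], "x", 1); }

/* Returns 1 if select() was woken by the pipe (and drains it), 0 on a plain timeout. */
int wait_for_events(int timeout_ms) {
    fd_set rfds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    FD_ZERO(&rfds);
    FD_SET(wake_fds[0], &rfds);  /* the real loop would FD_SET its watched fds here too */
    int n = select(wake_fds[0] + 1, &rfds, NULL, NULL, &tv);
    if (n > 0 && FD_ISSET(wake_fds[0], &rfds)) {
        char buf[64];
        (void)read(wake_fds[0], buf, sizeof buf);  /* drain, then rebuild the fd sets */
        return 1;
    }
    return 0;
}
```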
Anyone have suggestions for improvements? They are most welcome!
no subject
Date: 2007-03-07 07:00 pm (UTC)
But then again, Burma.
no subject
Date: 2007-03-07 07:09 pm (UTC)
The thing is, I have this concept that things I spew out should all be in a single place. Now, I use the tags to export feeds to other sites, so the technically-oriented people get just the tech stuff (and really, most of the time the slightest hint of my personal life makes them uncomfortable), but those poor schmucks on LJ, just friending me, they're doomed!
I find it very disappointing that LJ confuses all sorts of things together, like "the people I don't mind reading my friends-locked posts" and "the people I want to read", for example. It'd be neat if you could do some stuff like "yeah, pphaneuf is cool, but his 'power management' and 'scalability' tags are way too hardcore for me, no thanks".
no subject
Date: 2007-03-08 03:09 am (UTC)
How about this? Create a coding/advogato/scalability/tech sub-group within your friends group.
Or there's the ever popular LJ cuts.
But really, I like reading your stuff, especially the picture stuff, 'cause I'm insanely jealous of the pretty pictures you create. I just don't like being unable to reply to cries for help when I lack the ability to help.
Big hugs to
no subject
Date: 2007-03-08 09:08 am (UTC)
I use some super awesome magic software to snort in massive amounts of information daily (including what comes out of my friends' LiveJournals), and I hardly ever visit my LJ friends page (it's so quaint, having to scroll down to the next entry and all!). For those of us living in the future like me, LJ cuts don't work (but thankfully, they aren't needed, since the awesomeness makes it easy to skim over what we're not interested in really fast). So I often forget about them entirely.
Ah, if only the LiveJournal friends page was more like Google Reader, you too could live in the future! It's nice here, and we have drinks with little umbrellas in them! ;-)
But I'll try to use LJ cuts more nonetheless, even if I can't see them (or miss them). :-)
no subject
Date: 2007-04-06 08:38 pm (UTC)
no subject
Date: 2007-04-06 08:52 pm (UTC)As for providing a "public face" from my LJ, I think I've got something worked out. Feeds can now be filtered on a tag, so there's some stuff possible there, I think. For example, see my Advogato page. I'm planning on using Feedburner (see here for example) to cobble up something useful.
no subject
Date: 2007-04-06 08:58 pm (UTC)
no subject
Date: 2007-04-06 09:01 pm (UTC)
messages
Date: 2007-03-07 08:01 pm (UTC)
i.e. you need a different operating system, because you're not going to persuade the linux kernel developers to redesign linux from a monolithic to a microkernel. well - you can try - but good luck with it.
Re: messages
Date: 2007-03-10 07:44 pm (UTC)
no subject
Date: 2007-03-07 08:57 pm (UTC)
Do you need to at all?
Date: 2007-03-07 10:11 pm (UTC)

    lock
    loop:
        while there are events:
            dequeue one
            unlock
            handle it
            lock
        if someThreadPolling:
            condition wait
        else:
            someThreadPolling = true
            unlock
            poll for events
            lock
            someThreadPolling = false
            fire condition
    unlock

That neglects the adding/removing of entries; I agree a self-pipe would be the way to go there.
But William Ahern suggested that this isn't worth the complexity: it's better to just partition the file descriptors among the threads. It seems like there'd be some problems with unbalanced partitions, though. I promised benchmarks: if some simple balancing scheme scales nearly linearly, there's no point in working out this more complex multithreaded loop. I haven't had the time to carry through on that, even though it's been months...
select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 04:28 pm (UTC)
select() is O(n) in its n argument, which is the highest file descriptor plus one. Keep in mind that an fd_set is just a bitvector. Both userspace and the kernel have to go through every entry in the vector just to see whether it's set. And in practice, this appears to be enough work that it makes some difference.
This is in contrast to poll(), which has the behavior you describe. You can see benchmarks of the difference here (http://www.atnf.csiro.au/people/rgooch/benchmarks/polling.html). Look at the "Linux 2.1.52-pre-patch-2 (includes poll patch) + select & fastpoll.v1 patches" numbers, which are most representative of systems made in the past however many years since the benchmark was done. You can see the select() times don't go down nearly as much when checking just the high descriptors as the poll() ones do.
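A toy illustration of that linear scan, mimicking what a select() caller (and, similarly, the kernel) has to do to find the ready descriptors: watching a single high-numbered descriptor still costs a walk over every lower slot. count_ready is a made-up name for the sketch:

```c
#include <sys/select.h>

/* Walks the bitvector the way a select() user must: every bit up to maxfd
   is examined, so the cost grows with the highest descriptor even when
   only one descriptor is actually being watched. */
int count_ready(const fd_set *set, int maxfd, int *scanned) {
    int ready = 0;
    *scanned = 0;
    for (int fd = 0; fd <= maxfd; fd++) {
        (*scanned)++;
        if (FD_ISSET(fd, set)) ready++;
    }
    return ready;
}
```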
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 04:56 pm (UTC)
poll() is also less efficient than select() with a densely filled set of file descriptors, which shows up when you throw enough of them at it. Which is why, in actuality, I used select() for the "quiet thread" (there are usually more quiet connections at any given time) and poll() in the "active thread". But if I had too many active file descriptors, the size of the poll() array would become the limiting factor.
I could have made something that switched between one and the other, but that was too complicated, introducing branches where there weren't any before and maintaining two sets of data structures where only one was used at any given time. I switched to epoll instead. ;-)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 05:02 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 09:14 pm (UTC)
The thing is, I want something cross-platform (so it should be able to get by with just select() or poll()) that can scale up and down, both in load and in machine specs. For the higher loads, it will have to be something good like epoll or kqueue(), but in many other situations, on weird little platforms, select() will be the only way, and should be sufficient.
Is libevent getting fat, these days? A web server? Isn't that my job? :-)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 09:23 pm (UTC)
And I don't think libevent's getting too heavy. I'm not planning on using any of the fancy web stuff, but on a standard Linux system I don't care too much about the code size. If I did, I'd push for putting it into a separate .so, then everyone would be happy. (-levhttp -levent or something).
One major reason you might not want to use libevent is if it can't do what you want multithreaded-wise. I don't think that crowd will be too supportive of the sort of complicated behavior you want, and I'm not sure if you could/would want to implement it as a wrapper around libevent rather than changes to the core.
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 09:52 pm (UTC)
Note that in your example of TCP proxying, there's not much locking involved. Each direction is its own state machine (watch for read, read into a buffer, watch for write, write the buffer out, repeat), and since each is watching only one file descriptor at a time, it can all be done lockless, even if both sides are being read from at the same time. Amazing, isn't it?
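A rough sketch of one such direction as its own state machine (struct direction and direction_step are my names, and it's simplified: partial writes and the actual registering of read/write interest with the event loop are glossed over):

```c
#include <unistd.h>

enum dir_state { WANT_READ, WANT_WRITE };

/* One direction of the proxy: a self-contained state machine touching no
   shared data, so the two directions of a connection need no locks. */
struct direction {
    int src_fd, dst_fd;
    enum dir_state state;
    char buf[4096];
    ssize_t len;
};

/* Run one step when the watched descriptor is ready; since a direction
   watches only one fd at a time, only one thread ever runs this for it.
   Returns -1 on EOF/error, meaning this direction is done. */
int direction_step(struct direction *d) {
    if (d->state == WANT_READ) {
        d->len = read(d->src_fd, d->buf, sizeof d->buf);
        if (d->len <= 0) return -1;   /* EOF or error: start tearing down */
        d->state = WANT_WRITE;        /* now watch dst_fd for writability */
    } else {
        /* simplified: assumes the whole buffer goes out in one write */
        if (write(d->dst_fd, d->buf, (size_t)d->len) < 0) return -1;
        d->state = WANT_READ;         /* back to watching src_fd */
    }
    return 0;
}
```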
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 09:53 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-08 10:01 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-09 06:33 am (UTC)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-09 08:08 am (UTC)
There's at least one corner case I don't quite understand. Say you have empty buffers and your stateful firewall sends you a pair of RST packets. You had to have been watching both descriptors, and they both become available at once. Two threads simultaneously start working on opposite ends, locking their half and attempting to read into their buffer. The read fails. Now each wants to destroy the pair. What happens?
The obvious answer would be for a failed read to lock both sides, then do the destroy. But it can't lock the other side safely without relinquishing its own lock, or the order would be inconsistent and it might deadlock. Neither can it relinquish its own lock, or the other might have destroyed it in the meantime. Seems like you have to follow a tricky procedure to destroy pairs...and (perhaps due to the hour), it's eluding me.
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-10 11:47 am (UTC)
When a direction reads an EOF, it "writes" a shutdown(SHUT_WR) on the other end, that much is for sure. After that, we won't be getting anything more to read, so we won't run this handler again.
The question is, after that, did the other side already do that? If it did, we can just close both file descriptors and destroy the corresponding state. If not, we can just return, never to be called again (quite possibly removing the interest in the read events, but we could leave that to the destructor too, it's not critical).
Enter the "half_closed" bool, protected by a mutex. It's kind of silly, because it's only ever accessed at the end of the process, but at least there won't be much contention on that mutex!
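A sketch of that half_closed handshake (struct pair and direction_finished are hypothetical names, and the destroyed flag stands in for actually closing both descriptors and freeing the state):

```c
#include <pthread.h>

/* Shared teardown state for a proxied connection pair: whichever
   direction finishes second does the actual destruction. */
struct pair {
    pthread_mutex_t lock;
    int half_closed;
    int destroyed;   /* stands in for close()/free() in this sketch */
};

/* Called by a direction's handler once it has seen EOF and issued its
   shutdown(SHUT_WR) on the peer. Returns 1 if this call destroyed the pair. */
int direction_finished(struct pair *p) {
    pthread_mutex_lock(&p->lock);
    if (!p->half_closed) {           /* first side done: just mark and return */
        p->half_closed = 1;
        pthread_mutex_unlock(&p->lock);
        return 0;
    }
    pthread_mutex_unlock(&p->lock);  /* second side: no one else can touch us now */
    p->destroyed = 1;                /* safe to close both fds and free the state */
    return 1;
}
```

Since each side calls this at most once, the mutex only ever arbitrates the one race at the very end, which is why there's so little contention on it.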
Note that for more complex interactions between multiple file descriptors, I have a containing object that can make sure all the callbacks are single-threaded between each others (something a bit like COM apartments, but not quite the same).
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-10 11:20 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Date: 2007-03-11 03:23 pm (UTC)
Otherwise, it's back to locks, which sucks when it's just for a specific exceptional case...