The End Of The World (As We Know It)!
May. 16th, 2008 07:25 pm
Ok, here we go:
Event-driven non-blocking I/O isn't the way anymore for high-performance network servers; blocking I/O on a bunch of threads is better now.
Wow, I can't believe I just wrote that! Here's a post that describes some of the reasons (it's talking more about Java, but the underlying reasons apply to C++ as well; it's not just JVMs getting wackier at optimizing locking). It depends on your platform (things don't change from being true to being false just out of the blue!), and more specifically, I have NPTL-based Linux 2.6 in mind, at the very least (NPTL is needed for better futex-based synchronization, and 2.6 for the O(1) scheduler and low per-thread overhead). You also want to specify the smallest stacks you can get away with, and you want a 64-bit machine (it has a bigger address space, meaning it will explode later).
The most important thing you need is to think and not be an idiot, but that's not really new.
And when I say "bunch of threads", I really mean it! My current "ideal design" for a web server now involves not just a thread per connection, but a thread per request (of which there can be multiple requests per connection)! Basically, you want one thread reading a request from the socket, then once it's read, fork it off to let it do its work, and have the writing of the reply to the socket be done on the request thread. This allows for as much pipelining as possible.
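A minimal sketch of that layout in C++11. The names and types here are hypothetical stand-ins: a real server would block on a socket read instead of walking a vector of strings, and the reply would go back out on the socket rather than into a log.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// One thread per connection reads requests in order; each request is then
// forked off to its own thread, which does the work and writes the reply
// itself, so later requests can be read (and run) while earlier ones finish.
static std::mutex g_reply_mutex;
static std::vector<std::string> g_replies;  // stands in for writes to the socket

static void handle_request(std::string request) {
    // The request's "work", then the reply is written from this same thread.
    std::string reply = "echo: " + request;
    std::lock_guard<std::mutex> lock(g_reply_mutex);
    g_replies.push_back(reply);
}

static void handle_connection(std::vector<std::string> requests) {
    std::vector<std::thread> workers;
    for (auto& req : requests)                      // the "reading" loop
        workers.emplace_back(handle_request, req);  // one thread per request
    for (auto& t : workers) t.join();
}
```

Replies can land in any order, which is exactly the pipelining the text describes; a real HTTP server would have to sequence the writes back onto the connection.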
Still, event-driven I/O is not completely useless; it is still handy for protocols with long-lived connections that stay quiet for a long time. Examples of that are IRC and LDAP servers, although it's possible that with connection keep-alive, one might want to do that with an HTTP server as well, using event notification to see that a request has arrived, then handing it back to a thread to actually process it.
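That hybrid can be sketched with Linux's epol: wait cheaply on the idle fd, and only spend a thread once there is actually something to read. This is a minimal, Linux-only illustration (a single fd, a single worker; a real server would loop over many fds):

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <sys/epoll.h>
#include <unistd.h>

// Wait for one quiet fd to become readable, then hand it to a worker
// thread that does the actual (blocking) read and processing.
static std::string wait_then_process(int fd) {
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    epoll_event ready{};
    epoll_wait(ep, &ready, 1, -1);   // the idle connection finally speaks up

    std::string result;
    std::thread worker([&] {         // hand it back to a thread
        char buf[64] = {0};
        ssize_t n = read(ready.data.fd, buf, sizeof buf - 1);
        if (n > 0) result.assign(buf, n);
    });
    worker.join();
    close(ep);
    return result;
}
```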
I also now realize that I was thinking too hard in my previous thoughts on using multiple cores. One could simply have a "waiting strategy" (be it select() or epoll), and something else to process the events (an "executor", I think some people call that?). You could then have a simple single-threaded executor that just runs the callbacks right there and then, no more fuss (think of WvStreams' post_select()), or you could have a fancy-pants thread pool, whatever you fancied. I was so proud of my little design, now it's all useless. Oh well, live and learn...
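The executor split can be made concrete in a few lines. A sketch under stated assumptions: the interface name is mine, and the threaded variant spawns one thread per callback purely for brevity; a real pool would keep a fixed set of workers fed from a queue.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// The "waiting strategy" (select()/epoll) just produces events; an
// executor decides where the callbacks run. Two interchangeable choices:
struct Executor {
    virtual ~Executor() {}
    virtual void execute(std::function<void()> task) = 0;
};

// Runs the callback right there and then (like WvStreams' post_select()).
struct DirectExecutor : Executor {
    void execute(std::function<void()> task) override { task(); }
};

// Stand-in for a thread pool: one thread per task, all joined on
// destruction. Swapping this in changes the concurrency model without
// touching the event loop at all.
struct ThreadedExecutor : Executor {
    std::vector<std::thread> threads;
    void execute(std::function<void()> task) override {
        threads.emplace_back(std::move(task));
    }
    ~ThreadedExecutor() { for (auto& t : threads) t.join(); }
};
```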
no subject
Date: 2008-05-17 12:35 am (UTC)
At that point, why bother with asynchronous IO? Most of the time you need to know the result to continue, so why not isolate the sequential task in its own thread and leave the rest to its business?
Maybe I should just have said:
<AOL>
Me too
</AOL>
You know the rules!
Date: 2008-05-17 02:19 am (UTC)
no subject
Date: 2008-05-17 02:35 am (UTC)
Hmmm....
I'll have to think about that. I still wonder about latency issues. How much does it cost to move data from one core's cache to another's? Wouldn't it make more sense for threads to be data-centric in order to decrease latency and make it more likely that something will be in cache?
It's clear that some level of threading is a good idea nowadays.
One interesting thing that Java has done that I think is an overall useful concept is making some data structures immutable so that no locking operations are required to access them.
This is going to require some thinking.
no subject
Date: 2008-05-17 04:35 pm (UTC)
TO THE FUTURE! :)
no subject
Date: 2008-05-17 04:43 pm (UTC)
Even now, some things are complicated. I'm told that some printf() implementations use more than 16 KB of stack, so if you run with a very small stack, there are some functions that just become unusable.
So it's not all a given. If you have a dual quad-core Xeon and Linux 2.4, you'd still have wanted an event-based design, with multiple threads to use all the cores, yes, but not one per connection or something, more like one or two per available core.
no subject
Date: 2008-05-17 05:05 pm (UTC)
no subject
Date: 2008-05-17 05:40 pm (UTC)
BTW, nobody uses 2.4 anymore.
no subject
Date: 2008-05-17 05:57 pm (UTC)
For the cache, that's another good point to bring up, but it's also in favour of threads now. Good schedulers already take CPU affinity into account, and each thread (should!) operate on its own data as much as possible, whereas a multi-threaded event-driven system has each thread operating on many unrelated data structures. You could then have a "thread affinity" for each "task", so that they're hopefully scheduled on the same processor, but that's just replicating what the kernel already does for threads themselves.
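For the rare cases where you do want to pin things by hand, Linux exposes exactly this knob. A minimal sketch using glibc's pthread_setaffinity_np (Linux-specific; the function name is real, the wrapper is mine):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cassert>

// Pin the calling thread to one CPU, mirroring by hand what the kernel's
// affinity heuristics try to do automatically when each thread sticks to
// its own data.
static bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set) == 0;
}
```

Note that this overrides the scheduler rather than informing it, which is why leaving the decision to the kernel is usually the better default.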
I was certainly already considering "some level of threading", but still thinking that an overall event-driven system is better, with a small number of threads (some number derived from the number of cores). But now I'm thinking lots of threads with blocking I/O, and maybe keeping an event-driven part just for dealing with idle connections, no more.
I find the history of data structures in Java interesting: they started, a bit naively, with data structures that had a lot of locking (every method being synchronized, for example, heralded as the future back then!), and things evolved over time, now with some interesting data structures geared specifically toward multi-threading, such as concurrent hash maps, for example.
Immutable data structures have had a good influence, indeed; IO-Lite, for example, maybe not directly inspired by that, but sharing the same principles. I remember seeing something where there was a reliance on garbage collection to extract the best out of this, but thankfully, reference counting (as provided by shared_ptr, say) is apparently sufficient. I need to look this up...
no subject
Date: 2008-05-17 07:48 pm (UTC)
I was thinking that having a 'thread per context' instead of a 'thread per request' might be a good idea, to take maximum advantage of the affinity scheduling of most modern systems. A thread per request doesn't give the scheduler enough information, and each request in the same context may well be handled on a different CPU.
It would be nice if thread schedulers could use information about the contents of CPU caches to assign CPU affinity. Maybe some kind of special table mapping CPU number to physical memory ranges.
Though for a request of more than 10-20 milliseconds the overhead of pushing context between CPUs on the same box is likely negligible. For a request of 100-200 milliseconds the overhead of pushing context between CPUs on different boxes in the same datacenter is likely negligible.
no subject
Date: 2008-05-17 09:07 pm (UTC)
I had to support 2.4 until not so long ago; it's been annoyingly clingy, for some reason. I think that reason was RHEL, which kept using a 2.4 kernel for a ridiculously long time, and even once the new 2.6-based version was out, the kind of people who use RHEL are also really slow at upgrading...
But I was just pointing that out in the sense that back with 2.4, threads weren't sensible, and now they are.
no subject
Date: 2008-05-17 10:17 pm (UTC)
for atomic ops (mutexes ;-))
I haven't had a look at it yet.
Re: You know the rules!
Date: 2008-05-17 10:18 pm (UTC)
no subject
Date: 2008-05-17 10:22 pm (UTC)
Now, what you're asking about mapping the contents of CPU caches is basically what having a 1:1 mapping of "tasks" (what you call a context?) to threads does. You use the same data in a thread, and the scheduler's CPU affinity mechanism tries to get you back to the CPU with the cache content it needs. Multiplexing multiple tasks in a single thread hides this information from the scheduler, unfortunately, so you end up having to do it in userspace.
I've been wishing for all sorts of things, like a non-blocking mlock() that sends an event when the pages are in core, but in the end, fixing threads to not suck is another option I should have considered, and in a way it's much more general, if done right. Maybe dangerously so, but what can you do...
About migrating between boxes in a datacenter: the kinds of loads I've seen that could use that often can't, because they also tend to operate on large datasets. Being bound by the memory bus is coming up more and more often, at the very edge of what we can do, it seems...
I'll follow-up with another post soon, keep an eye out!
no subject
Date: 2008-05-17 10:28 pm (UTC)
shared_ptr uses an atomic type for its reference count; if you combine that with pointing at something const (immutable) and overwriting the shared_ptr, you can make something not unlike RCU, with all the goodness that this entails!
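That combination can be sketched with the std::atomic_load/std::atomic_store overloads for shared_ptr (standard in C++11, later deprecated in C++20 in favour of std::atomic&lt;std::shared_ptr&lt;T&gt;&gt;). The Config type and function names are illustrative, not from any real codebase:

```cpp
#include <cassert>
#include <memory>
#include <string>

// RCU-flavoured pattern: readers atomically grab a shared_ptr to an
// immutable snapshot; a writer builds a fresh object and atomically
// publishes it. Old readers keep their snapshot alive via the refcount.
struct Config { std::string name; int version; };

static std::shared_ptr<const Config> g_config =
    std::make_shared<const Config>(Config{"initial", 1});

static std::shared_ptr<const Config> read_config() {
    return std::atomic_load(&g_config);           // no mutex on the read path
}

static void update_config(std::string name) {
    auto old = std::atomic_load(&g_config);
    auto fresh = std::make_shared<const Config>(  // copy, then modify
        Config{std::move(name), old->version + 1});
    std::atomic_store(&g_config, fresh);          // atomic publish
}
```

Unlike kernel RCU there's no grace period to manage: the reference count is what keeps a superseded snapshot alive until its last reader drops it.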
Re: You know the rules!
Date: 2008-05-17 10:28 pm (UTC)
Re: You know the rules!
Date: 2008-05-17 10:54 pm (UTC)