pphaneuf | Entries tagged with scalability

Being the end of the world and all, I figure I should go into a bit more details, especially as

omnifarious went as far as commenting on this life-altering situation.

He's unfortunately correct about a shared-everything concurrency model being too hard for most people, mainly because the average programmer has a lizard's brain. There's not much I can do about that, unfortunately. We might be having an issue of operating systems here, rather than languages, for that aspect. We can fake it in our Erlang and Newsqueak runtimes, but really, we can only pile so many schedulers up on each others and convince ourselves that we still make sense. That theme comes back later in this post...

omnifarious's other complaint about threads is that they introduce latency, but I think he's got it backward. Communication introduces latency. Threads let the operating system reduce the overall latency by letting other runs whenever it's possible, instead of being stuck. But if you want to avoid the latency of a specific request, then you have to avoid communication, not threads. Now, that's the thing with a shared-everything model, is that it's kind of promiscuous, and not only is it tempting to poke around in memory that you shouldn't, but sometimes you even do it by accident, when multiple threads touch things that are on the same cache line (better allocators help with that, but you have to be careful still). More points in the "too hard for most people" column.

His analogy of memcached with NUMA is also to the point. While memcached is at the cluster end of the spectrum, at the other end, there is a similar phenomenon with SMP systems that aren't all that symmetrical, multi-cores add another layer, and hyper-threading yet another. All of this should emphasize how complicated writing a scheduler that will do a good job of using this properly is, and that I'm not particularly thrilled at the idea of having to do it myself, when there's a number of rather clever people trying to do it in the kernel.

What really won me over to threading is the implicit I/O. I got screwed over by paging, so I fought back (wasn't going to let myself be pushed around like that!), summoning the evil powers of mlockall(). That's where it struck me that I was forfeiting virtual memory, at this point, and figured that there had to be some way that sucked less. To use multiple cores, I was already going to have to use threads (assuming workloads that need a higher level of integration than processes), so I was already exposed to sharing and synchronization, and as I was working things out, it got clearer that this was one of those things where the worst is getting from one thread to more than one. I was already in it, why not go all the way?

One of the things that didn't appeal to me in threads was getting preempted. It turns out that when you're not too greedy, you get rewarded! A single-threaded, event-driven program is very busy, because it always finds something interesting to do, and when it's really busy, it tends to exhaust its time slice. With a blocking I/O, thread-per-request design, most servers do not overrun their time slice before running into another blocking point. So in practice, the state machine that I tried so hard to implement in user-space works itself out, if I don't eat all the virtual memory space with huge stacks. With futexes, synchronization is really only expensive in case of contention, so that on a single-processor machine, it's actually just fine too! Seems ironic, but none of it would be useful without futexes and a good scheduler, both of which we only recently got.

There's still the case of CPU intensive work, which could introduce trashing between threads and reduced throughput. I haven't figured out the best way to do this yet, but it could be kept under control with something like a semaphore, perhaps? Have it set to the maximum number of CPU intensive tasks you want going, have them wait on it before doing work, post it when they're done (or when there's a good moment to yield)...

omnifarious is right about being careful about learning from what others have done. Clever use of shared_ptr and immutable data can be used as a form of RCU, and immutable data in general tends to make good friends with being replicated (safely) in many places.

One of the great ironies of this, in my opinion, is that Java got NIO almost just in time for it to it to be obsolete, while we were doing this in C and C++ since, well, almost forever. Sun has this trick for being right, yet do it wrong, it's amazing!

Current Mood: contemplative
Current Location: Montreal, Canada

Saturday, I attended BarCampMontreal3, which was quite fun. I figured that I should really practice my presentation skills, so Thursday, when I found out it was this Saturday (not the next one as I has thought!), I had to find something to talk about.

I figured there would be a lot of web developers in the audience, and having noticed that a lot of web application platforms tend to disable many HTTP features that helped the web scale to the level it has today, I thought I could share a few tips on how to avoid busting bandwidth caps, deliver a better user experience and overall try to avoid getting featured on uncov.

It was well received, mostly (see the slides), although it felt a bit like a university lecture for some (maybe the blackboard Keynote theme didn't help, and I was also one of the few with a strictly educational presentation that was also technical). Marc-André Cournoyer writes that just one simple trick visibly improved his loading time, so it's not just for those who get millions of visitors! Since at least one person thought that, I guess I should clarify or expand on a few things...

( Read more... )

Current Mood: geeky

Last weekend, I got my residency permit turned down, which, to make a long story short, means that we'll be heading back to Canada. Seems like I was misdirected by the Consulat de France in Montreal, and from what I hear, it seems to be something they've done a few times (

azrhey worked in a place here where they hire a lot of foreigners, due to language skills).

So, it looks like I'm going to be looking for a job back in Montreal.

My weapons of choice are C++ and Perl, but being a Unix/Linux hacker, of course, I am not limited to those, they're just the ones I'm most deadly with. I am comfortable with meta-programming (mostly, but not limited to that of C++ templates), continuations/coroutines, closures, multithreading, as well as event-driven state machines. I am quite effective at code refactoring, particularly in strongly typed languages, where I can use the typing system to my advantage.

I am deeply intimate with Unix/Linux, mainly in the area of network programming (sockets, networking protocols, other forms of IPC). On Linux, I am quite familiar with a number of the high-performance APIs. I have a deep knowledge of the HTTP protocol (and some of its derivatives). I have experience writing Apache modules. I know the difference between bandwidth and latency (and wish more people did too). I have some experience with developing distributed software. I have a higher-than-average knowledge of ELF and Mach-O binary formats, particularly of how symbol resolution works. I know a good deal about component software (dynamically loading modules, for example), and ABI stability issues. While I am not a master at it, I have some Linux kernel development experience as well. I know what make is doing, and why.

Finally, I also have some experience doing project and release management, where I feel I did a pretty good job, and would certainly like to do more of it. I am familiar with the free and open source software community, belonging to a number of projects, including some that were part of my work.

Current Mood: determined
Current Music: Interpol - The New

lkcl posted a reply to my previous post...

Oh, lkcl, you're a bit of a pessimistic, there. unlink is also atomic, not just rename. ;-)

But what I'm looking after right now isn't just a general message-passing operation (I already do that at a higher level, between various subsystems, using shared memory for local or sockets for remote recipients), more specifically the best way to take advantage of multiple cores as much as possible to process file descriptor events.

You have to cut the libc people some slack, though, they don't have locks around every single functions. I'd be amazingly appalled if strncmp() took a lock, for example! And as I described, I already used processes and shared-memory for higher-level concurrency, this threading is only for the very lowest level, when I have no choice and I'm pushed to the edge. Note that the code is already written as a state machines, using the threads to run more than one state machine at once (there is one state machine per connection or so, more or less). I'm planning on having a lock on the state machine instance, so that a single state machine cannot be executed on more than one thread at once, so that the code inside of it can make that assumption (that still allows me to handle multiple independent connections at once).

On Linux, epoll already does that in an atomic way, for me, in its edge-triggered mode. You can just have multiple threads call epoll_wait(), a given event will be sent to only one thread, until it is re-armed. But the specific requirement here is "compatibility layer for portability to other POSIX platforms", hence the use of crummy old select(). Of course, if I need high-load scalability on some big iron, I'll be sure to not use that layer, and go for epoll, kqueue() or something like that!

That said, I'd like a link to that message-passing work, it sounds interesting.

Current Mood: geeky

I must be very silly, but I just realized that a Unix pipe is a semaphore. It's better in some aspects (is select()able) and worse in others (SEM_VALUE_MAX is lower). Cool.

In that context, the "signal handler that writes to a pipe" trick makes a new kind of sense (the semaphore equivalent, sem_post(), is the only synchronization function that is safe to use in a signal handler).

Current Mood: geeky
Current Location: Colomiers, France
Current Music: Jean Leloup - Isabelle

Here's the challenge: a select()-based event dispatcher that can scale from single-threaded to multi-threaded efficiently, preferably as simple as possible and able to be effectively edge-triggered (even if select() itself is level-triggered).

( Read more... )

Anyone has suggestions for improvements? They are most welcome!

Current Mood: geeky
Current Location: Toulouse, France

A good while ago, I declared my preference for edge-triggered events (for I/O), because in my mind, it most closely paralleled the actual workings of the hardware (IRQs being triggered when new data arrives on network interfaces or when a read/write request to disk is completed). I also figured that since it was possible to emulate level-triggered efficiently on top of edge- (that's what happens when you use close-to-the-metal "abstractions"), it probably was the more flexible choice.

( Read more... )

Current Mood: geeky
Current Location: Toulouse, France

S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29

Pierre Phaneuf's Diary

Entries tagged with scalability

Following Up On The End Of The World

A Few More Notes on HTTP

Reboot

Message passing? Yes!

Insight of the day

select() and a thread pool

Edge- Versus Level-Triggered Events

Profile

February 2016

Syndicate

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags