[Resolved] [Python] Parallel, fully independent tasks


LucasBrown
Posts: 299
Joined: Thu Apr 15, 2010 2:57 am UTC
Location: Poway, CA

[Resolved] [Python] Parallel, fully independent tasks

Postby LucasBrown » Tue Mar 18, 2014 11:39 am UTC

I'm writing a function in Python to factor large numbers by running several algorithms in parallel: Pollard's rho algorithm, Lenstra's elliptic curve method, and the multiple-polynomial quadratic sieve. I've defined these as pollardRho(), ecm(), and mpqs(); each takes the number to be factored as its only argument.

What I want to do is this:
  • Feed n, the number to factor, into parafac(), the parallel factoring function
  • That function executes pollardRho(n), ecm(n), and mpqs(n) in parallel and waits for one of them to return a factor. Let's say that mpqs finishes first and returns f.
  • parafac kills pollardRho and ecm and returns f.
What I have is this:

Code: Select all

def parafac(n):
    # Run pollardRho(n), ecm(n), and mpqs(n) in parallel.
    # Return the return value of the first function to finish; kill the other processes.
   
    raise NotImplementedError    # TODO

What I'm getting hung up on is how to do the parallelization. threading? multiprocessing? concurrent.futures? And how do I obtain the return values, which these parallelizing libraries appear to discard?

If it helps, here's my sysinfo:
Spoiler:
  • Linux Mint 16
  • Python 2.7
  • Processor: Intel Core i7-990X (6 cores / 12 threads), and I'm not afraid to use 'em
  • 24 GB RAM, and I'm not afraid to use it either.
Last edited by LucasBrown on Thu Apr 03, 2014 1:42 am UTC, edited 2 times in total.

heatsink
Posts: 86
Joined: Fri Jun 30, 2006 8:58 am UTC

Re: [Python] Parallel, fully independent tasks

Postby heatsink » Tue Mar 18, 2014 5:56 pm UTC

You're most likely using CPython, whose global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so multithreading won't give you parallelism. Perhaps someone else can point out useful high-level Python libraries; if those don't fit your needs, you can use the subprocess module to launch parallel processes that communicate through pipes.
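
On Linux that could look something like this (untested; factor_worker.py is a made-up wrapper script that runs the named algorithm on n and prints whatever factor it finds):

Code: Select all

import select
import subprocess
import sys

def parafac(n):
    # One worker process per algorithm; each prints its result to stdout.
    procs = [subprocess.Popen([sys.executable, "factor_worker.py", alg, str(n)],
                              stdout=subprocess.PIPE)
             for alg in ("pollardRho", "ecm", "mpqs")]
    try:
        # Block until any worker writes a line to its pipe.
        ready, _, _ = select.select([p.stdout for p in procs], [], [])
        return int(ready[0].readline())
    finally:
        for p in procs:
            p.kill()   # stop the stragglers (killing the winner is harmless)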

LucasBrown
Posts: 299
Joined: Thu Apr 15, 2010 2:57 am UTC
Location: Poway, CA

Re: [Python] Parallel, fully independent tasks

Postby LucasBrown » Tue Mar 18, 2014 10:35 pm UTC

Of course it gives parallelism! That's what the whole threading and multiprocessing modules are for!

Xenomortis
Not actually a special flower.
Posts: 1448
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: [Python] Parallel, fully independent tasks

Postby Xenomortis » Tue Mar 18, 2014 10:38 pm UTC

Unfortunately, the design of CPython means it is not truly parallel. Any multi-threading you do with it will run as though it were running on a single core processor.
If you want parallelism using CPython, you have to use multiple processes.
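
For example, with the standard library's multiprocessing module (a minimal sketch):

Code: Select all

from multiprocessing import Pool

def burn(n):
    # CPU-bound busy work; with threads this would serialize on the GIL.
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    pool = Pool(processes=4)
    # Each call runs in its own interpreter process, so the four tasks
    # genuinely execute at the same time on a multi-core machine.
    print(pool.map(burn, [10**6] * 4))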

Diemo
Posts: 396
Joined: Mon Dec 03, 2007 8:43 pm UTC

Re: [Python] Parallel, fully independent tasks

Postby Diemo » Tue Mar 18, 2014 10:56 pm UTC

You probably want to use PP (Parallel Python). I have not used it myself, but someone else on our cluster did, and he says it scales fine up to 8 workers (provided that's what he was talking about, of course).
In the beginning the Universe was created.
This has made a lot of people very angry and been widely regarded as a bad move.
--Douglas Adams

thoughtfully
Posts: 2253
Joined: Thu Nov 01, 2007 12:25 am UTC
Location: Minneapolis, MN

Re: [Python] Parallel, fully independent tasks

Postby thoughtfully » Wed Mar 19, 2014 1:05 am UTC

The batteries-included solution is the multiprocessing module, which has an API very similar to the threading module but runs its workers in separate processes. PP looks like it may have some additional features, so it could also be a good choice depending on your needs.
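
To illustrate the similarity (a toy sketch, untested):

Code: Select all

from threading import Thread
from multiprocessing import Process

def work(n):
    print(n * n)

if __name__ == '__main__':
    # Same constructor arguments, same start()/join() protocol:
    t = Thread(target=work, args=(7,))    # concurrent, but GIL-bound
    p = Process(target=work, args=(7,))   # a real second interpreter
    t.start(); p.start()
    t.join(); p.join()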

For applications that are not compute-bound, consider simply tolerating the GIL; the cost isn't necessarily high.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
-- Antoine de Saint-Exupery

wumpus
Posts: 546
Joined: Thu Feb 21, 2008 12:16 am UTC

Re: [Python] Parallel, fully independent tasks

Postby wumpus » Wed Mar 19, 2014 8:20 pm UTC

multiprocessing appears to be what you are asking for. It has a few problems:
1. Starting a multiprocessing worker requires an entire separate Python interpreter. You might not be happy killing and starting new processes all the time.
2. multiprocessing appears to be tied to CPython. Heavy numerical programming might be better suited to PyPy (that is PyPy's big strength; it could easily outrun CPython by the required factor of 3 before overhead issues). NumPy (or SWIGged C code) could swing things back to CPython.
3. If you use multiprocessing, my guess is that you would simply have three separate processes running the factorization routines, and send the number to be factored via a queue. Somewhere near the inner loop, each process would have to look for a new item on the queue; if it found one, it would stop all previous factoring and factor the new number. My first idea was to use both threading and multiprocessing, with each process having a master thread (which looks at the queue and restarts the worker as needed). It isn't quite clear how safe that is: http://stackoverflow.com/questions/12984003/status-of-mixing-multiprocessing-and-threading-in-python (see the sketch just below).
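
The long-lived-worker shape from point 3 would be roughly this (a rough sketch, untested; the hard part, abandoning the current job mid-factorization when a new number arrives, is what the master-thread idea was for and is omitted here):

Code: Select all

import multiprocessing as mp

def worker(algorithm, jobs, results):
    # Long-lived process: pull numbers off the jobs queue forever.
    while True:
        n = jobs.get()        # blocks until a new number arrives
        if n is None:         # sentinel value: time to shut down
            break
        results.put((n, algorithm(n)))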

Finally, how is this not an example of premature optimization? Are the three routines sufficiently complementary that you would run all three even without any parallelism? I suspect scipy (numpy) would be best if you're using "matlab"-type techniques, and PyPy otherwise. Running all three in parallel via multiprocessing is trivial; stopping them efficiently is the bit that will get tricky.

LucasBrown
Posts: 299
Joined: Thu Apr 15, 2010 2:57 am UTC
Location: Poway, CA

Re: [Python] Parallel, fully independent tasks

Postby LucasBrown » Thu Mar 20, 2014 2:14 am UTC

@wumpus: Thanks! That was quite helpful. My code is now

Code: Select all

import multiprocessing as mp
from time import sleep

def multifactor(n):
    # Note that the multiprocessing incurs overhead to the tune of 0.1s.
    # This is generally negligible compared to the time to factor the sort of numbers this is fed
    # and the time already spent in the first two phases of the factoring.

    # I'm defining a function inside another function.  Something's wrong with me.
    # (This works because Linux forks the workers; the nested function wouldn't
    # be picklable on platforms that spawn fresh interpreters.)
    def factory(method, n, output):
        output.put(method(n))

    facs = mp.Queue()

    p = mp.Process(target=factory, args=(pollardRho_brent, n, facs))
    e = mp.Process(target=factory, args=(ecm, n, facs))
    m = mp.Process(target=factory, args=(mpqs, n, facs))

    p.start(); e.start(); m.start()

    # Poll until one of the workers deposits a factor on the queue.
    while facs.empty(): sleep(0.01)

    p.terminate(); e.terminate(); m.terminate()

    return facs.get()
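
An equivalent version that blocks on the queue instead of polling (untested, but it should behave the same):

Code: Select all

import multiprocessing as mp

def multifactor(n):
    def factory(method, n, output):
        output.put(method(n))

    facs = mp.Queue()
    procs = [mp.Process(target=factory, args=(f, n, facs))
             for f in (pollardRho_brent, ecm, mpqs)]
    for proc in procs:
        proc.start()

    result = facs.get()      # blocks until the first worker reports a factor
    for proc in procs:
        proc.terminate()
    return result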

And as for the premature optimization... this parallel function is called only after the number of interest has already been chugged on by several million rounds of trial division and Pollard's rho algorithm. What remains in the unfactored portion should therefore be only large primes (which ecm does a rather good job of extracting) and very large primes (which are mpqs's specialty). I'm including pollardRho_brent in the mix just in case a small factor or two somehow escaped the first few million rounds (which occasionally happens in my testing). I would call these without parallelism, but to do so effectively would involve determining the size of the factors beforehand, and the only real way to do that is to just factor the number, so we're back where we started.

Also, thanks for pointing out PyPy. It cut the time on my benchmarks by a factor of 10!

lvc
Posts: 20
Joined: Mon Jun 15, 2009 5:27 am UTC

Re: [Resolved] [Python] Parallel, fully independent tasks

Postby lvc » Fri Mar 21, 2014 8:49 am UTC

LucasBrown wrote:What I'm getting hung up on is how to do that parallelization. Multithreading? Multiprocessing? Concurrent.futures? How do I obtain the return values, which these parallelizing libraries appear to discard?


concurrent.futures is actually really, really good at things like this - in fact, it's basically what it exists for. It abstracts away the queue management you've coded manually (which you wouldn't need at all if you didn't care about the results). I would code your case roughly like this (untested):

Code: Select all

from concurrent import futures

def parafac(n):
    # Run pollardRho(n), ecm(n), and mpqs(n) in parallel.
    # Return the return value of the first function to finish; kill the other processes.
     
    with futures.ProcessPoolExecutor() as executor:
        pending = [executor.submit(func, n) for func in (pollardRho, ecm, mpqs)]
        result = next(futures.as_completed(pending)).result()
        for future in pending:
            future.cancel()
        return result

wumpus
Posts: 546
Joined: Thu Feb 21, 2008 12:16 am UTC

Re: [Python] Parallel, fully independent tasks

Postby wumpus » Fri Mar 21, 2014 8:09 pm UTC

LucasBrown wrote:@wumpus: Thanks! That was quite helpful. My code is now
... and the only real way to do that is to just factor the number, so we're back where we started.

Also, thanks for pointing out PyPy. It cut the time on my benchmarks by a factor of 10!


My idea was to make it work first, then add the tricky stuff. It was pretty clear that such things might not be an option, but you hadn't made it clear that you already knew it would be a slow routine (so the overhead wouldn't be as important). Still, with a 10x improvement you could probably run all three sequentially :) before trying your parallelism (I imagine you are running at least ~20 threads now that it works).

LucasBrown
Posts: 299
Joined: Thu Apr 15, 2010 2:57 am UTC
Location: Poway, CA

Re: [Resolved] [Python] Parallel, fully independent tasks

Postby LucasBrown » Sat Mar 22, 2014 3:45 am UTC

lvc wrote:concurrent.futures is actually really, really good at things like this - in fact, it's basically what it exists for. It abstracts away the queue management you've coded manually (which you wouldn't need at all if you didn't care about the results). I would code your case roughly like this (untested):

Code: Select all

from concurrent import futures

def parafac(n):
    # Run pollardRho(n), ecm(n), and mpqs(n) in parallel.
    # Return the return value of the first function to finish; kill the other processes.
     
    with futures.ProcessPoolExecutor() as executor:
        pending = [executor.submit(func, n) for func in (pollardRho, ecm, mpqs)]
        result = next(futures.as_completed(pending)).result()
        for future in pending:
            future.cancel()
        return result

I tried that, but when one subprocess finished I wasn't able to cancel the other two: the .cancel() call returned False for all three futures (the completed one because it was already done, and the other two because they were running and refused to cancel), and the other early-termination calls also failed. I instead had to wait for all three to finish, which is exactly what I'm trying to avoid. When chugging on (for example) 38! + 1, mpqs finishes in about 20 seconds while pollardRho_brent and ecm could take days.

lvc
Posts: 20
Joined: Mon Jun 15, 2009 5:27 am UTC

Re: [Resolved] [Python] Parallel, fully independent tasks

Postby lvc » Sat Mar 22, 2014 11:29 am UTC

Ok, I missed the part in the docs that says "A Future cannot be cancelled if it is running or has already completed". So, yeah, that won't work. On the other hand, calling executor.shutdown(wait=False) should, as long as you are using a ProcessPoolExecutor rather than a ThreadPoolExecutor. A ProcessPoolExecutor kills all of its worker processes when you do that, while a ThreadPoolExecutor doesn't try to kill its threads: it just notifies the workers to exit, which they don't check while they're running a task. So the threads get signalled to shut down, the method returns without waiting for them to pay attention to it... and then you exit the context manager, which ends up calling executor.shutdown(wait=True). That would explain why this kind of approach didn't work for you: it doesn't with threads, but it does with processes.

So, this should actually work:

Code: Select all

from concurrent import futures

def parafac(n):
    # Run pollardRho(n), ecm(n), and mpqs(n) in parallel.
    # Return the return value of the first function to finish; kill the other processes.
     
    with futures.ProcessPoolExecutor() as executor:
        pending = [executor.submit(func, n) for func in (pollardRho, ecm, mpqs)]
        result = next(futures.as_completed(pending)).result()
        executor.shutdown(wait=False)
        return result


It's also a line shorter than my last attempt, which is kind of nice.
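
One caveat, since you're on Python 2.7: concurrent.futures only entered the standard library in Python 3.2, so you'll need the backport from PyPI:

Code: Select all

pip install futures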

