Very well written and easy to understand. I was looking forward to this. Thank you!
Thank you, I am really glad to hear that!
One of the best articles on using Python's multiprocessing module. Very easy to follow and apply. Thanks for the wonderful work.
Finally, I have found a guide on Python parallelism for dummies. I found some other articles, but this one was really the easiest to understand. Congratulations! P.S.: Sorry for my messy English.
Thanks, Sebastian, for your wonderful article. When I run the cube function with multiprocessing using Pool, it does not recognize the function cube defined globally. Even if I use an `if __name__ == '__main__':` guard, it does not work. I get this error traceback: self.run(), self._target(*self._args, **self._kwargs), task = get(), return recv(). Is there a way I can run this program? Thanks
Great! Thank you.
This is exactly what I've been looking for. Thanks a bunch!
Great article explaining the difference between apply, apply_async, map and map_async. Thank you!
Great!!!
Very nice examples! Will you also show how to parallelize two functions whose outputs need to appear in order?
Thanks! Do you mean 2 different functions that need to be executed in order? E.g.,
out_1 = process(x)
out_2 = process_output_of_process(out_1)
?
It's hard for me to understand the concept, since with a simple cube test the serial version performs better by far.
code example:

import argparse
import time
import multiprocessing as mp

def cube(x):
    return x ** 3

def s1(n):
    s = time.time()
    pool = mp.Pool(processes=3)
    results = [pool.apply(cube, args=(x,)) for x in range(1, n)]
    e = time.time()
    print(e - s)

def s2(n):
    s = time.time()
    results = [cube(x) for x in range(1, n)]
    e = time.time()
    print(e - s)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", "--length", type=int, required=True,
                        help="requested password length")
    user_input = parser.parse_args()
    s1(user_input.length)
    s2(user_input.length)
What am I doing wrong?
Dear Sebastian, I've altered your program slightly to the following code:
import multiprocessing
import random
import string

def rand_string(length, output):  # Define an example function.
    """Generates a random string of numbers, lower- and uppercase chars. CPU-heavy."""
    rand_str = ''.join(random.choice(string.ascii_lowercase
                                     + string.ascii_uppercase
                                     + string.digits)
                       for i in range(length))
    output.put(rand_str)  # Put the string into the queue.

if __name__ == "__main__":  # The program doesn't work without the main block!
    N = 10000  # Number of random characters per string
    Processes = 4  # Number of processes to create
    OutputQueue = multiprocessing.Queue()  # Define an output queue.
    # Set up a list of processes that we want to run.
    processes = [multiprocessing.Process(target=rand_string, args=(N, OutputQueue))
                 for x in range(Processes)]
    for p in processes:
        p.start()  # Run processes.
    for p in processes:
        p.join()  # Exit the completed processes.
    results = [OutputQueue.get() for p in processes]  # Get process results from the output queue.
    print(results)
For N = 1000 it works well; however, for N = 10000 it doesn't print any results, and the shell gets paralysed. In my task manager I see that several Python processes are initialised, but they don't do anything.
I guess the problem is the rand_string function, because if I use another function instead, the issue doesn't occur.
I just found out that everything works without problems if one exchanges the block

for p in processes:
    p.join()  # Exit the completed processes.
results = [OutputQueue.get() for p in processes]  # Get process results from the output queue.
print(results)

with

results = [OutputQueue.get() for p in processes]  # Get process results from the output queue.
print(results)
for p in processes:
    p.join()  # Exit the completed processes.

But I still don't understand why it works now.
Dear Sebastian, thank you for sharing the blog with us. It's nicely written and well explained. I was trying to implement the logic of this code in my own problem. My problem is that I have to write each calculation (parzen_estimation in your code) to a file. Writing the final results to files takes hours because of the large data size. Is it possible to write the file inside the calculation and return 0, so that I can save memory and time?
Hi, Ankit,
thanks for the comment and the nice words!
About your little problem: Sorry, but I am not sure I understand it correctly. Are you trying to run the exact same code that I used in the blog article, or similar code on a larger dataset that may cause the long runtime? (The code and data in the blog article should only take seconds to complete; minutes in the case of the benchmark, if I remember correctly.) Here is a link to the IPython notebook so that you don't have to copy and paste it from the blog, in case it helps: http://nbviewer.ipython.org...
Hi, thanks for your kind support. I was trying to run similar code on a larger dataset. The main for loop (widths, in your case) is on the order of 512,000 iterations. The main calculation contains two nested for loops: the outer runs 50 times and the inner 1,000 times. The inner calculation depends on w (as in your `for w in widths`). So my serial code takes more than 60 hours to compute. When I converted it to parallel code and ran it with four processes, it took more than 24 hours, of which 3 hours were spent writing the final answers to a file.
I then applied a new method: I divided the main loop of 512,000 into 4 parts and ran 4 copies of the code in parallel on my machine, and after they completed I concatenated all the outputs into a single file. This whole process took 12 hours on an i5 machine. I like the previous method of doing the calculation, but I don't want to spend 3 hours writing the output to files, because during that time the machine becomes too slow and I can't do anything else.
Is there any way to save that time so the machine doesn't become slow? Thanks
Hi, Ankit,
there are many different options you could try. I think your best bet is to write this data to an SSD instead of an HDD. HDDs are really useful due to their large storage capacity per price; however, it can be helpful to also put an SSD into your machine for I/O-heavy computations.
But there are also some tricks you can apply in Python (if you haven't done so yet). You could write the output in binary mode, e.g., instead of
with open('my_file.txt', 'w') as outfile:
...
you could try
with open('my_file.txt', 'wb') as outfile:
...
Also, you could try to increase the buffer size so that the output is written less frequently. By default, Python uses the OS defaults (typically, it is flushing on every new line), and it may be worthwhile to increase the buffer size. For example, to increase the buffer size to 1 MB, you could try:
with open('my_file.txt', 'wb', 1048576) as outfile:
...
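And in case it is useful, a minimal sketch of the buffered variant (the example data and the file name are just placeholders):

```python
# Write binary output with a 1 MB buffer, so the data is flushed to disk
# in fewer, larger chunks instead of on (nearly) every write.
with open('my_file.txt', 'wb', 1048576) as outfile:
    for i in range(100000):
        outfile.write(str(i).encode() + b'\n')
```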
Hope that this is somewhat helpful!
Best,
Sebastian
Thank you for sharing the knowledge; it is clear and enabled me to apply it to my own project.
Cool, nice to hear! Thanks for the feedback!
Very cool intro. Much easier to understand than the other ones I saw.
This will be really useful in my work, thanks a lot!
I am curious to know if there is an easy way to get a return code after the parallel processing work is done, in order to start a different Python script?
Thank you for this tutorial! What does it mean if I can't seem to complete the first part? http://sebastianraschka.com...
It doesn't return any result; it looks as if it got locked. I'm using Anaconda with the Spyder IDE and an IPython terminal.
Thanks!
Never mind, I just added the `if __name__ == '__main__':` guard and now it works.
Still no result for me. Something looks wrong with output.get(). Great article, though. Thank you very much.
Great article. Looking forward to seeing more developments on this topic. Thanks.