PHP threading question

airforcematt

New member
Dec 22, 2006
Hey y'all coder type peoples - got a question that I'm pretty sure someone here can help out with. I'm having an app developed that will spider "certain things" from "certain sites" - currently going through the bidding process on oDesk and have a guy bidding on my project who seems to know what he's talking about.

Oddly enough he's the only one suggesting PHP/cURL as the basis for the app (most are talking C/.NET/Windows-based etc). I know enough about PHP to know it doesn't natively support multi-threading, which is one of the crucial things for my app, so I asked him if PHP would work for multi-threading.

This was his reply:
Hi
You are right, there is no built-in support for multithreading, but we can achieve it by splitting the task into multiple jobs (files), with a separate PHP process running each job simultaneously

So - is he full of shit or can it actually function well like this? For some reason my BS alarm is going off on this guy.
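
For anyone following along, the "split the task into multiple jobs (files)" idea from his reply could look roughly like the sketch below. It is only an illustration under assumptions: the function name split_into_job_files(), the job_N.txt naming, and the idea that each job file holds one URL per line are all made up here, not taken from his proposal.

```php
<?php
// Sketch only: split a list of URLs into job files, one file per worker process.
// split_into_job_files() and the job_N.txt naming are hypothetical.

function split_into_job_files(array $urls, int $jobCount, string $dir): array
{
    if ($jobCount < 1) {
        throw new InvalidArgumentException('jobCount must be at least 1');
    }
    if ($urls === []) {
        return [];
    }

    // Chunk size is rounded up, so this writes at most $jobCount files.
    $chunkSize = (int) ceil(count($urls) / $jobCount);
    $files = [];

    foreach (array_chunk($urls, $chunkSize) as $i => $chunk) {
        $path = rtrim($dir, '/') . sprintf('/job_%d.txt', $i);
        file_put_contents($path, implode("\n", $chunk) . "\n"); // one URL per line
        $files[] = $path;
    }

    return $files;
}
```

Each returned file would then be handed to its own PHP process (for example a hypothetical worker.php that reads one job file), which is where the parallelism actually comes from.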
 


Yes, it can be done and done well, but it can also turn into a complete clusterfuck. It depends on the ability of the coder.

I know this doesn't really answer the question, but it comes down to the programmer's ability, and without knowing that, no one can say if his/her solution would be acceptable.

I would ask for a sample of his/her threading code and then go from there if you're considering using them.
 
C/.net/windows are a terrible choice since they won't easily give you more than one core to work with. PHP is totally doable, and can very easily handle a shitload of requests. Gimme a few minutes and I'll show a running example...
 
Curl as exposed in PHP supports nonblocking operations which is probably a fine alternative to multithreading for what you're describing. It's possible to simulate multithreading with PHP with nonblocking program execution, or using unix-style process control mechanisms like fork.
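
A minimal sketch of the nonblocking curl approach, using PHP's curl_multi_* functions to run several transfers at once in a single process. The function name fetch_all(), the result layout, and the timeout default are assumptions made for illustration; a real spider would also want rate limiting and retries.

```php
<?php
// Sketch only: fetch several URLs concurrently in one process with curl_multi.
// fetch_all() and the shape of its return value are hypothetical.

function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail immediately on some setups; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect results under the same keys as the input array.
    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = [
            'url'    => $urls[$key],
            'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
            'body'   => curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($multi, $ch);
    }

    return $results;
}
```

Calling fetch_all(['http://example.com/', 'http://example.org/']) would issue both requests in parallel and return once both have finished or timed out.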

That said, long-lived processes and asynchronous operations aren't really the language's strong suits, and as spider-type applications tend to benefit from those things, almost anything else would be better. I'd probably pick C# 2.5/3.0, as it's a doddle to write that sort of thing and you can use Windows or the Mono runtime.
 
It's not actually "threading" but you can manipulate your code to do a ton of stuff in parallel on PHP. Between CRON jobs and ajax tricks, it's definitely possible.
 
Yes, it can be done and done well, but it can also turn into a complete clusterfuck. It depends on the ability of the coder.

I know this doesn't really answer the question, but it comes down to the programmer's ability, and without knowing that, no one can say if his/her solution would be acceptable.

I would ask for a sample of his/her threading code and then go from there if you're considering using them.

Cool, thanks for the suggestion.

C/.net/windows are a terrible choice since they won't easily give you more than one core to work with. PHP is totally doable, and can very easily handle a shitload of requests. Gimme a few minutes and I'll show a running example...

Thanks - was leaning towards windows solutions till you posted this.

Curl as exposed in PHP supports nonblocking operations which is probably a fine alternative to multithreading for what you're describing. It's possible to simulate multithreading with PHP with nonblocking program execution, or using unix-style process control mechanisms like fork.

That said, long-lived processes and asynchronous operations aren't really the language's strong suits, and as spider-type applications tend to benefit from those things, almost anything else would be better. I'd probably pick C# 2.5/3.0, as it's a doddle to write that sort of thing and you can use Windows or the Mono runtime.

Would I be correct in guessing that's what Coder #2 meant when he referred to "php/multicurl" as his solution to my threading needs?

Appreciate all the advice/feedback guys :)

PS, if any of y'all are looking to do some coding, drop me a PM and we can talk specifics. I'm not locked into using el-random freelancers for this - would rather hook a WF bro up if the pricing is right + I'm pretty sure some of y'all have a 90% solution for what I'm wanting to do.
 
So here is an example: I've got a very simple script that takes a large list of domains, finds all of the IPs associated with the A record of those domains, and spits them back into a file. I have one control process running, and I get this:

root@mail:/root/tempdns# php grab.php
Testing DNS Servers
DNS Test Complete
Found: 3290 - Not Found: 1622 - Timeout: 35
February 11, 2011, 11:32 pm processed 5000 records in 24 seconds 208.33333333333/sec last 002tk.com
Found: 3730 - Not Found: 1180 - Timeout: 38
February 11, 2011, 11:32 pm processed 10000 records in 43 seconds 232.55813953488/sec last 008casino.com
Found: 3396 - Not Found: 1533 - Timeout: 21
February 11, 2011, 11:33 pm processed 15000 records in 61 seconds 245.90163934426/sec last 010bjr.com
Found: 3811 - Not Found: 1141 - Timeout: 5
February 11, 2011, 11:33 pm processed 20000 records in 78 seconds 256.41025641026/sec last 019u.com
Found: 3846 - Not Found: 1107 - Timeout: 6
February 11, 2011, 11:33 pm processed 25000 records in 96 seconds 260.41666666667/sec last 020webseo.com
Found: 3607 - Not Found: 1345 - Timeout: 8
February 11, 2011, 11:34 pm processed 30000 records in 115 seconds 260.86956521739/sec last 023taobao.com
Found: 3573 - Not Found: 1387 - Timeout: 6
February 11, 2011, 11:34 pm processed 35000 records in 133 seconds 263.15789473684/sec last 028msyc.com
Found: 3604 - Not Found: 1354 - Timeout: 9
February 11, 2011, 11:34 pm processed 40000 records in 152 seconds 263.15789473684/sec last 0325721609.com
Found: 3609 - Not Found: 1334 - Timeout: 25
February 11, 2011, 11:35 pm processed 45000 records in 173 seconds 260.11560693642/sec last 0411sj.com
Found: 3611 - Not Found: 1337 - Timeout: 23
February 11, 2011, 11:35 pm processed 50000 records in 193 seconds 259.06735751295/sec last 05005.com

CPU usage goes between 2-8% with the top line normally looking like this

Cpu(s): 6.3%us, 17.8%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st

Interface traffic shows averages of around ~150K/sec sent and ~350K/sec received for a total of ~500K/sec of small packets.

I could also very easily split the control process into around 6 parallel copies, with a linear increase in requests.
 
As for whether PHP is a good choice... if the coder is any good then it's a perfectly good choice. If the coder doesn't know how to manage his memory well, or thinks multicurl is the fastest way to use curl, then I wouldn't work with him.

If you want a safe bet, any half-decent Python coder should be able to do a nice job without any issues. If worst comes to worst, it shouldn't be hard to find a coder to make any fixes later if the original coder bails on you.
 
As for whether PHP is a good choice... if the coder is any good then it's a perfectly good choice. If the coder doesn't know how to manage his memory well, or thinks multicurl is the fastest way to use curl, then I wouldn't work with him.

If you want a safe bet, any half-decent Python coder should be able to do a nice job without any issues. If worst comes to worst, it shouldn't be hard to find a coder to make any fixes later if the original coder bails on you.

Thanks for the advice/example - appreciate it. Any thoughts/advice as to using Perl for something like this? Have someone pitching me that as a solution, and the thought of having a server-run + Windows-capable app is interesting.
 
Thanks for the advice/example - appreciate it. Any thoughts/advice as to using Perl for something like this? Have someone pitching me that as a solution, and the thought of having a server-run + Windows-capable app is interesting.

Python will run fine on Windows; all of those little apps mattseh makes are done in Python.

I used to be a big fan of Perl. I was very resistant to switch to PHP and at the time PHP really did suck. At this point in time there isn't a huge speed difference between the two. That said, Perl code sucks to fix later, and it hasn't kept up with all the little shortcuts of any other modern language. I wouldn't pay for Perl code, don't care who is writing it.
 
I've tried to build threading shit in PHP and it's a pain in the ass. The idea of a controlling script with a while loop that fires off processes is probably the best you'll get.

FWIW the project I worked on took uploaded videos and fired off encoding processes for them. Boss wanted a threaded daemon thing, but after working for days with pthreads in PHP it turned out that a pretend daemon to control the processes was faster to implement, more stable and easier to troubleshoot.
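
A rough sketch of that "controlling script with a while loop" pattern, using proc_open to keep a fixed number of worker processes running. The name run_jobs(), the result layout, and the polling interval are assumptions for illustration, and the nonblocking pipe reads assume a Unix-like system.

```php
<?php
// Sketch only: a control loop that keeps up to $maxParallel worker processes running.
// run_jobs() and the shape of its return value are hypothetical.

function run_jobs(array $commands, int $maxParallel = 4): array
{
    $pending = $commands; // shell commands not started yet (keys preserved)
    $running = [];        // key => ['proc' => resource, 'out' => stream, 'buf' => string]
    $results = [];        // key => ['exit' => int, 'output' => string]

    while ($pending || $running) {
        // Start new workers while there is spare capacity.
        while ($pending && count($running) < $maxParallel) {
            $key = array_key_first($pending);
            $cmd = $pending[$key];
            unset($pending[$key]);

            $proc = proc_open($cmd, [1 => ['pipe', 'w']], $pipes); // capture stdout
            if ($proc === false) {
                $results[$key] = ['exit' => -1, 'output' => ''];
                continue;
            }
            stream_set_blocking($pipes[1], false);
            $running[$key] = ['proc' => $proc, 'out' => $pipes[1], 'buf' => ''];
        }

        // Poll each worker: drain its output and reap it once it has exited.
        foreach ($running as $key => $job) {
            $running[$key]['buf'] .= stream_get_contents($job['out']);
            $status = proc_get_status($job['proc']);
            if (!$status['running']) {
                $running[$key]['buf'] .= stream_get_contents($job['out']);
                fclose($job['out']);
                proc_close($job['proc']);
                $results[$key] = [
                    'exit'   => $status['exitcode'],
                    'output' => $running[$key]['buf'],
                ];
                unset($running[$key]);
            }
        }

        usleep(50000); // poll every 50 ms instead of spinning
    }

    return $results;
}
```

The commands would typically be things like "php worker.php job_0.txt" (worker.php being whatever script does the real work), so a crash in one worker shows up as a bad exit code without taking the controller down.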
 
So here is an example: I've got a very simple script that takes a large list of domains, finds all of the IPs associated with the A record of those domains, and spits them back into a file. I have one control process running, and I get this:

root@mail:/root/tempdns# php grab.php
Testing DNS Servers
DNS Test Complete
Found: 3290 - Not Found: 1622 - Timeout: 35
February 11, 2011, 11:32 pm processed 5000 records in 24 seconds 208.33333333333/sec last 002tk.com
Found: 3730 - Not Found: 1180 - Timeout: 38
February 11, 2011, 11:32 pm processed 10000 records in 43 seconds 232.55813953488/sec last 008casino.com
Found: 3396 - Not Found: 1533 - Timeout: 21
February 11, 2011, 11:33 pm processed 15000 records in 61 seconds 245.90163934426/sec last 010bjr.com
Found: 3811 - Not Found: 1141 - Timeout: 5
February 11, 2011, 11:33 pm processed 20000 records in 78 seconds 256.41025641026/sec last 019u.com
Found: 3846 - Not Found: 1107 - Timeout: 6
February 11, 2011, 11:33 pm processed 25000 records in 96 seconds 260.41666666667/sec last 020webseo.com
Found: 3607 - Not Found: 1345 - Timeout: 8
February 11, 2011, 11:34 pm processed 30000 records in 115 seconds 260.86956521739/sec last 023taobao.com
Found: 3573 - Not Found: 1387 - Timeout: 6
February 11, 2011, 11:34 pm processed 35000 records in 133 seconds 263.15789473684/sec last 028msyc.com
Found: 3604 - Not Found: 1354 - Timeout: 9
February 11, 2011, 11:34 pm processed 40000 records in 152 seconds 263.15789473684/sec last 0325721609.com
Found: 3609 - Not Found: 1334 - Timeout: 25
February 11, 2011, 11:35 pm processed 45000 records in 173 seconds 260.11560693642/sec last 0411sj.com
Found: 3611 - Not Found: 1337 - Timeout: 23
February 11, 2011, 11:35 pm processed 50000 records in 193 seconds 259.06735751295/sec last 05005.com

CPU usage goes between 2-8% with the top line normally looking like this

Cpu(s): 6.3%us, 17.8%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st

Interface traffic shows averages of around ~150K/sec sent and ~350K/sec received for a total of ~500K/sec of small packets.

I could also very easily split the control process into around 6 parallel copies, with a linear increase in requests.

Your brain is sexy, need a job?
 
C/.net/windows are a terrible choice since they won't easily give you more than one core to work with. PHP is totally doable, and can very easily handle a shitload of requests. Gimme a few minutes and I'll show a running example...

I generally agree with what you're saying, and I'm not hating on PHP (for once ;)) but I don't think you're correct about .NET being limited to one core? Shouldn't matter for this anyways, as I'm sure it'll be network, not CPU limited.

Matt, async is my middle name, if you're still looking for someone, hit me up :)