Request scripts


wesley

Hey

I'm pretty good with PHP/MySQL, so if anyone wants a simple script for whatever, just post your request here and if I have some time I might make it. Nothing too big though, and no promises either.

I can do PHP & Cocoa.
 


Could you write a script that would validate a list of RSS feeds? As in, determine if the URL is a.) valid and working b.) an RSS feed?

Thanks,
-Chris
 
Alright, just made it. I haven't tested it a lot, so let me know if it works.

It works like this:

1) Upload to your server
2) Type rss_validation.php?url=http://www.example.com/rss.php to validate that single feed.
3) If you want to validate a list of URLs, upload a file called urls.txt to the same location as the script, with each URL on a separate line. Then go to rss_validation.php (without the query ?url=..)

Note, I haven't tested the script with urls.txt so please report back to me on that and any errors that you find.

Also, the RSS validation is pretty simple: it basically just searches for <rss>, <feed> or <rdf:RDF>, so the feed could still be screwed up, but the chance of that is maybe 1 in 1000. Let me know if that is enough for you.
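
To give you an idea, here's a minimal sketch of the check I just described (the attachment is the real script -- this is just the idea, and the looks_like_feed name is made up for illustration):

PHP:
<?php
// Minimal sketch of the validation described above -- not the attached
// script itself. Fetch the URL and search the body for a feed tag.
function looks_like_feed($url)
{
    $body = @file_get_contents($url);   // false if the URL is down/invalid
    if ($body === false || $body === '') {
        return false;                    // not valid / not working
    }
    return stripos($body, '<rss') !== false
        || stripos($body, '<feed') !== false
        || stripos($body, '<rdf:RDF') !== false;
}

if (isset($_GET['url'])) {
    // single-feed mode: rss_validation.php?url=http://www.example.com/rss.php
    echo looks_like_feed($_GET['url']) ? 'valid' : 'invalid';
} elseif (file_exists('urls.txt')) {
    // list mode: one URL per line in urls.txt
    foreach (file('urls.txt') as $url) {
        $url = trim($url);
        if ($url === '') {
            continue;
        }
        echo $url . ': ' . (looks_like_feed($url) ? 'valid' : 'invalid') . "<br>\n";
    }
}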
 

Attachments

Thanks for doing this! The single URL validation seems to work great, but I have a list of around 400,000 feeds I need to validate :(

I tried breaking it up into lists as small as 5 URLs per txt file, and it hangs on the rss_validation.php page. I do eventually get this error though:

Warning: file_get_contents(Moreover Technologies - Search results for... Post) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in /home/cnotes/public_html/rss/rss_validation.php on line 64

Array
(
    [0] => Array
        (
            => http://p.moreover.com/cgi-local...ata.
        )
)

Any ideas?

Thanks,
-Chris
 
I'll look into it tomorrow morning; this time I'll test it with urls.txt :)

I'll make sure it can be run in the background, so you can close the page while the work is done (on such a big list, it's bound to take a lot of time), and then maybe email the results or whatever.
 

If the results could be a filtered list of just the validated URLs, plus maybe stats on how many there were in total, how many were valid RSS feeds, how many were invalid RSS feeds, and how many weren't valid URLs at all...

That would be incredible!!!

Thanks so much for your help. I had no idea how I was going to do this!
 
The trouble is going to be the CAPTCHAs - do you have any ideas so far? Oh and it would be cool if you built in a system to check confirmation emails. :)
 
Yeah, a forum poster is a little out of my league. As has been said, the captcha is the hardest thing; the rest not so much. If you can bypass the captcha, let me know.

Anyone else?
 
New version, please test again -- if it still doesn't work, send me a reasonably big list of RSS feeds to check; I only tested with about 10.

So, how does it work:

- upload your urls.txt (one URL per line)
- upload 2 empty files, valid.txt and invalid.txt, and make sure they are chmodded 777
- When it is done, the last line in each file will be "."

Comments:
I noticed you are using file_get_contents, which probably means you have PHP 4 and no curl.

I would suggest you at least install curl, and if possible install PHP 5. The script will still work with the setup you have, just much slower.

There are three possibilities:

1) file_get_contents: very slow. If you have PHP 4, I can't even override the default timeout (which is 60 seconds), so a feed that doesn't respond will hang the script for 60 seconds.
2) curl: here I can specify the timeouts (10 seconds to connect, 20 seconds total). It will also go faster since it can read gzipped content, etc.
3) curl_multi: this requires PHP 5 but is the best solution. It will create 20 curl handles and run them simultaneously, so things should be even speedier than option 2. (There's a rough sketch of this approach below.)
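
To make option 3 concrete, here's a rough sketch of the curl_multi approach -- not the attached script, just the idea, assuming PHP 5 with the curl extension. The batch size of 20 and the 10/20 second timeouts are the ones mentioned above; everything else (variable names etc.) is just illustration:

PHP:
<?php
// Sketch of the curl_multi approach: fetch 20 feeds at a time and sort
// each URL into valid.txt or invalid.txt based on a simple tag search.
ignore_user_abort(1);   // keep running if the browser window is closed
set_time_limit(0);

$urls  = array_filter(array_map('trim', file('urls.txt')));
$batch = 20;            // number of simultaneous curl handles

foreach (array_chunk($urls, $batch) as $chunk) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($chunk as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // 10 s connect timeout
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);         // 20 s total timeout
        curl_setopt($ch, CURLOPT_ENCODING, '');        // accept gzipped content
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // run all handles in this batch until they are finished
    $running = 0;
    do {
        curl_multi_exec($mh, $running);
        usleep(10000);
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        // "validation" is just the tag search mentioned earlier
        $isFeed = $code == 200 && $body !== '' &&
                  (stripos($body, '<rss') !== false ||
                   stripos($body, '<feed') !== false ||
                   stripos($body, '<rdf:RDF') !== false);
        file_put_contents($isFeed ? 'valid.txt' : 'invalid.txt', $url . "\n", FILE_APPEND);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}

// mark completion with a final "." line in each file
file_put_contents('valid.txt', ".\n", FILE_APPEND);
file_put_contents('invalid.txt', ".\n", FILE_APPEND);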

If you can't upgrade your server, then install WAMP (google it) on your PC and run the script locally. I would recommend you do this anyway.

Let me know if you get any errors, and if you do, send me a list of URLs to test.

Oh btw, I would indeed split that big file (400,000 URLs) up into some smaller files, just to be safe.

And another thing: the script will not stop if you close the browser window; if you want it to stop, you'll have to remove the line ignore_user_abort(1);. Make sure you don't execute the script again before the first instance has finished. (You can check this by looking at valid.txt: if the last line is just "." then the script is finished.)
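
If you want a quick way to check that from another script instead of eyeballing valid.txt, something like this should do it (again just a sketch):

PHP:
<?php
// The main script writes a lone "." as the final line of valid.txt when
// it is done, so checking the last line tells you whether a run finished.
$lines = file('valid.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
echo ($lines && end($lines) === '.') ? 'finished' : 'still running';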

-- MyOwnDemon: Yes, I can help you with that. But it would involve gay sex and midgets.
 

Attachments

Hey Wesley - this latest revision looks great. I'm trying to run it on my desktop computer, but it keeps saying every URL is invalid because it is an "empty document". Can you think of a reason why this would be happening? Could my localhost be unable to reach the pages due to a connection issue, so the script thinks it's a problem?
 
Have you tried it with just a single URL (via the query string)? It could be a setting that doesn't allow outgoing connections.

Try checking your php.ini and see if you find a line that says allow_url_fopen = 0

Change it to = 1 (restart Apache after that).

Anyway, it seems that you are not using curl; I would definitely compile PHP with curl support.
 
I've got a line that says:

"allow_url_fopen = On"

Is that the right setting?

On the WAMP localhost homepage, I see this:

5.2.2

Loaded extensions :
bcmath, calendar, com_dotnet, ctype, session, filter, ftp, hash, iconv, json, odbc, pcre, Reflection, date, libxml, standard, tokenizer, zlib, SimpleXML, dom, SPL, wddx, xml, xmlreader, xmlwriter, apache2handler, mbstring, curl, mysql, mysqli, PDO, pdo_sqlite, SQLite


Wouldn't that mean that curl is installed?
 