Help writing a webpage download program

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

Qaanol
The Cheshirest Catamount
Posts: 3069
Joined: Sat May 09, 2009 11:55 pm UTC

Help writing a webpage download program

Postby Qaanol » Wed Feb 08, 2012 10:08 pm UTC

I want to make a program that will run continuously, and every so often (say, 5 minutes) it looks at a certain webpage. If the contents of the webpage have changed, then it will save a copy with a timestamp added to the name.

I am on Mac OS X, and I have only rudimentary programming experience (well, I am fairly proficient with Matlab, but that’s definitely not the right language for this job). I am aware of the following: the ‘wget’ command is not available in OS X, but ‘curl’ is. A program can be made to launch when the machine starts up by putting it in the LoginItems folder or the StartupItems folder. I do not know how to do a bit-wise check for differences between files, nor do I know how to use md5 hashes.

The following can be hard-coded:
URL to check (in this case it is http://api.usno.navy.mil/imagery/moon.png, which is the image used on this page showing What the Moon looks like now)
Directory in which to save files (perhaps ‘../Archive/’)
Time interval between checking the URL (5 minutes)

Then it just has an endless loop until the program is terminated:
If time interval has elapsed since last check (or if there was no previous check) then
Fetch the file from the URL
Compare to previous versions, probably with md5 hashes
If the file changed, save a copy of it with the current time in its filename, and save its md5 hash

I have a tiny bit of experience with Objective-C, and I have Xcode installed and know the basics of how to use it, so that is one option for the language. I don’t have much experience with scripting languages, though I know Mac OS X supports several.

So I’m asking for help. What is a good language to use for this task, keeping in mind the person doing it does not yet know that language? What are the relevant commands in that language for doing the various parts of this program, and vitally, what is the exact syntax for using those commands the “right way”?

This is not in any way homework. It is something I want to do for myself, and if it serves as a stepping stone toward learning some programming or scripting language, so much the better.

Spoiler:
My experience with Matlab taught me that I learn programming languages best when I have the basics shown to me in a hand-holding sort of “type exactly this and it will do exactly that” way. Once I get the fundamental syntax, semantics, and functionality down, then I feel comfortable enough using the language that I know what to look up when I don’t know how to do something, and then I can learn and master the rest on my own. But to get off the ground I need experience doing simple things that I have been shown exactly how to do.

This may be compared to the C++ and Java classes I took in college, where there was a lot of conceptual stuff and it got to the point where I was thinking, “Yes, I understand the principle behind a doubly-linked list, and the tradeoffs between time and storage space in various algorithms, but that doesn’t tell me when I need a colon or a semicolon or parentheses or curly braces or what the built-in command is for parsing a string into an int. I can read C/C++/Java code all right, I just don’t know how to actually write useful things with them.”
wee free kings

PM 2Ring
Posts: 3715
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Help writing a webpage download program

Postby PM 2Ring » Thu Feb 09, 2012 12:49 am UTC

This can be done as a simple bash script. Alternatively, it could be done as a Python program (which should already be installed), but I think that for a basic task like this, bash is more than adequate.

The command for curl would look something like

curl -s http://api.usno.navy.mil/imagery/moon.png >moon.png

The -s option tells curl to be silent and not print a progress report, but you might want to leave that out, especially during testing.

You could use md5sum to do the file comparison, but there's really no need - you can do a direct file comparison using diff. OTOH, md5sum would be useful if you wanted to check if the current image is identical to any of those that you've already stored.
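
To make that concrete, here's a sketch of both checks. The file names and the Archive/ directory are made up for the demo; on OS X, `md5 -q FILE` prints just the hash, while on Linux the command is `md5sum`.

```shell
# Byte-for-byte comparison with cmp, plus an md5-based check of whether
# a new file duplicates anything already in an archive directory.
# (On OS X, replace `md5sum FILE | cut -d' ' -f1` with `md5 -q FILE`.)

# is_duplicate FILE DIR: succeed (exit 0) if FILE's md5 matches any file in DIR
is_duplicate() {
    NEWMD5=$(md5sum "$1" | cut -d' ' -f1)
    for f in "$2"/*; do
        [ -e "$f" ] || continue   # DIR may be empty
        [ "$(md5sum "$f" | cut -d' ' -f1)" = "$NEWMD5" ] && return 0
    done
    return 1
}

# Demo with throwaway files:
DIR=$(mktemp -d)
mkdir "$DIR/Archive"
printf 'hello' > "$DIR/new.png"
printf 'hello' > "$DIR/Archive/old.png"
printf 'world' > "$DIR/Archive/other.png"

# cmp -s exits 0 when the two files are byte-for-byte identical
cmp -s "$DIR/new.png" "$DIR/Archive/old.png" && echo "identical"
is_duplicate "$DIR/new.png" "$DIR/Archive" && echo "already archived"
```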

To do the time delay, you'd use the sleep command
sleep 5m

This script will be pretty elementary, so it's probably easier to present you with a complete example, rather than just giving the syntax for each command. :)

I'm curious - why do you want a copy of all those files? A month's worth of moon pictures will take up more than a gigabyte if they're changed every 5 minutes.


FWIW, I wrote a moon-phase script last year that creates fake moon images by compositing a full moon image with a greyscale mask depicting the phase.
Last edited by PM 2Ring on Thu Feb 09, 2012 3:27 am UTC, edited 2 times in total.

zed0
Posts: 179
Joined: Sun Dec 17, 2006 11:00 pm UTC

Re: Help writing a webpage download program

Postby zed0 » Thu Feb 09, 2012 12:52 am UTC

I would suggest just using bash to do this; I believe it comes preinstalled on OS X.
You have the right idea with curl and md5 hashes: you can use `curl -o foo.png URL` to save to a specific file and `md5sum foo.png` to get its md5 sum.

I have actually just written a program to do this out of boredom. It currently works on a Linux system, so you may have to edit a few things, but it basically works:
Spoiler:

Code:

#!/bin/bash
URL='http://api.usno.navy.mil/imagery/moon.png' #source url
DESTFILE='Archive/moon.' #base name to save to
FILEEXT='.png' #file extension to save with
DELAY=3000 #delay (in seconds) between checks


while true
do
        #create a temporary file and download the current image to it
        FILE=$(mktemp moon.tmp.XXXXXX)
        curl -s -o "$FILE" "$URL"
        #get the md5 sum of the downloaded file
        NEWMD5="$(md5sum "$FILE" | sed -e 's/ .*//')"
        #check it against the previous md5 (if one exists)
        if [ "$NEWMD5" != "$LASTMD5" ];
        then
                #if it changed, move the file to a permanent location
                NEWFILE="$DESTFILE$(date +%s)$FILEEXT"
                mv "$FILE" "$NEWFILE"
                echo "New version, saved as $NEWFILE."
                LASTMD5=$NEWMD5
        else
                #otherwise remove the temporary file
                rm "$FILE"
        fi
        #wait until the next check
        sleep "$DELAY"
done
Looking at how often the image changes (around once every 30 seconds), I'm not sure whether all this is worth it, or whether you should just take a snapshot each time and assume it will be different.

PM 2Ring

Re: Help writing a webpage download program

Postby PM 2Ring » Thu Feb 09, 2012 1:08 am UTC

Thanks, zed0; that saves me from writing it. :)

A couple of suggestions:
DELAY should be 300 or 5m.

The +%s in $(date +%s) will give the time in seconds since 1970-01-01 00:00:00 UTC. A more human-friendly format string would be
+%F_%T
or
+%F_%H-%M-%S
if you hate colons in file names.
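
To illustrate: the commands below are real, but the outputs shown in the comments are just examples of what each format looks like.

```shell
# The three format strings side by side. %F is YYYY-MM-DD, %T is HH:MM:SS.
date +%s             # seconds since the epoch, e.g. 1328750941
date +%F_%T          # e.g. 2012-02-09_01:08:05
date +%F_%H-%M-%S    # e.g. 2012-02-09_01-08-05  (no colons)

# Using it to build a file name:
STAMP=$(date +%F_%H-%M-%S)
echo "moon.$STAMP.png"
```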

Qaanol

Re: Help writing a webpage download program

Postby Qaanol » Thu Feb 09, 2012 1:54 am UTC

Thanks everybody.
wee free kings

bittyx
Posts: 194
Joined: Tue Sep 25, 2007 9:10 pm UTC
Location: Belgrade, Serbia

Re: Help writing a webpage download program

Postby bittyx » Thu Feb 09, 2012 6:50 am UTC

Also, instead of the delay and the endlessly looping program, you might want to make a script that just runs a single pass (download, compare to previous, save if different), and schedule it to run every N minutes via cron.
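
A single pass might look something like this. The `last.md5` state file and the Archive/ layout are my own assumptions (cron starts a fresh process each time, so the previous hash has to live on disk rather than in a shell variable), and on OS X you'd swap `md5sum FILE | cut -d' ' -f1` for `md5 -q FILE`.

```shell
#!/bin/sh
# One pass: fetch the URL, archive a timestamped copy only if its md5
# differs from the previous fetch. The previous hash persists between
# cron runs in ARCHIVE/last.md5.
fetch_if_changed() {
    URL=$1
    ARCHIVE=$2
    STATE="$ARCHIVE/last.md5"
    mkdir -p "$ARCHIVE"
    FILE=$(mktemp "${TMPDIR:-/tmp}/moon.XXXXXX")
    curl -s -o "$FILE" "$URL" || { rm -f "$FILE"; return 1; }
    NEWMD5=$(md5sum "$FILE" | cut -d' ' -f1)   # OS X: md5 -q "$FILE"
    if [ "$NEWMD5" != "$(cat "$STATE" 2>/dev/null)" ]; then
        NEWFILE="$ARCHIVE/moon.$(date +%F_%H-%M-%S).png"
        mv "$FILE" "$NEWFILE"
        printf '%s\n' "$NEWMD5" > "$STATE"
        echo "New version, saved as $NEWFILE"
    else
        rm "$FILE"
    fi
}

# e.g.: fetch_if_changed 'http://api.usno.navy.mil/imagery/moon.png' Archive
```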

thedufer
Posts: 263
Joined: Mon Aug 06, 2007 2:11 am UTC
Location: Northern VA (not to be confused with VA)

Re: Help writing a webpage download program

Postby thedufer » Thu Feb 09, 2012 7:03 pm UTC

bittyx wrote:Also, instead of the delay and the endlessly looping program, you might want to make a script that just runs a single pass (download, compare to previous, save if different), and schedule it to run every N minutes via cron.


This. You really don't want to have a program that needs to be running all the time - it will take up memory, won't handle errors well, and won't automatically start on reboot. Cron helps with all of these. "crontab -e" is probably a relevant command.
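
For example, after running `crontab -e`, a line like the following (the script path here is hypothetical) would run the single-pass script every 5 minutes:

```
# min hour day-of-month month day-of-week  command
*/5   *    *            *     *            /Users/you/bin/moonfetch.sh
```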

EivindEE
Posts: 3
Joined: Tue May 24, 2011 6:31 am UTC

Re: Help writing a webpage download program

Postby EivindEE » Fri Feb 10, 2012 10:15 am UTC

Qaanol wrote:I am aware of the following: the ‘wget’ command is not available in OS X, but ‘curl’ is.

I just wanted to mention that wget is available in OS X. I use it all the time.

Edit:
It's not included by default, but it can be installed using MacPorts.

