Plan out what tables you want in your database, what columns should be in each table, and what datatype each column should have (you have many to choose from!).
Then, by hook or by crook, populate your database! There are no restrictions on how you do it, other than that your procedure must be repeatable: if angry ghosts suddenly deleted your database, you should be able to execute just one command to recreate it. The best way to do this is to generate a giant .sql file with a bunch of SQL commands in it; then you can just run this file with psql -f to generate and populate your tables.
It's totally fine if your process involves some manual work, but just make sure that manual work produces an SQL file that you can save, and subsequently use as many times as needed to easily recreate your database.
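For instance, here's a rough sketch of a script that turns a data file into that kind of SQL file. The file names, table name, and column types below are all made up for illustration; substitute whatever matches your own design.

    # Sketch: convert a hypothetical books.csv (title, author, year) into an SQL file.
    # The table name and column types are placeholders; use your own schema.
    import csv

    def sql_quote(text):
        # Escape single quotes the SQL way: ' becomes ''
        return "'" + text.replace("'", "''") + "'"

    with open('books.csv') as csv_file, open('populate.sql', 'w') as sql_file:
        sql_file.write('DROP TABLE IF EXISTS books;\n')
        sql_file.write('CREATE TABLE books (title text, author text, year integer);\n')
        for row in csv.DictReader(csv_file):
            sql_file.write('INSERT INTO books (title, author, year) VALUES (%s, %s, %s);\n'
                           % (sql_quote(row['title']), sql_quote(row['author']), row['year']))

Because the generated file starts with DROP TABLE IF EXISTS, you can run psql -f populate.sql as many times as you like and always end up with the same tables.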
Your database for this project is limited to at most 10 MB. In the interactive psql command prompt, you can use the \d+ command to find out how big all your tables are. Please be nice and keep within the limit.
You are not required to check your SQL file into source control, and in fact if your database ends up being around 10 MB, your SQL file will likely be much larger. It's generally discouraged to store big data files like that in source control, and Bitbucket may yell at you if you try to push a giant file.
Submit your SQL file through the COURSES network drive. Map your network drives, go to our course directory on the COURSES drive, and copy your SQL file into the “Hand-In” directory, with the name phase_2_5.sql. (I'm having you submit this way since you really shouldn't be putting this file in source control.)
In class, we covered three ways to interact with your Postgres database:
- Running psql and then typing in SQL queries on the interactive command line.
- Running psql -f foo.sql.
- Using the psycopg2 Python module.
Remember that the password that psql asks you for is your Postgres database password, not your Carleton password! Mike Tie emailed you a password a while ago; that's the one to use.
My populate.sql example file shows the kind of file that you should produce for this phase of the project.
Jeff Ondich has a whole page of advice regarding his process of semi-automatically converting an online data table into an SQL script file. Please read that over; he's got a lot of great suggestions for different strategies you might use.
One thing I'd add to Jeff's suggestions, though, is this: if you write a script to scrape many different pages on the same website, you should avoid making your requests immediately after one another. This is tantamount to a denial-of-service attack, and many website administrators will react with pretty extreme measures (blacklisting your IP address, for example) if they notice you doing it. It's easy to prevent, though; just wait briefly (a tenth of a second is enough) between any pair of requests. In Python, you do this with time.sleep(0.1). Use urllib and/or urllib2 to make your requests.
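Here's roughly what that looks like in practice. This is just a sketch (the URLs are made up, and it uses urllib2, so it's Python 2 style); the point is the time.sleep(0.1) between requests.

    # Fetch several pages from the same site, pausing briefly between requests.
    import time
    import urllib2

    page_urls = ['http://www.example.com/page1.html',
                 'http://www.example.com/page2.html',
                 'http://www.example.com/page3.html']

    for url in page_urls:
        html = urllib2.urlopen(url).read()
        # ... extract whatever data you need from html here ...
        time.sleep(0.1)  # wait a tenth of a second so we don't hammer the server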
For guidance about using psycopg2, read the docs, examine my example Python script, and check Google. If you want to see dbtest.py running live, you can try the live version in my directory on Thacker, so long as I haven't deleted it. (That's my working directory for testing stuff on the server, too, so there's no guarantee that things will stick around in that directory.)
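If you just want the overall shape of a psycopg2 script, here's a bare-bones sketch (this is not dbtest.py, and the connection details are placeholders; fill in your own username and the password Mike sent). As a bonus, the query it runs reports your database's total size, which is a handy way to check the 10 MB limit.

    # Minimal psycopg2 sketch: connect, run one query, print the result.
    import psycopg2

    connection = psycopg2.connect(dbname='yourusername', user='yourusername',
                                  password='yourpostgrespassword')
    cursor = connection.cursor()

    # Any SQL you could type at the psql prompt works here; this one asks
    # Postgres how big the current database is.
    cursor.execute('SELECT pg_size_pretty(pg_database_size(current_database()));')
    print(cursor.fetchone()[0])

    cursor.close()
    connection.close()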
To verify that everything's kosher with your database at first, I'd suggest checking that my script works for you: chmod it (remember, it needs to be executable in order for the web server to run it); chmod +x dbtest.py will do the trick.