Plan out what tables you want in your database, what columns should be in each table, and what datatype each column should have (you have many to choose from!).
Then, by hook or by crook, populate your database! There are no restrictions on how you do it, other than that your procedure must be repeatable: if angry ghosts suddenly deleted your database, you should be able to execute just one command to recreate it. The best way to do this is to generate a giant .sql file with a bunch of SQL commands in it; then you can just run this file with psql -f to generate and populate your tables.
It's totally fine if your process involves some manual work, but just make sure that manual work produces an SQL file that you can save, and subsequently use as many times as needed to easily recreate your database.
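For instance, here's a rough sketch of a script that turns a data file into that kind of SQL file. The file names, table name, and column types below are all made up for illustration; substitute whatever matches your own design.

    # Sketch: convert a hypothetical books.csv (title, author, year) into an SQL file.
    # The table name and column types are placeholders; use your own schema.
    import csv

    def sql_quote(text):
        # Escape single quotes the SQL way: ' becomes ''
        return "'" + text.replace("'", "''") + "'"

    with open('books.csv') as csv_file, open('populate.sql', 'w') as sql_file:
        sql_file.write('DROP TABLE IF EXISTS books;\n')
        sql_file.write('CREATE TABLE books (title text, author text, year integer);\n')
        for row in csv.DictReader(csv_file):
            sql_file.write('INSERT INTO books (title, author, year) VALUES (%s, %s, %s);\n'
                           % (sql_quote(row['title']), sql_quote(row['author']), row['year']))

Because the generated file starts with DROP TABLE IF EXISTS, you can run psql -f populate.sql as many times as you like and always end up with the same tables.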
Your database for this project is limited to at most 10 MB. In the interactive psql command prompt, you can use the \d+ command to find out how big all your tables are. Please be nice and keep within the limit.
You are not required to check your SQL file into source control, and in fact if your database ends up being around 10 MB, your SQL file will likely be much larger. It's generally discouraged to store big data files like that in source control, and Bitbucket may yell at you if you try to push a giant file.
Submit your SQL file through the COURSES network drive. Map your network drives, go to our course directory on the COURSES drive, and copy your SQL file into the “Hand-In” directory, with the name phase_2_5.sql. (I'm having you submit this way since you really shouldn't be putting this file in source control.)
In class, we covered three ways to interact with your Postgres database:
- Running psql and then typing in SQL queries on the interactive command line.
- Running psql -f foo.sql.
- Using the psycopg2 Python module.
Remember that the password that psql asks you for is your Postgres database password, not your Carleton password! Mike Tie emailed you a password a while ago; that's the one to use.
My populate.sql example file shows the kind of file that you should produce for this phase of the project.
Jeff Ondich has a whole page of advice regarding his process of semi-automatically converting an online data table into an SQL script file. Please read that over; he's got a lot of great suggestions for different strategies you might use.
One thing I'd add to Jeff's suggestions, though, is this: if you write a script to scrape many different pages on the same website, you should avoid making your requests immediately after one another. This is tantamount to a denial-of-service attack, and many website administrators will react with pretty extreme measures (blacklisting your IP address, for example) if they notice you doing it. It's easy to prevent, though; just wait briefly (a tenth of a second is enough) between any pair of requests. In Python, you do this with time.sleep(0.1). Use urllib and/or urllib2 to make your requests.
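Here's roughly what that looks like in practice. This is just a sketch (the URLs are made up, and it uses urllib2, so it's Python 2 style); the point is the time.sleep(0.1) between requests.

    # Fetch several pages from the same site, pausing briefly between requests.
    import time
    import urllib2

    page_urls = ['http://www.example.com/page1.html',
                 'http://www.example.com/page2.html',
                 'http://www.example.com/page3.html']

    for url in page_urls:
        html = urllib2.urlopen(url).read()
        # ... extract whatever data you need from html here ...
        time.sleep(0.1)  # wait a tenth of a second so we don't hammer the server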
For guidance about using psycopg2, read the docs, examine my example Python script, and check Google. If you want to see dbtest.py running live, you can try the live version in my directory on Thacker, so long as I haven't deleted it. (That's my working directory for testing stuff on the server, too, so there's no guarantee that things will stick around in that directory.)
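If you just want the overall shape of a psycopg2 script, here's a bare-bones sketch (this is not dbtest.py, and the connection details are placeholders; fill in your own username and the password Mike sent). As a bonus, the query it runs reports your database's total size, which is a handy way to check the 10 MB limit.

    # Minimal psycopg2 sketch: connect, run one query, print the result.
    import psycopg2

    connection = psycopg2.connect(dbname='yourusername', user='yourusername',
                                  password='yourpostgrespassword')
    cursor = connection.cursor()

    # Any SQL you could type at the psql prompt works here; this one asks
    # Postgres how big the current database is.
    cursor.execute('SELECT pg_size_pretty(pg_database_size(current_database()));')
    print(cursor.fetchone()[0])

    cursor.close()
    connection.close()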
To verify that everything's kosher with your database at first, I'd suggest checking that my script works for you: chmod it (remember, it needs to be executable in order for the web server to run it); chmod +x dbtest.py will do the trick.