Web Archiver - Documentation

by Ravneet Singh
Purpose
    This application archives pages to our own machines so that links to pages on foreign servers do not break in the future.
Setup: Main settings
    The main program is located at http://www.cs.rutgers.edu/common/apps/webarchiver/. There you can edit all pages archived using this engine. It is meant to be the master administrative application. The settings are located in 'include/archive_settings.php'.
archive_settings.php
Variable                    Description
$lootFolder                 The web-accessible folder where all archived pages will be saved.
$tempFolder                 The web-accessible folder where temporary files will be created.
$httpLootFolder             The HTTP location of the $lootFolder.
$httpTempFolder             The HTTP location of the $tempFolder.
$forwardfinishedPage        Page to forward to after the script has finished archiving. (This should NOT be changed.)
$archive_authorized_user    User name used for authorization.
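
A minimal sketch of what 'include/archive_settings.php' might look like, to show how the variables above relate to each other. Every value is a placeholder: the paths, URLs, forward page and user name are assumptions for illustration, not the real configuration, and whether the folders need a trailing slash is not stated in this document.

```php
<?php
// include/archive_settings.php (illustrative placeholder values only)

// Filesystem folders; both must be reachable through the web server.
$lootFolder = '/path/to/webroot/archive/loot/';   // hypothetical path
$tempFolder = '/path/to/webroot/archive/temp/';   // hypothetical path

// HTTP locations that correspond to the two folders above.
$httpLootFolder = 'http://www.example.edu/archive/loot/';   // hypothetical URL
$httpTempFolder = 'http://www.example.edu/archive/temp/';   // hypothetical URL

// Page to forward to after archiving finishes. The real value should NOT
// be changed; the file name shown here is only a stand-in.
$forwardfinishedPage = 'finished.php';

// User name used for authorization (hypothetical name).
$archive_authorized_user = 'archive_admin';
```

Note that each folder appears twice, once as a filesystem path and once as the URL that serves it, so the two must be kept in sync.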
Setup: Database
    The database only needs to be modified when creating a new category. To add a new category, add a row to the 'CATEGORIES' table, then create another view based on the 'ARTICLE' table that limits its rows to the newly added category.
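
The view for a new category could be generated along the following lines. This is only a sketch: the column name `category`, the view name and the category ID are assumptions, since the actual schema is not listed in this document, and the helper buildCategoryViewSql() is hypothetical.

```php
<?php
// Builds the SQL for a per-category view over the ARTICLE table.
// ASSUMPTION: ARTICLE has a column named `category` holding the CATEGORIES ID.
function buildCategoryViewSql(string $viewName, int $categoryId): string
{
    // Identifiers cannot be bound as query parameters, so accept only
    // plain letters, digits and underscores in the view name.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $viewName)) {
        throw new InvalidArgumentException("Unsafe view name: $viewName");
    }
    return sprintf(
        'CREATE VIEW %s AS SELECT * FROM ARTICLE WHERE category = %d',
        $viewName,
        $categoryId
    );
}

// Hypothetical view name and category ID.
echo buildCategoryViewSql('PRESS_RELEASES_VIEW', 3), "\n";
```

The resulting view name is what a category's settings.php would then refer to in $view_table.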
Setup: Category
    An example setup of a category can be found at http://cs.rutgers.edu/news/pressreleases/. The exact files can be copied to another location to set up a new category. Then modify settings.php to match the new category.
settings.php
Variable             Description
$view_table          Name of the view to use.
$default_category    The ID of the main category. User definable in case a view displays multiple categories.
$page_title          Title of the category. Displayed on all pages.
$user_authorized     User name used for authorization.
$rootFolder          The folder of the MAIN scripts.
    The categories execute the same scripts as the main application. The variables in settings.php specify which category those scripts are limited to.
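
As an illustration, a category's settings.php might look like the sketch below. All values are placeholders chosen for this example (the view name, category ID, title, user name and path are assumptions), and whether $rootFolder is a filesystem path or a URL is not stated in this document.

```php
<?php
// settings.php for one category (illustrative placeholder values only)

$view_table       = 'PRESS_RELEASES_VIEW';  // hypothetical view over ARTICLE
$default_category = 3;                      // hypothetical CATEGORIES ID
$page_title       = 'Press Releases';       // shown on all pages
$user_authorized  = 'archive_admin';        // hypothetical user name

// Folder of the MAIN scripts, which every category reuses.
// ASSUMPTION: written here as a filesystem path.
$rootFolder = '/path/to/webroot/common/apps/webarchiver/';
```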
Database Structure
    There are four tables used in this application. The 'ARTICLE' table stores information about the article, e.g. subject, dates, category, parent page, and other information. Each page archived under an article has its information saved in the 'ARCHIVED_PAGE' table, which stores information about the page such as the URL, local location, and date archived. Any sub pages that are archived can be found in the 'ARCHIVED_PAGE_CHILD' table. The 'CATEGORIES' table stores the categories.
Archiving Engine
    The archiving engine uses the built-in function file_get_contents() to retrieve content from a foreign server. After retrieving the source of a page, the engine parses it: it locates every <> tag and the text between the angle brackets. If that text starts with a '/' (a closing tag), the engine does nothing; otherwise it parses the string and creates a node that contains an array of attributes. Based on the attributes, the engine decides what to do with the element. Any changes to the attributes are saved back into the source. If the engine decides to archive content referenced within the tag, it downloads the content using file_get_contents() and saves the downloaded file under a path derived from the location the content was retrieved from, e.g. article_folder/host_name/uri.
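
The steps above can be sketched roughly as follows. This is not the engine's actual code: the function names parseTag(), localPathFor() and archiveResource(), the node layout, and the 'index.html' default for directory URLs are all assumptions made for illustration. Only file_get_contents() and the article_folder/host_name/uri layout come from the description above.

```php
<?php
// Turns the text between '<' and '>' into a node with an attribute array.
// Returns null for closing tags (text starting with '/') and for anything
// that does not start with a tag name (comments, doctype).
function parseTag(string $inner): ?array
{
    $inner = trim($inner);
    if ($inner === '' || $inner[0] === '/') {
        return null;
    }
    if (!preg_match('/^([A-Za-z][A-Za-z0-9]*)/', $inner, $m)) {
        return null;
    }
    $node = ['name' => strtolower($m[1]), 'attributes' => []];

    // Match name="value", name='value' and name=value pairs.
    preg_match_all(
        '/([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/',
        substr($inner, strlen($m[1])),
        $pairs,
        PREG_SET_ORDER
    );
    foreach ($pairs as $p) {
        // Trailing unmatched groups are absent, so take the last one present.
        $node['attributes'][strtolower($p[1])] = $p[4] ?? $p[3] ?? $p[2] ?? '';
    }
    return $node;
}

// Maps a remote URL to article_folder/host_name/uri.
// Query strings are ignored in this sketch.
function localPathFor(string $articleFolder, string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $path = $parts['path'] ?? '/';
    if (substr($path, -1) === '/') {
        $path .= 'index.html'; // ASSUMPTION: default name for directory URLs
    }
    return rtrim($articleFolder, '/') . '/' . $parts['host'] . $path;
}

// Downloads one resource and stores it under its derived local path.
function archiveResource(string $articleFolder, string $url): string
{
    $content = file_get_contents($url);
    if ($content === false) {
        throw new RuntimeException("Could not retrieve $url");
    }
    $target = localPathFor($articleFolder, $url);
    if (!is_dir(dirname($target))) {
        mkdir(dirname($target), 0755, true); // create host/uri folders as needed
    }
    file_put_contents($target, $content);
    return $target;
}

// Hypothetical article folder and URL.
echo localPathFor('loot/article_12', 'http://example.org/img/logo.png'), "\n";
// prints: loot/article_12/example.org/img/logo.png
```

A real implementation would also need to resolve relative URLs against the parent page and guard against '..' segments in the path before writing to disk; both are left out here to keep the sketch short.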
Linked pages
    Linked pages are stored like any regular page, and the same process is applied when archiving them. When the parent page is parsed, all links are replaced with 'javascript:void(0)', unless a child page is found that is a copy of the remote link, in which case the link is replaced with the location of the local copy.
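
The link replacement rule can be sketched as a small function. The function name rewriteLink(), the shape of the $archivedChildren map (remote URL to local copy location) and the $voidUnarchived flag are assumptions for illustration; the flag is included because introduction pages, described in the next section, leave unarchived links untouched.

```php
<?php
// Decides what an href on a parent page should become.
// $archivedChildren maps a remote URL to the location of its local copy
// (hypothetical structure).
function rewriteLink(string $href, array $archivedChildren, bool $voidUnarchived = true): string
{
    if (isset($archivedChildren[$href])) {
        return $archivedChildren[$href]; // point at the local copy
    }
    // Regular pages void the link; introduction pages keep the original.
    return $voidUnarchived ? 'javascript:void(0)' : $href;
}

// Hypothetical example data.
$children = ['http://example.org/more.html' => 'example.org/more.html'];

echo rewriteLink('http://example.org/more.html', $children), "\n";
// prints: example.org/more.html
echo rewriteLink('http://example.org/other.html', $children), "\n";
// prints: javascript:void(0)
echo rewriteLink('http://example.org/other.html', $children, false), "\n";
// prints: http://example.org/other.html
```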
Introduction Page
    The introduction page is supposed to be a page that contains a description of one or more web pages. Any links found in the introduction can be archived or left alone, pointing to the original location. Introduction pages are created and edited using the browser's built-in editors. Upon saving, the page is stored in a temporary location and archived. The only difference in this archive is that sub links that are not archived are not voided; they point to the actual location.
Login