PHP Cross Reference of Crawler |
Main crawler functions
| File Size: | 488 lines (16 kb) |
| Included or required: | 4 times |
| Referenced: | 0 times |
| Includes or requires: | 0 files |
| curl_page($url) X-Ref |
| cURL function that fetches a page and returns its HTML and page info as an array return: array Associative array of results |
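The source of `curl_page()` is not shown on this page, so the following is only a minimal sketch of how such a fetcher might look. The name `sketch_curl_page`, the option values, and the array keys `html`, `info`, and `error` are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for curl_page(); keys and option values are assumptions.
function sketch_curl_page(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // assumed redirect limit
        CURLOPT_TIMEOUT        => 30,   // assumed timeout in seconds
    ]);

    $html  = curl_exec($ch);    // string on success, false on failure
    $info  = curl_getinfo($ch); // associative array: http_code, content_type, ...
    $error = curl_error($ch);   // empty string when no error occurred

    return [
        'html'  => $html === false ? '' : $html,
        'info'  => $info,
        'error' => $error,
    ];
}
```

A caller would typically check `$result['info']['http_code']` before handing `$result['html']` to the parsing functions.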
| parse_title($data) X-Ref |
| Function to parse page for title tags return: string|bool title of page, or null if not found |
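As an illustration of title extraction, here is a minimal sketch that assumes a regular-expression approach. The name `sketch_parse_title` is hypothetical, and the crawler's actual implementation may differ.

```php
<?php
// Hypothetical stand-in for parse_title(); the real implementation is not shown here.
function sketch_parse_title(string $data)
{
    // Match the first <title>...</title> pair, case-insensitively and across newlines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $data, $m)) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null; // no title tag found
}
```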
| parse_links($data) X-Ref |
| Function to parse page for links return: array Numeric array of links on page (URLs only) |
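Link extraction could be sketched as follows, assuming a regular expression over quoted `href` attributes of `<a>` tags. The name `sketch_parse_links` is hypothetical. A DOM parser would be more robust against malformed markup, and unquoted attribute values are not handled here.

```php
<?php
// Hypothetical stand-in for parse_links(); returns raw href values, uncleaned.
function sketch_parse_links(string $data): array
{
    // <a ... href="value"> or <a ... href='value'>, case-insensitive.
    $pattern = '/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $data, $m)) {
        return []; // no links found (or the regex failed)
    }
    return $m[2]; // numeric array of URLs, in document order
}
```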
| parse_dir($url) X-Ref |
| Given a URL calculates the page's directory return: string Directory |
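One plausible reading of "the page's directory" is the URL up to and including the last slash of its path. The sketch below assumes that reading. The name `sketch_parse_dir` is hypothetical, and ports, credentials, and query strings are deliberately dropped.

```php
<?php
// Hypothetical stand-in for parse_dir(); assumes "directory" means scheme + host + path up to the last slash.
function sketch_parse_dir(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return ''; // not an absolute URL
    }
    $path = $parts['path'] ?? '/';
    // Keep everything up to and including the last slash of the path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $parts['scheme'] . '://' . $parts['host'] . $dir;
}
```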
| clean_link($link, $dir) X-Ref |
| Uniformly cleans a link to avoid duplicates 1. Changes relative links to absolute (/bar to http://www.foo.com/bar) 2. Removes anchor tags (foo.html#bar to foo.html) 3. Adds trailing slash if directory (foo.com/bar to foo.com/bar/) 4. Adds www if there is not a subdomain (foo.com to www.foo.com but not bar.foo.com) return: string cleaned link |
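The four cleaning rules above can be sketched roughly as follows. This is not the crawler's code: `sketch_clean_link` is a hypothetical name, step 1 is a simplified substitute for `url_to_absolute()`, "directory" is assumed to mean a last path segment without a dot, and "no subdomain" is assumed to mean a host with exactly one dot (which misclassifies hosts such as foo.co.uk).

```php
<?php
// Hypothetical stand-in for clean_link(); heuristics are assumptions, see lead-in.
function sketch_clean_link(string $link, string $dir): string
{
    // 1. Relative to absolute (simplified; no "../" handling).
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        if (substr($link, 0, 1) === '/') {
            $base = parse_url($dir);
            if (!isset($base['scheme'], $base['host'])) {
                return $link; // cannot resolve without an absolute base
            }
            $link = $base['scheme'] . '://' . $base['host'] . $link;
        } else {
            $link = rtrim($dir, '/') . '/' . $link;
        }
    }

    // 2. Remove the fragment (foo.html#bar to foo.html).
    $link = (string) preg_replace('/#.*$/s', '', $link);

    $parts = parse_url($link);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $link;
    }

    // 3. Add a trailing slash when the last path segment looks like a directory.
    $path = $parts['path'] ?? '/';
    if (substr($path, -1) !== '/' && strpos(basename($path), '.') === false) {
        $path .= '/';
    }

    // 4. Add www when the host has no subdomain (assumed: exactly one dot).
    $host = $parts['host'];
    if (substr_count($host, '.') === 1) {
        $host = 'www.' . $host;
    }

    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $parts['scheme'] . '://' . $host . $path . $query;
}
```

Non-HTTP links such as `mailto:` would be mangled by this sketch, so they should be filtered out before cleaning.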
| is_image($link) X-Ref |
| Uses a regular expression to check whether a given link is an image return: bool true on image, false on anything else |
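A check of this kind might look like the sketch below. The name `sketch_is_image` and the list of extensions are assumptions, since the actual pattern is not shown on this page.

```php
<?php
// Hypothetical stand-in for is_image(); the extension list is an assumption.
function sketch_is_image(string $link): bool
{
    // Match a known image extension at the end of the URL, optionally followed by a query string.
    return preg_match('/\.(jpe?g|gif|png|bmp)(\?.*)?$/i', $link) === 1;
}
```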
| out_of_domain($link) X-Ref |
| Checks to see that a given link is within the domain whitelist Note to self: this can be rewritten using a single regex command return: bool true if out of domain, false if on domain whitelist |
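The "single regex" rewrite mentioned in the note could be sketched as below. The documented function takes only `$link`, so the `$whitelist` parameter is a hypothetical addition to keep the example self-contained, as is the name `sketch_out_of_domain`. How the crawler actually stores its whitelist is not shown here.

```php
<?php
// Hypothetical single-regex variant of out_of_domain(); $whitelist is an assumed extra parameter.
function sketch_out_of_domain(string $link, array $whitelist): bool
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host) || $whitelist === []) {
        return true; // no host or no whitelist: treat as out of domain
    }

    // Build one alternation from the whitelist, escaping regex metacharacters.
    $alternatives = implode('|', array_map(function ($domain) {
        return preg_quote($domain, '/');
    }, $whitelist));

    // In domain when the host equals a whitelisted domain or is a subdomain of one.
    return preg_match('/(^|\.)(' . $alternatives . ')$/i', $host) !== 1;
}
```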
| is_mailto($link) X-Ref |
| Checks to see if a given link is in fact a mailto: link return: bool true on mailto:, false on everything else |
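A minimal sketch of such a check, assuming a case-insensitive prefix test (the name `sketch_is_mailto` is hypothetical):

```php
<?php
// Hypothetical stand-in for is_mailto().
function sketch_is_mailto(string $link): bool
{
    // stripos() returns 0 only when the link starts with "mailto:" (any letter case).
    return stripos(ltrim($link), 'mailto:') === 0;
}
```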
| add_url($link,$clicks) X-Ref |
| Adds a URL to the URLs table upon discovery in a link return: bool true on success, false on fail |
| add_link($from,$to) X-Ref |
| Adds a link to the links table return: int|bool LinkID on success, false on fail |
| get_links($pageID,$click = '') X-Ref |
| Grab all links on a given page, optionally for a specific depth return: array Multidimensional array keyed by target pageID with page data |
| count_links($pageID,$direction) X-Ref |
| Shorthand MySQL function to count links in or out of a given page return: int Number of links |
| get_page($pageID) X-Ref |
| Shorthand MySQL function to get a particular page's row return: array Associative array of page data |
| uncrawled_urls() X-Ref |
| Shorthand MySQL function to get the first 100 uncrawled URLs return: array Associative array of uncrawled URLs & page data |
| have_url($url) X-Ref |
| Checks to see if a given URL is already in the pages table return: bool true if URL exists, false if not found |
| count_slashes($url) X-Ref |
| No description |
| get_slashes($url) X-Ref |
| No description |
| url_to_absolute( $baseUrl, $relativeUrl ) X-Ref |
| Converts a relative URL (/bar) to an absolute URL (http://www.foo.com/bar) Inspired from code available at http://nadeausoftware.com/node/79, Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php) return: string Absolute URL |
| url_remove_dot_segments( $path ) X-Ref |
| Required function of URL to absolute Inspired from code available at http://nadeausoftware.com/node/79, Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php) |
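Dot-segment removal is commonly done with the algorithm in RFC 3986, section 5.2.4. The sketch below implements that algorithm on the assumption that `url_remove_dot_segments()` behaves the same way. It is not the Nadeau-derived code itself, and `sketch_remove_dot_segments` is a hypothetical name.

```php
<?php
// Sketch of RFC 3986 section 5.2.4 "Remove Dot Segments"; not the crawler's own code.
function sketch_remove_dot_segments(string $path): string
{
    $in  = $path; // input buffer
    $out = '';    // output buffer

    // Drop the last segment (and its preceding slash) from the output buffer.
    $dropLast = function (string $s): string {
        $p = strrpos($s, '/');
        return $p === false ? '' : substr($s, 0, $p);
    };

    while ($in !== '') {
        if (strncmp($in, '../', 3) === 0) {
            $in = substr($in, 3);
        } elseif (strncmp($in, './', 2) === 0) {
            $in = substr($in, 2);
        } elseif (strncmp($in, '/./', 3) === 0) {
            $in = substr($in, 2);
        } elseif ($in === '/.') {
            $in = '/';
        } elseif (strncmp($in, '/../', 4) === 0) {
            $in  = substr($in, 3);
            $out = $dropLast($out);
        } elseif ($in === '/..') {
            $in  = '/';
            $out = $dropLast($out);
        } elseif ($in === '.' || $in === '..') {
            $in = '';
        } else {
            // Move the first path segment (with any leading slash) to the output.
            $pos = strpos($in, '/', 1);
            if ($pos === false) {
                $out .= $in;
                $in = '';
            } else {
                $out .= substr($in, 0, $pos);
                $in = substr($in, $pos);
            }
        }
    }
    return $out;
}
```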
| split_url( $url, $decode=TRUE ) X-Ref |
| Required function of URL to absolute Inspired from code available at http://nadeausoftware.com/node/79, Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php) |
| join_url( $parts, $encode=TRUE ) X-Ref |
| Required function of URL to absolute Inspired from code available at http://nadeausoftware.com/node/79, Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php) |
| file_size($size) X-Ref |
| Returns filesize in human readable terms Inspired by code available at http://stackoverflow.com/questions/1222245/calculating-script-memory-usages-in-php Code distributed under CC-Wiki License (http://creativecommons.org/licenses/by-sa/2.5/) |
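Human-readable size formatting can be sketched as repeated division by 1024. The name `sketch_file_size`, the unit labels, and the two-decimal rounding are assumptions. The crawler's own output format is not shown on this page.

```php
<?php
// Hypothetical stand-in for file_size(); units and rounding are assumptions.
function sketch_file_size(float $size): string
{
    $units = ['b', 'kb', 'mb', 'gb', 'tb'];
    $i = 0;
    // Divide by 1024 until the value fits the current unit or units run out.
    while ($size >= 1024 && $i < count($units) - 1) {
        $size /= 1024;
        $i++;
    }
    return round($size, 2) . ' ' . $units[$i];
}
```

Typical usage would be `sketch_file_size(memory_get_usage())` to report a script's memory footprint.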
| Generated: Thu Jun 3 17:10:09 2010 | Cross-referenced by PHPXref 0.7 |