PHP Cross Reference of Crawler

/includes/ -> functions.php (summary)

Main crawler functions

File Size: 488 lines (16 kb)
Included or required: 4 times
Referenced: 0 times
Includes or requires: 0 files

Defines 22 functions

  curl_page()
  parse_title()
  parse_links()
  parse_dir()
  clean_link()
  is_image()
  out_of_domain()
  is_mailto()
  add_url()
  add_link()
  get_links()
  count_links()
  get_page()
  uncrawled_urls()
  have_url()
  count_slashes()
  get_slashes()
  url_to_absolute()
  url_remove_dot_segments()
  split_url()
  join_url()
  file_size()

Functions
Functions that are not part of a class:

curl_page($url)   X-Ref
cURL wrapper that returns a page's HTML and page info as an array

return: array Associative array of results
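
A minimal sketch of such a cURL wrapper, for illustration only. The function name example_curl_page(), the chosen options, and the 'html', 'info' and 'error' array keys are assumptions; the summary does not list the keys that curl_page() actually returns.

```php
<?php
// Hypothetical sketch; not the crawler's actual curl_page().
function example_curl_page(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // assumed redirect limit
        CURLOPT_TIMEOUT        => 30,   // assumed timeout in seconds
    ]);

    $html  = curl_exec($ch);    // false on failure
    $info  = curl_getinfo($ch); // includes 'http_code', 'content_type', 'total_time'
    $error = curl_error($ch);   // empty string when the transfer succeeded

    // The handle is released when $ch goes out of scope.
    return [
        'html'  => $html === false ? '' : $html,
        'info'  => $info,
        'error' => $error,
    ];
}
```

A caller would typically check $result['info']['http_code'] before handing $result['html'] to the parsing functions.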

parse_title($data)   X-Ref
Function to parse page for title tags

return: string|bool Title of page, or null if not found
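
One way to extract a title is a single case-insensitive regular expression. The sketch below is an assumption about the approach, not the code in functions.php; example_parse_title() is a hypothetical name, and it returns null when no title is found.

```php
<?php
// Hypothetical sketch; not the crawler's actual parse_title().
function example_parse_title(string $html): ?string
{
    // Match the first <title>...</title> pair, case-insensitively and across newlines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m) !== 1) {
        return null;
    }

    // Decode HTML entities, then collapse runs of whitespace into single spaces.
    $title = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return trim(preg_replace('/\s+/', ' ', $title));
}
```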

parse_links($data)   X-Ref
Function to parse page for links

return: array Numeric array of links on page (URLs only)
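
A sketch of link extraction using PHP's bundled DOM extension rather than a regular expression. Whether parse_links() uses DOM or regex is not stated here, so treat this as one possible approach; example_parse_links() is a hypothetical name.

```php
<?php
// Hypothetical sketch; not the crawler's actual parse_links().
function example_parse_links(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Real-world markup is often malformed; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns '' when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}
```

The returned URLs are raw href values, so they would still need cleaning (for example by clean_link()) before being stored.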

parse_dir($url)   X-Ref
Given a URL, calculates the page's directory

return: string Directory
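
A possible reading of "directory" is everything up to and including the last slash of the path. The sketch below assumes that reading and assumes the result keeps the scheme and host; example_parse_dir() is a hypothetical name, and ports are ignored for brevity.

```php
<?php
// Hypothetical sketch; not the crawler's actual parse_dir().
function example_parse_dir(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL; leave it untouched
    }

    // A URL such as http://www.foo.com has no path component at all.
    $path = $parts['path'] ?? '/';

    // Keep everything up to and including the last slash.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $parts['scheme'] . '://' . $parts['host'] . $dir;
}
```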

clean_link($link, $dir)   X-Ref
Uniformly cleans a link to avoid duplicates

1. Changes relative links to absolute (/bar to http://www.foo.com/bar)
2. Removes anchor fragments (foo.html#bar to foo.html)
3. Adds trailing slash if directory (foo.com/bar to foo.com/bar/)
4. Adds www if there is not a subdomain (foo.com to www.foo.com but not bar.foo.com)

return: string Cleaned link
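
The four cleaning steps above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: example_clean_link() is a hypothetical name, $dir is assumed to be an absolute directory URL ending in a slash, "directory" is guessed as "last path segment has no dot", and "no subdomain" is guessed as "host has exactly two labels" (which misclassifies hosts such as foo.co.uk).

```php
<?php
// Hypothetical sketch; not the crawler's actual clean_link().
function example_clean_link(string $link, string $dir): string
{
    // Step 2: drop the fragment (foo.html#bar to foo.html).
    $link = explode('#', $link, 2)[0];

    // Step 1: resolve relative links against $dir.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        $base = parse_url($dir);
        $root = $base['scheme'] . '://' . $base['host'];
        $link = (isset($link[0]) && $link[0] === '/')
            ? $root . $link                     // root-relative: /bar
            : rtrim($dir, '/') . '/' . $link;   // directory-relative: bar.html
    }

    $parts = parse_url($link);
    $host  = $parts['host'];
    $path  = $parts['path'] ?? '/';

    // Step 4: add www when the host has exactly two labels (assumed heuristic).
    if (substr_count($host, '.') === 1) {
        $host = 'www.' . $host;
    }

    // Step 3: add a trailing slash when the last segment looks like a directory.
    if (substr($path, -1) !== '/' && strpos(basename($path), '.') === false) {
        $path .= '/';
    }

    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $parts['scheme'] . '://' . $host . $path . $query;
}
```

The sketch expects mailto: and similar non-HTTP links to have been filtered out beforehand, since it would otherwise treat them as relative paths.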

is_image($link)   X-Ref
Performs a regular expression match to see if a given link is an image

return: bool true on image, false on anything else
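
A sketch of an extension-based image check. The list of extensions is an assumption, as is the choice to test only the URL path so that query strings are ignored; example_is_image() is a hypothetical name.

```php
<?php
// Hypothetical sketch; not the crawler's actual is_image().
function example_is_image(string $link): bool
{
    // Look only at the path so "?v=2" or "#top" cannot hide the extension.
    // parse_url() returns null or false when there is no usable path.
    $path = (string) parse_url($link, PHP_URL_PATH);

    // Assumed extension list; case-insensitive.
    return preg_match('/\.(?:jpe?g|png|gif|bmp|svg|webp|ico)$/i', $path) === 1;
}
```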

out_of_domain($link)   X-Ref
Checks to see that a given link is within the domain whitelist

Note to self: this can be rewritten using a single regex command

return: bool true if out of domain, false if on domain whitelist
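
A sketch of a whitelist check. The real out_of_domain() takes only $link, so the whitelist presumably comes from configuration; here it is passed as a second parameter purely for illustration. Treating subdomains of a whitelisted domain as in-domain is also an assumption, and example_out_of_domain() is a hypothetical name.

```php
<?php
// Hypothetical sketch; not the crawler's actual out_of_domain().
function example_out_of_domain(string $link, array $whitelist): bool
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host)) {
        return true; // no host to compare, so treat it as out of domain
    }
    $host = strtolower($host);

    foreach ($whitelist as $domain) {
        $suffix = '.' . strtolower($domain);
        // Accept the domain itself or any subdomain of it.
        if ($host === strtolower($domain) || substr($host, -strlen($suffix)) === $suffix) {
            return false;
        }
    }

    return true;
}
```

Comparing against '.' . $domain rather than the bare domain keeps a host like evilfoo.com from matching a whitelist entry of foo.com.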

is_mailto($link)   X-Ref
Checks to see if a given link is in fact a mailto: link

return: bool true on mailto:, false on everything else
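
A mailto: check can be as small as a case-insensitive prefix test. The sketch below is one way to write it; example_is_mailto() is a hypothetical name, and trimming leading whitespace is an assumption.

```php
<?php
// Hypothetical sketch; not the crawler's actual is_mailto().
function example_is_mailto(string $link): bool
{
    // stripos() returns 0 only when the link starts with "mailto:" in any case.
    return stripos(ltrim($link), 'mailto:') === 0;
}
```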

add_url($link,$clicks)   X-Ref
Adds a URL to the URLs table upon discovery in a link

return: bool true on success, false on fail

add_link($from,$to)   X-Ref
Adds a link to the links table

return: int|bool LinkID on success, false on fail

get_links($pageID,$click = '')   X-Ref
Grab all links on a given page, optionally for a specific depth

return: array Multidimensional array keyed by target pageID with page data

count_links($pageID,$direction)   X-Ref
Shorthand MySQL function to count links in or out of a given page

return: int Number of links

get_page($pageID)   X-Ref
Shorthand MySQL function to get a particular page's row

return: array Associative array of page data

uncrawled_urls()   X-Ref
Shorthand MySQL function to get the first 100 uncrawled URLs

return: array Associative array of uncrawled URLs & page data

have_url($url)   X-Ref
Checks to see if a given URL is already in the pages table

return: bool true if URL exists, false if not found

count_slashes($url)   X-Ref
No description

get_slashes($url)   X-Ref
No description

url_to_absolute( $baseUrl, $relativeUrl )   X-Ref
Converts a relative URL (/bar) to an absolute URL (http://www.foo.com/bar)

Inspired by code available at http://nadeausoftware.com/node/79,
Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php)

return: string Absolute URL
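
A simplified sketch of relative-to-absolute resolution, loosely following the reference-resolution rules of RFC 3986. It is not the Nadeau-derived code referenced above: example_url_to_absolute() is a hypothetical name, $baseUrl is assumed to be a well-formed absolute URL, and a fragment-only reference drops any query string on the base.

```php
<?php
// Hypothetical sketch; not the crawler's actual url_to_absolute().
function example_url_to_absolute(string $baseUrl, string $relativeUrl): string
{
    // A reference that already has a scheme is returned unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relativeUrl)) {
        return $relativeUrl;
    }

    $base = parse_url($baseUrl);
    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative reference (//host/path): reuse the base scheme only.
    if (strncmp($relativeUrl, '//', 2) === 0) {
        return $base['scheme'] . ':' . $relativeUrl;
    }

    // Split off "?query" and "#fragment" so that only the path is merged.
    $cut     = strcspn($relativeUrl, '?#');
    $refPath = substr($relativeUrl, 0, $cut);
    $suffix  = substr($relativeUrl, $cut);

    $basePath = $base['path'] ?? '/';
    if ($refPath === '') {
        $path = $basePath;                 // "?q" or "#frag" only
    } elseif ($refPath[0] === '/') {
        $path = $refPath;                  // root-relative
    } else {
        // Merge with the base directory (everything up to the last slash).
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $refPath;
    }

    // A trailing "." or ".." segment resolves to a directory.
    if (preg_match('#/\.\.?$#', $path)) {
        $path .= '/';
    }

    // Resolve "." and ".." segments; $out[0] is the empty root segment.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $root . implode('/', $out) . $suffix;
}
```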

url_remove_dot_segments( $path )   X-Ref
Helper function required by url_to_absolute()

Inspired by code available at http://nadeausoftware.com/node/79,
Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php)
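
Dot-segment removal is described in RFC 3986, section 5.2.4. The sketch below implements the same idea with a segment stack rather than the RFC's string-buffer loop, and it is meant for absolute paths (those starting with a slash). example_remove_dot_segments() is a hypothetical name and not the code referenced above.

```php
<?php
// Hypothetical sketch; not the crawler's actual url_remove_dot_segments().
function example_remove_dot_segments(string $path): string
{
    $segments = explode('/', $path);
    $last     = count($segments) - 1;
    $output   = [];

    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            if ($segment === '..') {
                // Step up one level, but never above the root segment.
                $atRoot = count($output) === 1 && $output[0] === '';
                if ($output !== [] && !$atRoot) {
                    array_pop($output);
                }
            }
            // A trailing dot segment leaves a trailing slash behind.
            if ($i === $last) {
                $output[] = '';
            }
            continue;
        }
        $output[] = $segment;
    }

    return implode('/', $output);
}
```

For the RFC's own example, '/a/b/c/./../../g' resolves to '/a/g'.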


split_url( $url, $decode=TRUE )   X-Ref
Helper function required by url_to_absolute()

Inspired by code available at http://nadeausoftware.com/node/79,
Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php)


join_url( $parts, $encode=TRUE )   X-Ref
Helper function required by url_to_absolute()

Inspired by code available at http://nadeausoftware.com/node/79,
Code distributed under OSI BSD (http://www.opensource.org/licenses/bsd-license.php)
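
Splitting and joining are inverse operations: a URL split into parts and joined again should come back unchanged. The sketch below illustrates that round trip with PHP's built-in parse_url() standing in for split_url(). example_join_url() is a hypothetical name, and the $decode and $encode flags of the real functions are not modelled.

```php
<?php
// Hypothetical sketch; not the crawler's actual join_url().
// Expects an array shaped like the result of parse_url().
function example_join_url(array $parts): string
{
    $url = '';

    if (isset($parts['scheme'])) {
        $url .= $parts['scheme'] . ':';
    }

    if (isset($parts['host'])) {
        $url .= '//';
        if (isset($parts['user'])) {
            $url .= $parts['user'];
            if (isset($parts['pass'])) {
                $url .= ':' . $parts['pass'];
            }
            $url .= '@';
        }
        $url .= $parts['host'];
        if (isset($parts['port'])) {
            $url .= ':' . $parts['port'];
        }
    }

    $url .= $parts['path'] ?? '';

    if (isset($parts['query'])) {
        $url .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $url .= '#' . $parts['fragment'];
    }

    return $url;
}
```

Usage: example_join_url(parse_url($url)) should reproduce $url for well-formed input.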


file_size($size)   X-Ref
Returns a file size in human-readable form

Inspired by code available at http://stackoverflow.com/questions/1222245/calculating-script-memory-usages-in-php
Code distributed under CC-Wiki License (http://creativecommons.org/licenses/by-sa/2.5/)
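
A sketch of human-readable size formatting by repeated division by 1024. The unit labels, the two-decimal rounding and the name example_file_size() are assumptions; the exact output format of file_size() is not documented here.

```php
<?php
// Hypothetical sketch; not the crawler's actual file_size().
function example_file_size(int $size): string
{
    $units = ['B', 'KB', 'MB', 'GB', 'TB'];
    $value = (float) $size;
    $i     = 0;

    // Divide by 1024 until the value fits the current unit or units run out.
    while ($value >= 1024 && $i < count($units) - 1) {
        $value /= 1024;
        $i++;
    }

    // round() drops trailing zeros when the float is cast to a string.
    return round($value, 2) . ' ' . $units[$i];
}
```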




Generated: Thu Jun 3 17:10:09 2010 Cross-referenced by PHPXref 0.7