TYPO3  7.6
CrawlerHook Class Reference

Public Member Functions

 crawler_init (&$pObj)
 
 crawler_execute ($params, &$pObj)
 
 crawler_execute_type1 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type2 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type3 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type4 ($cfgRec, &$session_data, $params, &$pObj)
 
 cleanUpOldRunningConfigurations ()
 
 checkUrl ($url, $urlLog, $baseUrl)
 
 indexExtUrl ($url, $pageId, $rl, $cfgUid, $setId)
 
 indexSingleRecord ($r, $cfgRec, $rl=null)
 
 getUidRootLineForClosestTemplate ($id)
 
 generateNextIndexingTime ($cfgRec)
 
 checkDeniedSuburls ($url, $url_deny)
 
 addQueueEntryForHook ($cfgRec, $title)
 
 deleteFromIndex ($id)
 
 processCmdmap_preProcess ($command, $table, $id, $value, $pObj)
 
 processDatamap_afterDatabaseOperations ($status, $table, $id, $fieldArray, $pObj)
 

Public Attributes

 $secondsPerExternalUrl = 3
 
 $instanceCounter = 0
 
 $callBack = CrawlerHook::class
 

Detailed Description

Crawler hook for indexed search. Works with the "crawler" extension.

Definition at line 24 of file Hook/CrawlerHook.php.

Member Function Documentation

addQueueEntryForHook (   $cfgRec,
  $title 
)

Adding entry in queue for Hook

Parameters
array  $cfgRec  Configuration record
string  $title  Title/URL
Returns
void

Definition at line 629 of file Hook/CrawlerHook.php.

checkDeniedSuburls (   $url,
  $url_deny 
)

Checks whether $url begins with any of the URLs in the $url_deny "list" and, if so, returns TRUE.

Parameters
string  $url  URL to test
string  $url_deny  String where URLs are separated by line breaks. If any of these strings is the first part of $url, the function returns TRUE (to indicate that descent is denied)
Returns
bool TRUE if there is a matching URL (hence, do not index!)

Definition at line 608 of file Hook/CrawlerHook.php.

References $url, GeneralUtility\isFirstPartOfStr(), and GeneralUtility\trimExplode().

Referenced by CrawlerHook\crawler_execute_type3().
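
The prefix test described for checkDeniedSuburls() can be sketched in plain PHP. This is a minimal illustration, not the code from Hook/CrawlerHook.php: the function name isDeniedUrl() is hypothetical, and native string functions stand in for GeneralUtility::trimExplode() and GeneralUtility::isFirstPartOfStr().

```php
<?php
/**
 * Hypothetical stand-alone version of the deny-list check.
 * Returns true if $url starts with any line of $urlDeny.
 */
function isDeniedUrl(string $url, string $urlDeny): bool
{
    // Split on line breaks, trim each entry (this also drops "\r"),
    // and discard empty lines.
    $prefixes = array_filter(array_map('trim', explode("\n", $urlDeny)), 'strlen');

    foreach ($prefixes as $prefix) {
        // A deny entry matches when it is the first part of the URL.
        if (strpos($url, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

// Example URLs are made up for illustration.
$deny = "http://example.org/private/\nhttp://example.org/tmp/";
var_dump(isDeniedUrl('http://example.org/private/a.html', $deny));
var_dump(isDeniedUrl('http://example.org/public/', $deny));
```

With the example deny list, the first call matches the "private/" prefix and the second matches nothing.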

checkUrl (   $url,
  $urlLog,
  $baseUrl 
)

Checks whether an input URL is allowed to be indexed. Depends on whether it is already present in the URL log.

Parameters
string  $url  URL string to check
array  $urlLog  Array of already indexed URLs (the input URL is looked up here and must not exist already)
string  $baseUrl  Base URL of the indexing process (the input URL must be "inside" the base URL!)
Returns
string Returns the URL if OK, otherwise FALSE

Definition at line 455 of file Hook/CrawlerHook.php.

References $url, and GeneralUtility\isFirstPartOfStr().

Referenced by CrawlerHook\crawler_execute_type3().
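
The two conditions documented for checkUrl() (inside the base URL, not yet in the log) can be sketched as follows. The function name filterCandidateUrl() is hypothetical, and the real method may apply further normalization that this page does not describe.

```php
<?php
/**
 * Hypothetical sketch of the documented checks only.
 * Returns the URL if it may be indexed, otherwise false.
 */
function filterCandidateUrl(string $url, array $urlLog, string $baseUrl)
{
    // The URL must be "inside" the base URL, i.e. start with it.
    $insideBase = strpos($url, $baseUrl) === 0;

    // The URL must not have been indexed already.
    $alreadyLogged = in_array($url, $urlLog, true);

    return ($insideBase && !$alreadyLogged) ? $url : false;
}

// Example values are made up for illustration.
$log = ['http://example.org/docs/a.html'];
var_dump(filterCandidateUrl('http://example.org/docs/b.html', $log, 'http://example.org/docs/'));
var_dump(filterCandidateUrl('http://example.org/docs/a.html', $log, 'http://example.org/docs/'));
```

Returning either the URL or FALSE mirrors the documented return value, so a caller can test the result with a strict comparison against false.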

cleanUpOldRunningConfigurations ( )

Looks up all old index configurations which are finished and need to be reset and marked as done.

Returns
void

Definition at line 414 of file Hook/CrawlerHook.php.

References $GLOBALS, and BackendUtility\deleteClause().

Referenced by CrawlerHook\crawler_init().

crawler_execute (   $params,
  &$pObj 
)

Callback function for execution of a log element

Parameters
array  $params  Params from log element. Must contain $params['indexConfigUid']
object  $pObj  Parent object (tx_crawler lib)
Returns
array Result array

Definition at line 161 of file Hook/CrawlerHook.php.

References $GLOBALS, CrawlerHook\crawler_execute_type1(), CrawlerHook\crawler_execute_type2(), CrawlerHook\crawler_execute_type3(), CrawlerHook\crawler_execute_type4(), and GeneralUtility\getUserObj().
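
The reference list shows crawler_execute() handing off to four type-specific methods. A minimal sketch of such a dispatch is given here under stated assumptions: the helper resolveHandler() is hypothetical, and the numeric `type` key of the configuration record is an assumed field name that this page does not confirm. The meaning of each type is taken from the method descriptions on this page.

```php
<?php
/**
 * Hypothetical helper: map an indexing configuration record to the
 * name of the method that would process it, or null if unknown.
 */
function resolveHandler(array $cfgRec): ?string
{
    $handlers = [
        1 => 'crawler_execute_type1', // records from a table
        2 => 'crawler_execute_type2', // files from fileadmin
        3 => 'crawler_execute_type3', // external URLs
        4 => 'crawler_execute_type4', // page tree
    ];

    // 'type' is an assumed key name, used here for illustration only.
    $type = (int)($cfgRec['type'] ?? 0);

    return $handlers[$type] ?? null;
}

var_dump(resolveHandler(['type' => 3]));
var_dump(resolveHandler(['type' => 9]));
```

A lookup table keeps the mapping in one place; an unknown type yields null instead of calling a non-existent method.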

crawler_execute_type1 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj 
)

Indexing records from a table

Parameters
array  $cfgRec  Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params  Parameters from the log queue.
object  $pObj  Parent object (from "crawler" extension!)
Returns
void

Definition at line 221 of file Hook/CrawlerHook.php.

References $GLOBALS, BackendUtility\BEenableFields(), BackendUtility\deleteClause(), CrawlerHook\getUidRootLineForClosestTemplate(), and CrawlerHook\indexSingleRecord().

Referenced by CrawlerHook\crawler_execute().
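
Because $session_data is passed by reference, a handler can record its progress and pick up where it left off on the next call. The sketch below illustrates that pattern only; processBatch(), the `offset` key, and the batch size are all hypothetical and are not taken from Hook/CrawlerHook.php.

```php
<?php
/**
 * Hypothetical batch helper: returns the next slice of $uids and
 * stores the new position in $sessionData for the following call.
 */
function processBatch(array $uids, array &$sessionData, int $batchSize): array
{
    // Resume from the stored offset, or start at the beginning.
    $offset = $sessionData['offset'] ?? 0;

    $batch = array_slice($uids, $offset, $batchSize);

    // Written through the reference, so the caller sees the update.
    $sessionData['offset'] = $offset + count($batch);

    return $batch;
}

$session = [];
$uids = [10, 11, 12, 13, 14];
print_r(processBatch($uids, $session, 2)); // first call: uids 10 and 11
print_r(processBatch($uids, $session, 2)); // second call: uids 12 and 13
print_r(processBatch($uids, $session, 2)); // third call: uid 14
```

Without the `&` in the signature, the offset would be written to a local copy and every call would start again from the first uid.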

crawler_execute_type2 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj 
)

Indexing files from fileadmin

Parameters
array  $cfgRec  Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params  Parameters from the log queue.
object  $pObj  Parent object (from "crawler" extension!)
Returns
void

Definition at line 266 of file Hook/CrawlerHook.php.

References $GLOBALS, GeneralUtility\get_dirs(), GeneralUtility\getAllFilesAndFoldersInPath(), GeneralUtility\getFileAbsFileName(), CrawlerHook\getUidRootLineForClosestTemplate(), GeneralUtility\isAbsPath(), GeneralUtility\isAllowedAbsPath(), GeneralUtility\makeInstance(), GeneralUtility\removePrefixPathFromList(), and GeneralUtility\trimExplode().

Referenced by CrawlerHook\crawler_execute().

crawler_execute_type3 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj 
)

Indexing External URLs

Parameters
array  $cfgRec  Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params  Parameters from the log queue.
object  $pObj  Parent object (from "crawler" extension!)
Returns
void

Definition at line 328 of file Hook/CrawlerHook.php.

References $GLOBALS, $url, CrawlerHook\checkDeniedSuburls(), CrawlerHook\checkUrl(), CrawlerHook\getUidRootLineForClosestTemplate(), and CrawlerHook\indexExtUrl().

Referenced by CrawlerHook\crawler_execute().

crawler_execute_type4 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj 
)

Page tree indexing type

Parameters
array  $cfgRec  Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params  Parameters from the log queue.
object  $pObj  Parent object (from "crawler" extension!)
Returns
void

Definition at line 369 of file Hook/CrawlerHook.php.

References $GLOBALS, $url, BackendUtility\deleteClause(), and BackendUtility\getRecord().

Referenced by CrawlerHook\crawler_execute().

crawler_init (   &$pObj)

Initialization of the crawler hook. This function is called for each instance of the crawler, and it must check whether something is scheduled to happen and, if so, put one or more entries in the crawler's log to start processing. In practice, it selects the indexing configurations and evaluates whether any of them need to run.

Parameters
object  $pObj  Parent object (tx_crawler lib)
Returns
void

Definition at line 53 of file Hook/CrawlerHook.php.

References $GLOBALS, CrawlerHook\cleanUpOldRunningConfigurations(), BackendUtility\deleteClause(), CrawlerHook\generateNextIndexingTime(), GeneralUtility\getUserObj(), and GeneralUtility\md5int().

deleteFromIndex (   $id)

Deletes all data stored by indexed search for a given page

Parameters
int  $id  UID of the page for which all pHash entries are deleted
Returns
void

Definition at line 646 of file Hook/CrawlerHook.php.

References $GLOBALS.

Referenced by CrawlerHook\processCmdmap_preProcess(), and CrawlerHook\processDatamap_afterDatabaseOperations().

generateNextIndexingTime (   $cfgRec)

Generates the Unix timestamp for the next visit.

Parameters
array  $cfgRec  Index configuration record
Returns
int The next timestamp

Definition at line 582 of file Hook/CrawlerHook.php.

References $GLOBALS.

Referenced by CrawlerHook\crawler_init().

getUidRootLineForClosestTemplate (   $id)

Gets the rootline for the closest TypoScript template root. Uses the same algorithm as Web > Template, Object browser.

Parameters
int  $id  The page id to traverse the rootline back from
Returns
array Array containing the uid values of the rootline.

Definition at line 557 of file Hook/CrawlerHook.php.

References GeneralUtility\makeInstance().

Referenced by CrawlerHook\crawler_execute_type1(), CrawlerHook\crawler_execute_type2(), CrawlerHook\crawler_execute_type3(), and CrawlerHook\indexSingleRecord().

indexExtUrl (   $url,
  $pageId,
  $rl,
  $cfgUid,
  $setId 
)

Indexing External URL

Parameters
string  $url  URL, http://....
int  $pageId  Page id to relate indexing to.
array  $rl  Rootline array to relate indexing to
int  $cfgUid  Configuration UID
int  $setId  Set ID value
Returns
array URLs found on this page

Definition at line 478 of file Hook/CrawlerHook.php.

References $list, $url, GeneralUtility\makeInstance(), and GeneralUtility\resolveBackPath().

Referenced by CrawlerHook\crawler_execute_type3().

indexSingleRecord (   $r,
  $cfgRec,
  $rl = null 
)

Indexing Single Record

Parameters
array  $r  Record to index
array  $cfgRec  Configuration Record
array  $rl  Rootline array to relate indexing to
Returns
void

Definition at line 525 of file Hook/CrawlerHook.php.

References $GLOBALS, CrawlerHook\getUidRootLineForClosestTemplate(), GeneralUtility\makeInstance(), and GeneralUtility\trimExplode().

Referenced by CrawlerHook\crawler_execute_type1(), and CrawlerHook\processDatamap_afterDatabaseOperations().

processCmdmap_preProcess (   $command,
  $table,
  $id,
  $value,
  $pObj 
)

TCEmain hook function for on-the-fly indexing of database records

Parameters
string  $command  TCEmain command
string  $table  Table name
string  $id  Record ID. For a new record, it is a string pointing to an index inside ::substNEWwithIDs
mixed  $value  Target value (ignored)
FormEngine  $pObj  TCEmain calling object
Returns
void

Definition at line 685 of file Hook/CrawlerHook.php.

References CrawlerHook\deleteFromIndex().

processDatamap_afterDatabaseOperations (   $status,
  $table,
  $id,
  $fieldArray,
  $pObj 
)

TCEmain hook function for on-the-fly indexing of database records

Parameters
string  $status  Status "new" or "update"
string  $table  Table name
string  $id  Record ID. For a new record, it is a string pointing to an index inside ::substNEWwithIDs
array  $fieldArray  Field array of updated fields in the operation
FormEngine  $pObj  TCEmain calling object
Returns
void

Definition at line 703 of file Hook/CrawlerHook.php.

References $GLOBALS, BackendUtility\deleteClause(), CrawlerHook\deleteFromIndex(), BackendUtility\getRecord(), and CrawlerHook\indexSingleRecord().

Member Data Documentation

$callBack = CrawlerHook::class

Definition at line 43 of file Hook/CrawlerHook.php.

$instanceCounter = 0

Definition at line 38 of file Hook/CrawlerHook.php.

$secondsPerExternalUrl = 3

Definition at line 31 of file Hook/CrawlerHook.php.