php - Using Goutte with Symfony2 in Controller


I'm trying to scrape a page, and I'm not very familiar with PHP frameworks, so I've been trying to learn Symfony2. I have it up and running, and now I'm trying to use Goutte. It's installed in the vendor folder, and I have a bundle I'm using for my scraping project.

The question is: is it good practice to do the scraping from a controller? And how? I have searched forever and cannot figure out how to use Goutte from a bundle, since it's buried deep within the file structure.

    <?php

    namespace Ontf\ScraperBundle\Controller;

    use Symfony\Bundle\FrameworkBundle\Controller\Controller;
    use Goutte\Client;

    class ThingController extends Controller
    {
        public function somethingAction($something)
        {
            $client = new Client();
            $crawler = $client->request('GET', 'http://www.symfony.com/blog/');
            echo $crawler->text();

            return $this->render('ScraperBundle:Thing:index.html.twig');

            // return $this->render('ScraperBundle:Thing:index.html.twig', array(
            //     'something' => $something
            // ));
        }
    }

I'm not sure I have heard of any "good practices" as far as scraping goes, but you may be able to find some in the book php|architect's Guide to Web Scraping with PHP.

These are some guidelines I have used in my own projects:

  1. Scraping is a slow process, so consider delegating that task to a background process.
  2. The background process is normally run as a cron job that executes a CLI application, or as a worker that is constantly running.
  3. Use a process control system to manage your workers. Take a look at supervisord.
  4. Save every scraped file (the "raw" version), and log every error. This will enable you to detect problems. Use Rackspace Cloud Files or AWS S3 to archive these files.
  5. Use the Symfony2 Console tool to create the commands that run your scraper. You can save the commands in your bundle under the Command directory.
  6. Run your Symfony2 commands using the following flags to prevent running out of memory: php app/console scraper:run example.com --env=prod --no-debug. Here app/console is where the Symfony2 console application lives, scraper:run is the name of your command, example.com is an argument indicating the page you want to scrape, and --env=prod --no-debug are the flags you should use to run in production. See the code below for an example.
  7. Inject the Goutte client into your command as such:

Ontf/ScraperBundle/Resources/services.yml

    services:
        goutte_client:
            class: Goutte\Client

        scrapercommand:
            class: Ontf\ScraperBundle\Command\ScraperCommand
            arguments: ["@goutte_client"]
            tags:
                - { name: console.command }

And your command should look like this:

    <?php
    // Ontf/ScraperBundle/Command/ScraperCommand.php
    namespace Ontf\ScraperBundle\Command;

    use Symfony\Component\Console\Command\Command;
    use Symfony\Component\Console\Input\InputArgument;
    use Symfony\Component\Console\Input\InputInterface;
    use Symfony\Component\Console\Input\InputOption;
    use Symfony\Component\Console\Output\OutputInterface;
    use Goutte\Client;

    // Not abstract: the container has to instantiate this class as a service.
    class ScraperCommand extends Command
    {
        private $client;

        public function __construct(Client $client)
        {
            $this->client = $client;
            // The parent constructor calls configure().
            parent::__construct();
        }

        protected function configure()
        {
            $this
                ->setName('scraper:run')
                ->setDescription('Run the Goutte scraper.')
                ->addArgument(
                    'url',
                    InputArgument::REQUIRED,
                    'The URL you want to scrape.'
                );
        }

        protected function execute(InputInterface $input, OutputInterface $output)
        {
            // Fetch the page passed on the command line and print its text content.
            $url = $input->getArgument('url');
            $crawler = $this->client->request('GET', $url);
            echo $crawler->text();
        }
    }
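
Guideline 4 (save the raw version of every scraped file and log every error) could be sketched roughly as follows. This is only an illustration in plain PHP: the function names archive_raw_page and scrape_and_archive, the file naming scheme, and the log format are all assumptions, not part of Goutte or Symfony2. It writes to the local disk; uploading to Rackspace Cloud Files or AWS S3 would replace the file_put_contents call.

```php
<?php
// Hypothetical helpers for illustration; names and layout are assumptions.

/**
 * Store the raw response body under $archiveDir and return the file path.
 */
function archive_raw_page($archiveDir, $url, $body)
{
    if (!is_dir($archiveDir) && !mkdir($archiveDir, 0775, true) && !is_dir($archiveDir)) {
        throw new RuntimeException("Cannot create archive directory: $archiveDir");
    }

    // Timestamp plus a hash of the URL names the file (one-second granularity).
    $path = sprintf('%s/%s-%s.html', rtrim($archiveDir, '/'), date('Ymd-His'), sha1($url));

    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("Cannot write archive file: $path");
    }

    return $path;
}

/**
 * Call $fetch($url), archive the returned body, and log any failure.
 * $fetch is any callable that returns the raw body as a string.
 * Returns the archive path on success, or null if something went wrong.
 */
function scrape_and_archive($fetch, $url, $archiveDir, $errorLog)
{
    try {
        $body = call_user_func($fetch, $url);

        return archive_raw_page($archiveDir, $url, $body);
    } catch (Exception $e) {
        // Append one line per error so problems can be traced later.
        $line = sprintf("[%s] %s: %s\n", date('c'), $url, $e->getMessage());
        file_put_contents($errorLog, $line, FILE_APPEND);

        return null;
    }
}
```

Inside the command's execute() method, $fetch would presumably be a closure that calls $this->client->request('GET', $url) and returns the raw response body, so that the worker keeps running when a single page fails.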
