php - Using Goutte with Symfony2 in Controller
I'm trying to scrape a page. I'm not familiar with PHP frameworks, so I've been trying to learn Symfony2. I have it up and running, and now I'm trying to use Goutte. It's installed in the vendor folder, and I have a bundle that I'm using for my scraping project.
The question is: is it good practice to do the scraping from a controller? And if so, how? I have searched forever and cannot figure out how to use Goutte from a bundle, since it's buried deep within the file structure.
    <?php

    namespace Ontf\ScraperBundle\Controller;

    use Symfony\Bundle\FrameworkBundle\Controller\Controller;
    use Goutte\Client;

    class ThingController extends Controller
    {
        public function somethingAction($something)
        {
            $client = new Client();
            $crawler = $client->request('GET', 'http://www.symfony.com/blog/');
            echo $crawler->text();

            return $this->render('ScraperBundle:Thing:index.html.twig');

            // return $this->render('ScraperBundle:Thing:index.html.twig', array(
            //     'something' => $something
            // ));
        }
    }
I'm not sure I have heard of "good practices" as far as scraping goes, but you may be able to find some in the book PHP Architect's Guide to Web Scraping with PHP.
These are some guidelines I have used in my own projects:
- Scraping is a slow process, so consider delegating that task to a background process.
- The background process usually runs as a cron job executing a CLI application, or as a worker that is constantly running.
- Use a process control system to manage your workers. Take a look at supervisord.
- Save every scraped file (the "raw" version), and log every error. This will enable you to detect problems. Use Rackspace Cloud Files or AWS S3 to archive these files.
- Use the Symfony2 Console tool to create the commands that run your scraper. You can save the commands in your bundle under the Command directory.
- Run your Symfony2 commands using the following flags to prevent running out of memory:
php app/console scraper:run example.com --env=prod --no-debug
Here app/console is where the Symfony2 console application lives, scraper:run is the name of your command, example.com is an argument indicating the page you want to scrape, and --env=prod --no-debug are the flags you should use to run in production. See the code below for an example.
- Inject the Goutte Client into your command as such:
    # Ontf/ScraperBundle/Resources/services.yml
    services:
        goutte_client:
            class: Goutte\Client

        scrapercommand:
            class: Ontf\ScraperBundle\Command\ScraperCommand
            arguments: ["@goutte_client"]
            tags:
                - { name: console.command }
And your command should look like this:

    <?php
    // Ontf/ScraperBundle/Command/ScraperCommand.php

    namespace Ontf\ScraperBundle\Command;

    use Symfony\Component\Console\Command\Command;
    use Symfony\Component\Console\Input\InputArgument;
    use Symfony\Component\Console\Input\InputInterface;
    use Symfony\Component\Console\Output\OutputInterface;
    use Goutte\Client;

    // Not abstract: the service container has to instantiate this class.
    class ScraperCommand extends Command
    {
        private $client;

        public function __construct(Client $client)
        {
            $this->client = $client;

            // Call the parent constructor last: it invokes configure().
            parent::__construct();
        }

        protected function configure()
        {
            $this
                ->setName('scraper:run')
                ->setDescription('Run Goutte Scraper.')
                ->addArgument(
                    'url',
                    InputArgument::REQUIRED,
                    'The URL you want to scrape.'
                );
        }

        protected function execute(InputInterface $input, OutputInterface $output)
        {
            $url = $input->getArgument('url');
            $crawler = $this->client->request('GET', $url);

            // Write through the console output instead of echo.
            $output->writeln($crawler->text());
        }
    }
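The guideline about saving every raw file and logging every error could be sketched roughly as follows. This is only an illustration in plain PHP: archiveRawPage() and logScrapeError() are hypothetical helper names, not part of Goutte or Symfony, and the local directory stands in for whatever storage (Cloud Files, S3) you actually use.

```php
<?php
// Hypothetical helpers for illustration; neither function comes from Goutte or Symfony.

/**
 * Save the raw HTML of a scraped page under $archiveDir and return the file path.
 */
function archiveRawPage($url, $html, $archiveDir)
{
    if (!is_dir($archiveDir) && !mkdir($archiveDir, 0775, true)) {
        throw new RuntimeException("Cannot create archive directory: $archiveDir");
    }

    // A timestamp plus a hash of the URL gives a filesystem-safe name
    // that tells pages and runs apart (down to the second).
    $fileName = sprintf('%s-%s.html', date('Ymd-His'), sha1($url));
    $path = rtrim($archiveDir, '/') . '/' . $fileName;

    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException("Cannot write raw page to: $path");
    }

    return $path;
}

/**
 * Append one line per failure to a plain-text log file.
 */
function logScrapeError($url, $message, $logFile)
{
    $line = sprintf("[%s] %s: %s\n", date('c'), $url, $message);
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

Inside execute(), one option would be to wrap the request in a try/catch, pass `$this->client->getResponse()->getContent()` to archiveRawPage() on success, and pass the exception message to logScrapeError() on failure. The archive directory and log path are assumptions you would supply yourself, for example as extra command arguments.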