国产av日韩一区二区三区精品,成人性爱视频在线观看,国产,欧美,日韩,一区,www.成色av久久成人,2222eeee成人天堂

首頁 後端開發(fā) php教程 Scraping Links With PHP

Scraping Links With PHP

Jun 23, 2016 pm 02:36 PM

Scraping Links With PHP

by justin on August 11, 2007

FROM:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

?

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn How to use cURL to get the content from a website (URL). Call PHP DOM functions to parse the HTML so you can extract links. Use XPath to grab links from specific parts of a page. Store the scraped links in a MySQL database. Put it all together into a link scraper. What else you could use a scraper for. Legal issues associated with scraping content. What You Will Need Basic knowledge of PHP and MySQL. A web server running PHP 5. The cURL extension for PHP. MySQL ? if you want to store the links. Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) {echo "<br />cURL error number:" .curl_errno($ch);echo "<br />cURL error:" . curl_error($ch);exit;}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines what URL will be requested. For example if you wanted to scrape this site you’d have $target_url = “/makebeta/”. I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT ? see below). You can read an in depth tutorial on PHP and cURL here.

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents Google ? Googlebot/2.1 ( http://www.googlebot.com/bot.html) Google Image ? Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html) MSN Live ? msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm) Yahoo ? Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) ask Browser User Agents Firefox (WindowsXP) ? Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 IE 7 ? Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30) IE 6 ? Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322) Safari ? Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2 Opera ? Opera/9.00 (Windows NT 5.1; U; en) Using PHP’s DOM Functions To Parse The HTML

PHP provides with a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM ? Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();@$dom->loadHTML($html);

Wow is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russll Beattie’s post on: Using PHP TO Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(), this suppresses some annoying warnings that the HTML parser throws on many pages that have non-standard compliant code.

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to XPath->evaluate(). I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");
Iterate And Store Your Links

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {$href = $hrefs->item($i);$url = $href->getAttribute('href');storeLink($url,$target_url);}
 
 
FULL PROGRAM:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) { echo "
cURL error number:" .curl_errno($ch); echo "
cURL error:" . curl_error($ch); exit;}$dom = new DOMDocument();@$dom->loadHTML($html);$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) { $href = $hrefs->item($i); $url = $href->getAttribute('href'); echo $url; echo "
"; }

?>
then you can store url to your database. more details from here:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content
   
REF:tutorial on PHP and cURL 
You can find a comprehensive list of user agents here: User Agents.
Using PHP TO Scrape Sites As Feeds
   
 
本網(wǎng)站聲明
本文內(nèi)容由網(wǎng)友自願投稿,版權(quán)歸原作者所有。本站不承擔相應(yīng)的法律責任。如發(fā)現(xiàn)涉嫌抄襲或侵權(quán)的內(nèi)容,請聯(lián)絡(luò)admin@php.cn

熱AI工具

Undress AI Tool

Undress AI Tool

免費脫衣圖片

Undresser.AI Undress

Undresser.AI Undress

人工智慧驅(qū)動的應(yīng)用程序,用於創(chuàng)建逼真的裸體照片

AI Clothes Remover

AI Clothes Remover

用於從照片中去除衣服的線上人工智慧工具。

Clothoff.io

Clothoff.io

AI脫衣器

Video Face Swap

Video Face Swap

使用我們完全免費的人工智慧換臉工具,輕鬆在任何影片中換臉!

熱工具

記事本++7.3.1

記事本++7.3.1

好用且免費的程式碼編輯器

SublimeText3漢化版

SublimeText3漢化版

中文版,非常好用

禪工作室 13.0.1

禪工作室 13.0.1

強大的PHP整合開發(fā)環(huán)境

Dreamweaver CS6

Dreamweaver CS6

視覺化網(wǎng)頁開發(fā)工具

SublimeText3 Mac版

SublimeText3 Mac版

神級程式碼編輯軟體(SublimeText3)

如何在PHP中實施身份驗證和授權(quán)? 如何在PHP中實施身份驗證和授權(quán)? Jun 20, 2025 am 01:03 AM

tosecurelyhandleauthenticationandationallizationInphp,lofterTheSesteps:1.AlwaysHashPasswordSwithPassword_hash()andverifyusingspasspassword_verify(),usepreparedStatatementStopreventsqlineptions,andStoreSeruserDatain usseruserDatain $ _sessiveferterlogin.2.implementrole-2.imaccessccsccccccccccccccccccccccccc.

如何在PHP中安全地處理文件上傳? 如何在PHP中安全地處理文件上傳? Jun 19, 2025 am 01:05 AM

要安全處理PHP中的文件上傳,核心在於驗證文件類型、重命名文件並限制權(quán)限。 1.使用finfo_file()檢查真實MIME類型,僅允許特定類型如image/jpeg;2.用uniqid()生成隨機文件名,存儲至非Web根目錄;3.通過php.ini和HTML表單限製文件大小,設(shè)置目錄權(quán)限為0755;4.使用ClamAV掃描惡意軟件,增強安全性。這些步驟有效防止安全漏洞,確保文件上傳過程安全可靠。

PHP中==(鬆散比較)和===(嚴格的比較)之間有什麼區(qū)別? PHP中==(鬆散比較)和===(嚴格的比較)之間有什麼區(qū)別? Jun 19, 2025 am 01:07 AM

在PHP中,==與===的主要區(qū)別在於類型檢查的嚴格程度。 ==在比較前會進行類型轉(zhuǎn)換,例如5=="5"返回true,而===要求值和類型都相同才會返回true,例如5==="5"返回false。使用場景上,===更安全應(yīng)優(yōu)先使用,==僅在需要類型轉(zhuǎn)換時使用。

如何在PHP( - , *, /,%)中執(zhí)行算術(shù)操作? 如何在PHP( - , *, /,%)中執(zhí)行算術(shù)操作? Jun 19, 2025 pm 05:13 PM

PHP中使用基本數(shù)學運算的方法如下:1.加法用 號,支持整數(shù)和浮點數(shù),也可用於變量,字符串數(shù)字會自動轉(zhuǎn)換但不推薦依賴;2.減法用-號,變量同理,類型轉(zhuǎn)換同樣適用;3.乘法用*號,適用於數(shù)字及類似字符串;4.除法用/號,需避免除以零,並註意結(jié)果可能是浮點數(shù);5.取模用%號,可用於判斷奇偶數(shù),處理負數(shù)時餘數(shù)符號與被除數(shù)一致。正確使用這些運算符的關(guān)鍵在於確保數(shù)據(jù)類型清晰並處理好邊界情況。

如何與PHP的NOSQL數(shù)據(jù)庫(例如MongoDB,Redis)進行交互? 如何與PHP的NOSQL數(shù)據(jù)庫(例如MongoDB,Redis)進行交互? Jun 19, 2025 am 01:07 AM

是的,PHP可以通過特定擴展或庫與MongoDB和Redis等NoSQL數(shù)據(jù)庫交互。首先,使用MongoDBPHP驅(qū)動(通過PECL或Composer安裝)創(chuàng)建客戶端實例並操作數(shù)據(jù)庫及集合,支持插入、查詢、聚合等操作;其次,使用Predis庫或phpredis擴展連接Redis,執(zhí)行鍵值設(shè)置與獲取,推薦phpredis用於高性能場景,Predis則便於快速部署;兩者均適用於生產(chǎn)環(huán)境且文檔完善。

我如何了解最新的PHP開發(fā)和最佳實踐? 我如何了解最新的PHP開發(fā)和最佳實踐? Jun 23, 2025 am 12:56 AM

TostaycurrentwithPHPdevelopmentsandbestpractices,followkeynewssourceslikePHP.netandPHPWeekly,engagewithcommunitiesonforumsandconferences,keeptoolingupdatedandgraduallyadoptnewfeatures,andreadorcontributetoopensourceprojects.First,followreliablesource

什麼是PHP,為什麼它用於Web開發(fā)? 什麼是PHP,為什麼它用於Web開發(fā)? Jun 23, 2025 am 12:55 AM

PHPbecamepopularforwebdevelopmentduetoitseaseoflearning,seamlessintegrationwithHTML,widespreadhostingsupport,andalargeecosystemincludingframeworkslikeLaravelandCMSplatformslikeWordPress.Itexcelsinhandlingformsubmissions,managingusersessions,interacti

如何設(shè)置PHP時區(qū)? 如何設(shè)置PHP時區(qū)? Jun 25, 2025 am 01:00 AM

tosetTherightTimeZoneInphp,restate_default_timezone_set()functionAtthestArtofyourscriptWithavalIdidentIdentifiersuchas'america/new_york'.1.usedate_default_default_timezone_set_set()

See all articles