<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fahd Murtaza &#187; data-mining</title>
	<atom:link href="http://www.fahdmurtaza.com/myblog/category/data-mining/feed" rel="self" type="application/rss+xml" />
	<link>http://www.fahdmurtaza.com/myblog</link>
	<description>Portfolio &#38; Blog</description>
	<lastBuildDate>Sat, 04 Feb 2012 06:47:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Mining the Web using PHP</title>
		<link>http://www.fahdmurtaza.com/myblog/2008/03/04/mining-the-web-using-php.html</link>
		<comments>http://www.fahdmurtaza.com/myblog/2008/03/04/mining-the-web-using-php.html#comments</comments>
		<pubDate>Tue, 04 Mar 2008 10:59:32 +0000</pubDate>
		<dc:creator>farhan042</dc:creator>
				<category><![CDATA[data-mining]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[Web Development Software]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[minind]]></category>
		<category><![CDATA[open]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.fahdmurtaza.com/myblog/2008/03/04/mining-the-web-using-php.html</guid>
		<description><![CDATA[There are many ways to extract or fetch the desired data or content from HTML pages. During my work, I had to build a miner to get news from multiple sites &#38; update a WordPress blog with the latest news daily. I had many choices to accomplish this task; a few of them are - Regular expressions [...]]]></description>
			<content:encoded><![CDATA[
<p>There are many ways to extract or fetch the desired data or content from HTML pages. During my work, I had to build a miner to get news from multiple sites &amp; update a WordPress blog with the latest news daily. I had many choices to accomplish this task; a few of them are:<br />
- Regular expressions<br />
- PHP string functions<br />
- DOM parsing<br />
- XPath</p>
<p>First of all, I used regular expressions to<span id="more-265"></span> extract data from a static HTML page, which was really a nightmare for me. As the page had no proper structure, it was really hard to extract news from specific tags: some items were in div tags, some in td tags &amp; others in li tags. After a long effort, I still couldn&#8217;t extract the data using regular expressions.<br />
Next, I thought of using an HTML parser; several are available on the internet. I used one to get the data of my choice, but even it couldn&#8217;t get the desired data because of the complexity of the page structure. After that, I tried PHP string functions such as <em>str_replace</em>, <em>trim</em> etc. to replace specific tags with known ids so that parsing would get easier. But it was still really a mess to get to the desired content.</p>
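<p>To illustrate why the regular-expression route gets painful, here is a minimal sketch. The markup and the <em>news</em> class are made up for this example and are not taken from the sites I mined; the point is only that a pattern written for one tag misses the same kind of item in another tag, so the pattern has to grow with every new case.</p>

```php
<?php
// Hypothetical markup: the same kind of headline wrapped in three different tags.
$html = '<div class="news">Headline A</div>'
      . '<td class="news">Headline B</td>'
      . '<li class="news">Headline C</li>';

// A pattern written for one tag only finds the items that use that tag.
preg_match_all('#<div class="news">(.*?)</div>#s', $html, $divOnly);

// Covering every tag means widening the pattern for each new case.
preg_match_all('#<(div|td|li) class="news">(.*?)</\1>#s', $html, $anyTag);

print_r($divOnly[1]); // captures from the div-only pattern
print_r($anyTag[2]);  // captures from the widened pattern
```

<p>Even the widened pattern assumes a fixed attribute order and no nesting, which real pages rarely guarantee.</p>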
<p>Finally, I used Firebug to analyze the page and see the content holders, and that is how I came to find the exact XPath location of the desired content holder.<br />
XPath (XML Path Language) is a language for selecting nodes from an XML document. In addition, XPath may be used to compute values (strings, numbers, or boolean values) from the content of an XML document. Until then, I wasn&#8217;t aware that XPath can also be used with HTML pages, to query the DOM tree of a page.</p>
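<p>A minimal sketch of running XPath against HTML in PHP, using the built-in <em>DOMDocument</em> and <em>DOMXPath</em> classes. The page fragment, the <em>content</em> id and the XPath expression are assumptions made for illustration; in practice the expression would be whatever Firebug reports for the content holder.</p>

```php
<?php
// Hypothetical page fragment; in practice this would be the fetched HTML.
$html = '<html><body><div id="sidebar"><ul><li>Ad</li></ul></div>'
      . '<div id="content"><ul><li>First story</li><li>Second story</li></ul></div>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Assumed XPath: every <li> under the div whose id is "content".
$nodes = $xpath->query('//div[@id="content"]/ul/li');

$stories = [];
foreach ($nodes as $node) {
    $stories[] = trim($node->textContent);
}
print_r($stories);
```

<p>The sidebar list is ignored because the expression names the content holder explicitly, whatever tags surround it.</p>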
<p>In the end, I got the exact path of the content holder, which had the latest news, updated daily, and I grabbed all the content from it.</p>
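<p>Grabbing everything inside the content holder, markup included, could look roughly like this. The <em>latest-news</em> id and the sample HTML are hypothetical; the sketch relies on <em>DOMDocument::saveHTML()</em> accepting a node argument, which needs PHP 5.3.6 or later.</p>

```php
<?php
// Hypothetical content holder; the id "latest-news" is an assumption for this sketch.
$html = '<html><body><div id="latest-news"><p>Story <b>one</b></p><p>Story two</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$holder = $xpath->query('//div[@id="latest-news"]')->item(0);

// Serialise each child of the holder so the inner markup is kept, not just the text.
$inner = '';
if ($holder !== null) {
    foreach ($holder->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}
echo $inner;
```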
<p>I came to the conclusion that, for mining data from the web, XPath proved to be the best &amp; most efficient of the approaches I tried.</p>
<p>For updates, we don&#8217;t need anything except the XPath of the content holder, &amp; we can get the content in a single shot. Regular expressions &amp; the other ways were very inefficient &amp; time consuming.</p>
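<p>One way to organise a multi-site miner around this idea is to keep a single XPath per source and reuse one extraction function for all of them. This is only a sketch: the <em>extractText</em> helper, the source names and the inline HTML are all hypothetical, and a real miner would fetch each page over HTTP instead.</p>

```php
<?php
// Hypothetical helper: return the trimmed text of every node matched by $query.
function extractText(string $html, string $query): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = (new DOMXPath($doc))->query($query);
    $texts = [];
    if ($result !== false) {
        foreach ($result as $node) {
            $texts[] = trim($node->textContent);
        }
    }
    return $texts;
}

// Hypothetical sources: each one needs only the XPath of its own content holder.
$sources = [
    'site-a' => [
        'html'  => '<div id="news"><p>A1</p><p>A2</p></div>',
        'xpath' => '//div[@id="news"]/p',
    ],
    'site-b' => [
        'html'  => '<table><tr><td class="item">B1</td></tr></table>',
        'xpath' => '//td[@class="item"]',
    ],
];

$latest = [];
foreach ($sources as $name => $source) {
    $latest[$name] = extractText($source['html'], $source['xpath']);
}
print_r($latest);
```

<p>When a site changes its layout, only the XPath string for that source has to be updated; the extraction code stays the same.</p>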
<p>Thanks to Farhan for sharing his valuable expertise with us on <a href="http://www.fahdmurtaza.com/myblog/" target="_blank"><em>Fahd Murtaza&#8217;s Blog</em></a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.fahdmurtaza.com/myblog/2008/03/04/mining-the-web-using-php.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

