Fahdi’s Personal Blog

Here I share my web ideas

Mining the Web using PHP

March 4

There are many ways to extract the content you want from HTML pages. For my work, I had to build a miner that pulls news from multiple sites and updates a WordPress blog with the latest news every day. I had several options for this task, among them:
- Regular expressions
- PHP string functions
- DOM parsing
- XPath

First, I used regular expressions to extract data from a static HTML page, which was a real nightmare. Because the page had no proper structure, it was hard to extract news from specific tags: some items were in div tags, some in td, and others in li. After a long effort, I still couldn’t extract the data using regular expressions.
Next, I thought of using an HTML parser, several of which are available on the internet. I tried one, but even it couldn’t get the desired data because of the complexity of the page structure. After that, I tried PHP string functions such as str_replace and trim to replace specific tags with a known id, so that parsing would become easier. Even so, getting to the desired content was a mess.

Finally, I used Firebug to analyze the page and inspect its content holders, and in doing so I found the exact XPath location of the content holder I wanted.
XPath (XML Path Language) is a language for selecting nodes from an XML document. In addition, XPath may be used to compute values (strings, numbers, or boolean values) from the content of an XML document. What I wasn’t aware of is that XPath can also be used with HTML pages, to query the DOM tree of the page.
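
To make the point about computing values concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes on an HTML string. The markup and the "news" id are made up for illustration; they are not from the page I was mining.

```php
<?php
// Hypothetical HTML fragment, invented for this example.
$html = '<html><body><ul id="news"><li>First story</li><li>Second story</li></ul></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() returns a typed result where possible:
// a number, a string, or a boolean.
$count   = $xpath->evaluate('count(//ul[@id="news"]/li)');   // number of list items
$first   = $xpath->evaluate('string(//ul[@id="news"]/li[1])'); // text of the first item
$hasNews = $xpath->evaluate('boolean(//ul[@id="news"]/li)');   // does any item exist?

var_dump($count, $first, $hasNews);
```

The same DOMXPath object also has query(), which returns the matching nodes themselves rather than a computed value.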

In the end, I had the exact path of the content holder that contained the latest, daily-updated news, and I grabbed all the content from it.
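
Grabbing the content from a known content holder might look roughly like the sketch below. The sample markup, the "latest-news" id, and the extractNews() helper are assumptions for illustration only; in a real miner the HTML would be fetched from the news site and the XPath would be the one found with Firebug.

```php
<?php
// Hypothetical page: the content holder is buried inside a layout table.
$html = <<<HTML
<html><body>
<table><tr><td>
  <div id="latest-news">
    <ul>
      <li><a href="/news/1">Headline one</a></li>
      <li><a href="/news/2">Headline two</a></li>
    </ul>
  </div>
</td></tr></table>
</body></html>
HTML;

/**
 * Return title and URL for every link inside the content holder.
 * $holderXPath is the XPath of the holder (an assumed example value below).
 */
function extractNews(string $html, string $holderXPath): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($holderXPath . '//a');
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $items = [];
    foreach ($nodes as $a) {
        $items[] = [
            'title' => trim($a->textContent),
            'url'   => $a->getAttribute('href'),
        ];
    }
    return $items;
}

$news = extractNews($html, '//div[@id="latest-news"]');
print_r($news);
```

Note that the holder is found by its id no matter how deeply it is nested in div, td, or li tags, which is exactly what made the regular expression approach so painful.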

I came to the conclusion that, for mining data from the web, XPath proved to be the best and most efficient of the approaches I tried.

When the content is updated, we need nothing more than the XPath of the content holder, and we can get the content in a single shot. Regular expressions and the other approaches were very inefficient and time-consuming.
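
One way to take advantage of this is to keep a single XPath expression per source site, so that each run (or a change in a site's layout) touches only one string. This is a sketch under my own assumptions: the site names, expressions, sample pages, and the mineSite() helper are all hypothetical.

```php
<?php
// Hypothetical configuration: one content-holder XPath per source site.
$sources = [
    'site-a' => '//div[@id="news"]//li',
    'site-b' => '//td[@class="headlines"]/a',
];

// Stand-in pages; a real miner would download these on every run.
$pages = [
    'site-a' => '<html><body><div id="news"><ul><li>Alpha</li><li>Beta</li></ul></div></body></html>',
    'site-b' => '<html><body><table><tr><td class="headlines"><a href="#">Gamma</a></td></tr></table></body></html>',
];

/** Return the trimmed text of every node matching $expr in $html. */
function mineSite(string $html, string $expr): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($expr);
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$results = [];
foreach ($sources as $site => $expr) {
    $results[$site] = mineSite($pages[$site], $expr);
}
print_r($results);
```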

Thanks to Farhan for sharing his valuable expertise with us on Fahd Murtaza’s Blog.

2 Comments to

“Mining the Web using PHP”

  1. On May 8th, 2008 at 4:05 am tre Says:

    Hi Fahdi!

    Thanks for the entry. I am planning on doing one myself. Firebug comes in very handy doesn’t it.

    What I am concerned about is the fragility of this method. One minor change to the data container’s structure by the site owner and bam, our miner won’t work properly.

    I really hope that someday all data will be in XML of some sort, making it more portable. And that everyone will ditch storing non-tabular data in TABLE.

    I’ll look around the blog now ;)

  2. On August 5th, 2008 at 4:49 pm Fahd Murtaza Says:

    yeah but you always have dependencies tre. You are right about the changing structure of the site. But that gives me another direction. Making smart spider sort of thing to analyze the xpaths. Maybe I sound much like Page and Brin but at least I can think about it.

    Yeah I hope the future is of XML as we seriously need to either build strong API or we need to have some strong rule about XML specification of a site.

    There should be standards adopted in order to make XML more usable throughout the web. I wonder with XSLT what can’t be done that is usually achieved with XHTML.

    Hope you found the rest of the blog interesting. At least I can expect so lol :P.
