cURL Html output different from original page when rendered

I am working on a project that involves fetching pages with cURL or file_get_contents. The problem is that when I echo the fetched HTML, the output looks different from the original page: not all images show up. Is there a solution? My code:

<?php
// Get the url
$url = "http://www.google.com";

// Get the html of url
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    //$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.X.Y.Z Safari/525.13.";
    $userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = file_get_contents($url);
echo $html;
?>

Thanks

Answer1:

You should use <base> to specify a base URL for all relative links.

If you curl http://example.com/thisPage.html, add a base tag of <base href="http://example.com/" /> to your echoed output. This should technically be in the <head>, but this will work:

echo '<base href="http://example.com/" />';
echo $html;

A live example with <base> renders correctly; the same page is broken without <base>.
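
Since the tag technically belongs in the <head>, one way to put it there is sketched below. This is only an illustration under stated assumptions: the helper name inject_base is hypothetical (it is not part of the answer), the fetched page is assumed to be an ordinary HTML string, and a regular expression is used instead of a real HTML parser. It inserts the tag after the first opening <head> tag, falls back to prepending it when no <head> is found, and leaves the page untouched if it already has a <base> tag.

```php
<?php
// Hypothetical helper (not from the answer): place a <base> tag inside <head>.
function inject_base(string $html, string $baseHref): string
{
    $tag = '<base href="' . htmlspecialchars($baseHref, ENT_QUOTES) . '" />';

    // Leave the page alone if it already declares a base URL.
    if (preg_match('/<base\b/i', $html)) {
        return $html;
    }

    // Insert right after the first opening <head ...> tag, in any letter case
    // and with or without attributes.
    $count = 0;
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($tag): string {
            return $m[0] . "\n" . $tag;
        },
        $html,
        1,
        $count
    );

    // No <head> found: fall back to prepending the tag, as in the echo above.
    return $count > 0 ? $result : $tag . $html;
}

// Example with a tiny hand-written page instead of a fetched one.
echo inject_base('<html><head><title>t</title></head><body></body></html>', 'http://example.com/');
```

With a fetched page this would be called as echo inject_base($html, 'http://example.com/');.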

Answer2:

Use this:

<?php
// Get the url (same value as in the question)
$url = "http://www.google.com";

// Get the html of url
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    //$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.X.Y.Z Safari/525.13.";
    $userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Fetch the page
$page = get_data($url);

// Build the base URL from the host and the directory of the requested path
$parse = parse_url($url);
$path = isset($parse['path']) ? $parse['path'] : '/';
$count = "http://" . $parse['host'] . rtrim(dirname($path), "/\\") . "/";

// Insert the <base> tag right after the opening <head> tag
$page = str_replace("<head>", "<head>\n<base href=\"" . $count . "\" />", $page);
$page = str_replace("<HEAD>", "<head>\n<base href=\"" . $count . "\" />", $page);
echo $page;
?>
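
One caveat with building the base from dirname(): for a URL that ends in a slash, such as http://example.com/docs/, dirname() returns the parent directory, so relative links would resolve one level too high. The sketch below shows one possible alternative that keeps everything up to and including the last slash of the path. The function name base_href_for is hypothetical, and the sketch assumes an absolute http or https URL; it is not the answer's own method.

```php
<?php
// Hypothetical helper (not from the answer): derive a <base> href from a page URL.
// Assumes $url is an absolute URL such as http://example.com/docs/page.html.
function base_href_for(string $url): string
{
    $parts  = parse_url($url);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path   = $parts['path'] ?? '/';

    // Keep everything up to and including the last slash, so a directory
    // URL such as /docs/ stays /docs/ instead of collapsing to /.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    // The query string and fragment are dropped on purpose.
    return $scheme . '://' . $host . $port . $dir;
}

// Print the base href for a few sample URLs.
foreach (['http://www.google.com', 'http://example.com/thisPage.html', 'http://example.com/docs/'] as $sample) {
    echo base_href_for($sample), "\n";
}
```

The returned string can then be used as the href of the <base> tag in place of $count.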
