Web Scraping With VBA

Philip Treacy

October 17, 2018

We can use VBA to retrieve webpages and comb through those pages for data we want. This is known as web scraping.

This post will look at getting data from a single web page. I've written another post that deals with getting data from multiple web pages.


Here I'm going to use ServerXMLHTTP which provides a means to communicate with websites via VBA.

NOTE: ServerXMLHTTP is intended to be used for communication between web servers. Microsoft also provide XMLHTTP which is intended for use by programs (like my VBA) to access websites.

In testing I found XMLHTTP to encounter errors accessing sites where ServerXMLHTTP did not, so I stuck with ServerXMLHTTP.

To use either you will need to set a reference in your VBA project.

Web scraping can be frowned upon if it puts too much load onto the web site, but there are legitimate reason for doing it. Just check the web site you are going to use to make sure you aren't violating their terms, and never write code that puts excessive load onto a site.

ServerXMLHTTP

Before you can use ServerXMLHTTP you need to set a reference to it in your VBA project.

Open your VBA editor (the keyboard shortcut is ALT+F11) and with your workbook selected in the VBAProject window, go to Tools -> References.

Look for Microsoft HTML Object Library and check the box beside it. Then click OK.

Reference to Microsoft HTML Object Library

Extracting Data from a Web Page

Depending on what we want, we may need to dig around in the web page to understand how that page is constructed and locate what we are after.

Load a web page in your browser (or just use this one), right click on the page and in the pop-up menu click on 'View Source' (or similar wording).

You'll see something like this

Raw HTML source

That's just some of the code that makes up our home page.

So how do you find what you want? If it's something buried in the code we have to use the Inspector built in to your browser.

You can see how to do this in Web Scraping Multiple Pages, but we don't need to do that for this example.

Extracting Links

Let's say we want to get the web addresses of the social media presence for a company.

NOTE : in web speak, a web address is referred to as a URL - Uniform Resource Locator.

The URL's for the social media profiles will be links on the web page. If you look at our home page, right at the bottom are links to our YouTube channel, Facebook, Twitter feed, Google + and our RSS feed.

social media links

If you hover your mouse over each of these social media icons you'll see in the status bar at the bottom of your browser, the URL, e.g. for our YouTube channel https://www.youtube.com/user/MyOnlineTrainingHub

hovering over social media links

So to find that channel URL on our home page I could search the page for the string "youtube". Likewise for the other social media platforms I could search for "facebook", "twitter" etc.

Bear in mind that I am assuming the only link to YouTube and the other sites on our home page is to our social media profiles. Searching just for "youtube" could find a link to a YouTube video.

I know that this isn't the case for us, but you need to make sure on the sites you are working on.

Writing the Code

After declaring variables I need to create a ServerXMLHTTP object. I'll use this to request data from a website and check what response it sends.

ServerXMLHTTP object

Download the Example Workbook

The VBA code used in this post is in the example workbook. Download a copy for yourself.

Enter your email address below to download the workbook.

By submitting your email address you agree that we can email you our Excel newsletter.

I'm reading a list of URL's from the ActiveSheet starting at cell A22, working my way down column A, and checking each site in turn.

There are two parts to getting a web page. First you must prepare the request using .Open, then you .send the request.

ServerXMLHTTP request

If the .send generates an error, maybe because the URL is badly formatted, you need to handle this so I have turned off default error handling using 'On Error Resume Next'. I turn it on again later in the code.

I can check the site responded by checking that no error occurred (Err.Number = 0)

processing ServerXMLHTTP response

It's possible for the site to send a valid response, but still not send you what you were expecting. Perhaps the page you requested is forbidden, or the web server encountered an error.

So I also need to check the Status of the response is 200, which means everything is OK. Check other HTTP status codes you may encounter.

If the site sucessfully returned the webpage I can now store that page in my HTML variable

storing ServerXMLHTTP response

To find the links I am after I can search through the webpage for <a> tags. Web pages consist of lots of HTML elements created by these tags. Along with CSS and JavaScript, HTML forms the core of almost all web pages.

If we look at the source code of our home page again we see that the link to our YouTube channel looks like this

anchor link

Notice the start of the <a tag at the beginning of the line.

I can get a list of all links on the page

get all anchor links

Now it's just a case of looping through all the links to find the ones I want, and writing those out to the sheet.

for loop

Reference

Frequently asked questions about ServerXMLHTTP

https://support.microsoft.com/en-us/help/290761/frequently-asked-questions-about-serverxmlhttp

XML DOM methods

https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms757828(v%3dvs.85)

25 thoughts on “Web Scraping With VBA”

  1. Hello everyone, my name is Maurizio
    I am writing to you because: I have carefully studied your Web Scraping With VBA project on how to get links from a web page.
    And so far everything is ok
    But if I wanted to modify your project in such a way that it gets me some text extracted from the web page

    My site in question is:
    https://www.worldweatheronline.com/San-Giusto-Canavese-Weather/Piemonte/it.aspx

    And the data I would like to get would be this (Set OggCol (0) = .getElementsByClassName (“col-sm-12”))

    Which in turn is displayed by this code:
    (Sheets (“Sheet1”). Cells (1, 1) = OggCol (0) .innerText)

    But if I add this system in your listing, my data is not displayed
    Why and how can I solve this problem?
    Thanks Greetings from Maurizio

    Reply
    • Hi Maurizio,

      Please start a topic on the forum and post your file there so we can see everything you’ve done and we don’t have to recreate the code.

      Regards

      Phil

      Reply
  2. HI, please, how could use VBA Selenium Basic to manage an existing Chrome webpage? I don’t want vba open a new windows; i’d like that the webpage in Chrome already existing was managed by vba. thank you very much NIcola

    Reply
    • Hi Nicola,

      I’m not sure if that can be done. Doing a web search shows various results that suggest it may be possible with Selenium, but it appears to require the use of another language like Python or Java. I didn’t see anything about Selenium Basic and VBA.

      Regards

      Phil

      Reply
  3. Hi, thank you very much for this working sample !!

    Is it possible to change the code in a way it writes the full source-code of a website to the next column in the worksheet?

    Thanks in advance.

    Reply
    • Hi Jan,

      When you say full source code, do you mean the HTML for the entire page? If so, then yes. When the website responds you can dump the whole page like this:

      Range(“A1”).Value = http.responseText

      Regards

      Phil

      Reply
    • Hi Vincent,
      At the end of the article, you have 2 more links, see the “More Web Scraping Posts” section.
      Cheers,
      Catalin

      Reply
  4. Hi, i need to scrap prices from hundreds of URL (same domain). In html is price in tag

    The price is 55,00 and i need parce only them from each URL. The text between ” ” is different in each URL. Is it possible to edit your .xlsm file workbook to scraping price?

    in tag meta itemProp=”description”

    In this tag is price and some other text. I need to parce only price.

    Reply
    • Hi Juraj,

      Yes I imagine this can be done but I’ll need more information.

      Please post your question on our forum and supply info like the website you want to get prices from.

      Regards

      Phil

      Reply
      • Hi Ron,

        Our firewall may be blocking you. If this is the case, the firewall will issue a response to the VBA code, so as the code is written it will not take that as an error.

        You can examine what response you are getting by uncommenting some of the Debug.Print statements I left in the code for exactly this purpose.

        Find the code

        ‘Debug.Print Len(.responseText)
        ‘Debug.Print .responseText

        and remove the ‘ from the start of each line.

        Now if you step through the code from the VBA editor (using F8) you can see what the response text is, it’ll be printed to the Immediate window.

        This article explains more about debugging vba.

        Cheers

        Phil

        Reply
    • Hi Ananya
      You have to store the current date-time before .send command, like: Startime=NOW()
      Then, before reading the response .responseText, compare the current time with the stored time: Debug.Print (Now()-StartTime)/24/60

      Reply
  5. Is it possible to download documents Directly from ERP Softwares ( like SAP , Infor ln )
    In that softwares we need to login , opens a special ( required ) sessions, gives required input , selects the devices & to click print option. Later that only back ground code will execute and we get the data.
    All of these things can we do directly from excel ( using VBA )

    Reply
    • Hi Sudheer,

      I would use Power Query to get data from external databases as it can handle the login credentials where there are built in connectors. For example, Power Query has a connector for SAP HANA databases. You can also use ODBC drivers with Power Query to get data from databases that don’t have a built in connector. You can find out more about Power Query on my Power Query course page here.

      Mynda

      Reply
  6. Thanks for providing this content.
    However it doesn’t work for me, without changing anything in the downloaded excel, the sites return ‘Error with website address’, even though they have

    http:// / https://

    at the start??
    Could be issues with my firewall / web security at work.

    Reply
    • If you want to get tables from the HTML you need to look for “table”.

      If you look for “tr” you will get all rows from all tables but you may not know what row belongs to what table.

      Phil

      Reply
  7. I am using iMac. I tried to set reference to “Microsoft HTML object Library”. I could not find it. Is it available for iMac or is it available only windows based systems ?

    Reply
    • A timeout from the website you are trying to access?

      What is the HTTP status code returned?

      Can you access the website by copying/pasting the EXACT URL you are using in VBA into your browser?

      Remember the URL must be properly formatted in VBA

      If you open topic on the forum and supply your file I can have a look at it for you.

      Regards

      Phil

      Reply

Leave a Comment

Current ye@r *