
 Newsletter - July, 1999

 

Web Scraping with Visual Basic 

Back in the 1980s, a hot topic was "screen scraping": quick-and-dirty applications that pretended to be 3270 or VT100 terminals when talking to legacy systems. These applications were not reliable, but they were cheap, and they provided lots of bang for the buck compared to doing it the proper way.

With web farming, we are in a similar situation with web scrapers: applications that process the HTML of a web page to extract meaningful data. Again, web scrapers are not reliable, but they are cheap and useful.

Let's examine a simple application that I have been operating for several months. It is written in Microsoft Visual Basic version 6. The application retrieves a web page from Amazon.com that contains a description of my Web Farming book. Within that description, it extracts the sales rank of the book and writes that value to a file.


Amazon.com calculates the sales rank from its own sales and updates it regularly. The top 10,000 best sellers are updated each hour to reflect sales over the previous 24 hours. The next 100,000 are updated once a day. The rest of the list is updated monthly, based on various factors. The lower the number, the higher the sales for that title. With more than two million books ranked, this number is the closest thing a book has to a stock price!

If you examine the HTML source for this page, you will find the following fragment amid lots of garbage:

    <font face=verdana,arial,helvetica size=-1>
    <b>Amazon.com Sales Rank: </b>
    4,518
    </font><br>

The task is to retrieve the HTML as a text string, search for an initial pre-pattern, search for a post-pattern, extract the text between the two patterns, convert the text to a numeric value, and write it to a file.

Sounds simple? Actually, it is! Here are the steps:

  1. Open Visual Studio for VB and create a new standard EXE.
  2. Under the Project menu, click on Components to open the Components dialog box. Check the box for Microsoft Internet Transfer Control 6.0 (which loads MSINET.OCX), and then click OK. This is the secret ingredient!
  3. A new icon (a world with a terminal in front) should now appear in the toolbox at the left. Click on this icon, then drag out a square on the form. The control is named Inet1 by default.
  4. Also add a label with the caption 'Ranking', a textbox, and a command button with the caption 'Probe for Amazon Sales Ranking'. By default these are named Label1, Text1, and Command1. The form is not pretty, but it is simple.
  5. Now for the fun stuff! Double click on the button to open the code window. For the routine Command1_Click, copy and paste the following. Watch for extra line breaks.

    Private Sub Command1_Click()
      ' each variable needs its own As String; otherwise it defaults to Variant
      Dim strPage As String, strISBN As String, strURL As String
      On Error Resume Next

      ' set the proper URL to Amazon.com asking for the specific book
      strISBN = "1558605037" ' ISBN for the Web Farming book
      strURL = "http://www.amazon.com/exec/obidos/ASIN/" _
           & strISBN & "/"
      ' get the web page content using the Inet control
      strPage = Inet1.OpenURL(strURL, icString)
      ' put the ranking value into the textbox
      Text1.Text = GetRank(strPage, "Sales Rank: </b>", "</font>")
    End Sub

    All the work is performed by the OpenURL method of the Inet control. If there is an error, On Error Resume Next lets execution continue with strPage still empty, so GetRank finds no match and the text box is left blank.

  6. One more piece of code is required. The GetRank routine must parse all of that messy HTML and extract the sales ranking. Note that the arguments are the HTML buffer, the pattern prior to the ranking, and the pattern after the ranking. Here is the code to copy and paste after the previous routine. Trust me; it works!

    Private Function GetRank(ByVal strPage As String, ByVal strPrePat As String, _
                             ByVal strPostPat As String) As String
      ' Long rather than Integer: offsets into a page can exceed 32,767
      Dim iStart As Long, iEnd As Long, i As Long
      Dim strIn As String, strOut As String, strChar As String

      GetRank = ""
      iStart = InStr(1, strPage, strPrePat) ' find first pattern
      If iStart <> 0 Then
        iStart = iStart + Len(strPrePat)    ' skip past the first pattern
        iEnd = InStr(iStart, strPage, strPostPat) ' find second pattern
        If iEnd <> 0 Then
          strIn = Mid(strPage, iStart, iEnd - iStart)
          strOut = ""
          For i = 1 To Len(strIn) ' strip out control chars
            strChar = Mid(strIn, i, 1)
            If strChar < " " Then
              strOut = strOut & " "   ' add a blank instead
            Else
              strOut = strOut & strChar
            End If
          Next i
          GetRank = Trim(strOut)    ' return extracted value
        End If
      End If
    End Function

  7. Now save and run. Be sure that your connection to the Internet is active. After clicking the button, there will be a pause and a number should appear in the text box. Hopefully, the number will be a low one for such an excellent book!
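To check the parsing without a live connection, GetRank can be run against the HTML fragment shown earlier. The following is only a sketch and not part of the original application: the names RankToNumber and TestGetRank are hypothetical, and the code assumes it is pasted into the same form module as GetRank. It also covers the "convert the text to a numeric value" step, which GetRank itself does not do.

```vb
' Hypothetical helper: convert the extracted text (e.g. "4,518") to a number.
' Returns 0 when the text is empty or not numeric.
Private Function RankToNumber(ByVal strRank As String) As Long
  RankToNumber = 0
  strRank = Replace(strRank, ",", "")   ' drop thousands separators
  If IsNumeric(strRank) Then RankToNumber = CLng(strRank)
End Function

' Hypothetical test routine: feeds the sample fragment to GetRank
' and prints the numeric result to the Immediate window.
Private Sub TestGetRank()
  Dim strSample As String
  strSample = "<font face=verdana,arial,helvetica size=-1>" & vbCrLf & _
              "<b>Amazon.com Sales Rank: </b>" & vbCrLf & _
              "4,518" & vbCrLf & _
              "</font><br>"
  Debug.Print RankToNumber(GetRank(strSample, "Sales Rank: </b>", "</font>"))
End Sub
```

Calling TestGetRank (for example from the Immediate window while the program is paused) should print 4518. If Amazon changes the markup around the rank, only the two pattern strings need to be adjusted.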

In the version that I use, I added a timer so that the Amazon web page is probed every hour. Whenever the value changes, it is written to a comma-delimited text file for import into Excel for charting. So far, I have reliably recorded the sales ranking every hour for over four months.
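The timer version is not listed here, so the following is only a sketch of how it might look, not the code actually in use. It assumes a Timer control named Timer1 on the same form (Interval set to 60000, Enabled set to True); the routine name ProbeAndLog, the module-level variables, and the log file path are hypothetical. The declarations must go at the top of the form module, before any routines.

```vb
' --- declarations section of the form module ---
Private Const LOG_FILE As String = "C:\rank.csv"  ' hypothetical path
Private mintMinutes As Integer   ' minutes elapsed since the last probe
Private mstrLastRank As String   ' last value written to the log

' --- routines ---
Private Sub Timer1_Timer()
  ' The Timer control's Interval tops out at 65,535 ms, so count
  ' one-minute ticks until an hour has passed.
  mintMinutes = mintMinutes + 1
  If mintMinutes < 60 Then Exit Sub
  mintMinutes = 0
  ProbeAndLog
End Sub

Private Sub ProbeAndLog()
  Dim strPage As String, strRank As String
  Dim intFile As Integer

  ' a failed download simply leaves strPage empty
  On Error Resume Next
  strPage = Inet1.OpenURL("http://www.amazon.com/exec/obidos/ASIN/1558605037/", icString)
  On Error GoTo 0

  ' remove the thousands separator so the rank stays in one CSV field
  strRank = Replace(GetRank(strPage, "Sales Rank: </b>", "</font>"), ",", "")

  ' skip failed probes and unchanged values
  If strRank = "" Or strRank = mstrLastRank Then Exit Sub

  intFile = FreeFile
  Open LOG_FILE For Append As #intFile
  Print #intFile, Format$(Now, "yyyy-mm-dd hh:nn") & "," & strRank
  Close #intFile
  mstrLastRank = strRank
End Sub
```

Each logged line holds a timestamp and the rank, separated by a comma, which Excel can open directly. File errors are not handled in this sketch; a version meant to run unattended for months would want its own error handler around the Open and Print statements.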

I would like to hear about your experiences and enhancements.

- Richard Hackathorn
dick@webfarming.com