• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Can't go on to the next page

shahin

Active Member
Hi there!! Hope you all are well. I got stuck at some point creating a parser. I have already made it and it is running smoothly. However, the problem I'm facing is that I can't find any element triggering to the next page except for clicking the button. As I'm using xmlhttp method, I can't avail that click button option. Any help would be a great relief for me. Thanks in advance. Here is the code:

Code:
Sub ExtractingEmail()
Const url = "http://www.sportfocus.com/comdir/keyword.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim Posts As Object, Post As Object
Dim x As Long, i As Long, Z As Long

x = 2

http.Open "GET", url, False
http.send
html.body.innerHTML = http.responseText

Set Posts = html.getElementsByClassName("dirListingPlus")
    For Each Post In Posts
        On Error Resume Next
        Cells(x, 1) = Post.getElementsByTagName("a")(0).innerText
        Cells(x, 2) = Split(Post.getElementsByTagName("a")(1).href, ":")(1)
        x = x + 1
    Next Post
End Sub
 
Hi !

So it seems it's a bad request or you forgot some argument.
Just observe via a webbrowser inner inspector tool
which request is launched when you click on button and its arguments …
 
Ist page: "http://www.sportfocus.com/comdir/keyword.cfm?cid=11&maj=cricket&ckid=102&min=clubs and Associations&kid=860&key=England"

2nd Page: "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs and Associations&kid=860&key=England"

3rd Page: "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs and Associations&kid=860&key=England"

4th Page: "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs and Associations&kid=860&key=England"

Only change noticed between first and second url is= "keyword" to "morekeywords"

2, 3 and 4 page links are identical.

Most importantly when you copy the 2, 3 or 4 page links and try to go to the page, it says file or directory not found.


Html element for the next button:

Code:
<div align="center">
            <input type="submit" value="Next 20 Results ->" class="buttonlg">
        </div>
 

If there is only same URL between page I just may say :
so it is not possible !

But 'cause that works within webbrowser, I'm very sure
you forgot to inspect parameters which should point to
first item of the page and maybe how many items within page …
 
Dear Marc L, right you are. It is POST method and there are three parameters hidden in the form element. I've tried to rectify my code but it is now parsing only three records which is far less than I was having before. Something I'm doing wrong. The postdata parameter should be filled in the way so that I can parse the whole data cause you know there are two things that might help you give me a hint about what i should do now. 1. rowlimit 2. count.



Code:
Sub ExtractingEmail()
Dim http As Object: Set http = CreateObject("MSXML2.serverXMLHTTP")
Dim html As New HTMLDocument
Dim Posts As Object, Post As Object, postdata As String
Dim x As Long, i As Long, Z As Long

x = 2

postdata = "rowlimit=20&count=21"
http.Open "Post", "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England", False
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.send postdata

html.body.innerHTML = http.responseText
Set Posts = html.getElementsByClassName("dirListingPlus")
    For Each Post In Posts
        Cells(x, 1) = Post.getElementsByTagName("a")(0).innerText
        Cells(x, 2) = Split(Post.getElementsByTagName("a")(1).href, ":")(1)
        x = x + 1
    Next Post
End Sub


The form data elements are:

Code:
<form action="morekeywords.cfm?cid=11&amp;maj=cricket&amp;ckid=102&amp;min=clubs%20and%20Associations&amp;kid=860&amp;key=England" method="post">
        <input type="hidden" name="type" value="Name">
        <input type="hidden" name="rowlimit" value="20">
        <input type="hidden" name="count" value="21">
        <div align="center">
            <input type="submit" value="Next 20 Results ->" class="buttonlg">
        </div>
    </form>


Thanks in advance.
 

So it seems parameters are rowlimit for the items number in page
and count for the page first item …

You can use these parameters within request send command
like in this thread.

Some webpages may need specific request headers to well work …
 
Dear Marc L, I've modified my code a little bit but still no luck. This time it is fetching data (20) of the first page over and over again. Here is the code:

Code:
Sub ExtractingEmail()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim Posts As Object, Post As Object, postdata As String
Dim x As Long, y As Long

x = 2
For y = 1 To 21
    postdata = "type=Name&rowlimit=729&count=" & y
    http.Open "Post", "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England", False
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send postdata
  
    html.body.innerHTML = http.responseText
    Set Posts = html.getElementsByClassName("dirListingPlus")
    For Each Post In Posts
        On Error Resume Next
        Cells(x, 1) = Post.getElementsByTagName("a")(0).innerText
        Cells(x, 2) = Split(Post.getElementsByTagName("a")(1).href, ":")(1)
        x = x + 1
    Next Post
Next y
End Sub
 

As first 20 items page starts from 1
second page starts from 21, third from 41, …

(Logic at child level)
 
Too many not understandable things within your code !

• Why a loop as you read all item ?!
• When you use a flag for items, check at least
if it is common to all needed items !

You can fix it just by reading source code page #2,
inspect for example last item then
check if needed data are in responseText.
If you don't see anything so you have to pilot IE.

But here - as always - it is just a purpose of well observation,
if you can't follow Document Object Model of this webpage,
do not ever use htmlfile object but just extract data
from responseText using VBA basics text function
and like the Logic is at very child level, it's not difficult …
 
Thanks dear Marc L, you are always right. I should have studied the page source attentively. There is always a clue to work with as I got here. You hinted me in the first place, though! This is the first time I read the source code of 2nd page, third page etc. Thanks again. Here is the refined one:

Code:
Sub ExtractingEmail()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim Posts As Object, post As Object, postdata As String
Dim x As Long, y As Long

x = 2
For y = 0 To 5
    postdata = "type=Name&rowlimit=20&count=" & (y * 20) + 1
    http.Open "Post", "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England", False
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send postdata
    html.body.innerHTML = http.responseText
    Set Posts = html.getElementsByClassName("dirListingPlus")
    For Each post In Posts
        On Error Resume Next
        Cells(x, 1) = post.getElementsByTagName("a")(0).innerText
        Cells(x, 2) = Split(post.getElementsByTagName("a")(1).href, ":")(1)
        x = x + 1
    Next post
Next y
End Sub
 

The loop is useless as you can read all data at once,
like in your previous code but without any loop …

Is your code working as expected ? Which data you need to extract ?
 
Right you are. No, the code is not working as i expected. I need to parse the name of 4 pages from 1 to 4. Thanks again for your response.
 
From 2nd page, class name is blank. Elements for class is as below:

Code:
<dl class="">
        <dt><a href="cditem.cfm?NID=10995">Abberton &amp; District Cricket Club</a></dt>
        <dd>
            COLCHESTER<br>  &nbsp;
           
        </dd>
       
    </dl>
 
As I yet wrote …
So it seems parameters are rowlimit for the items number in page
and count for the page first item …
So remove y loop - very useless except to waste time ! - and
update send parameters for the items number to read
and from the first item !

But above all finally apply correct logic as your actual code is looking
for specific items only and not all those you need ‼ As I yet wrote :
You can fix it just by reading source code page #2,
inspect for example last item
So when at least you achieve to check that,
you must finally understand dirListingPlus class name is a no sense ‼
As you could be warned without using On Error Resume Next (a bad use !) …

Edit : you finally checked it and found class name is not a good flag !

As using a class name could be a mess (often !), better is to watch
within responseText if a flag may be an ID, a name or a tag name …

And all your need is doable without using htmlfile, just an easy parsing text …

As a training, follow this thread and try on your side
before cheating with my codes (from post #10) …
 
Found a solution but this is tedious. I can't run the loop but if i input the value of "count" then it scrapes successfully. Basically, I was not getting records of 2nd page, third page and so on cause class elements for the first page and the rest of the pages are different. From second page the class is with no value. So I tried responsetext method to make it right.
Code:
Sub ExtractingEmail()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim postdata As String, str As Variant, y As Long, i As Long
Dim x As Long

x = 2
postdata = "type=Name&rowlimit=20&count=1"
http.Open "Post", "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England", False
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.send postdata

html.body.innerHTML = http.responseText
str = Split(http.responseText, "<dt><a")
y = UBound(str)
    For i = 1 To y
        On Error Resume Next
        Cells(x, 1) = Split(Split(Split(str(i), " href=""")(1), """>")(1), "<")(0)
        Cells(x, 2) = Split(Split(str(i), " href=""mailto:")(1), """")(0)
        x = x + 1
    Next i
End Sub
 
Ha ha ha!!!! Well said, Cheating with your code.:DD Yeah that's right. I tried few times with your code to serve my purpose. Anyways, now I know how to use split function. Btw, do have a look If you are not bothered already, to give me a clue how i can run the loop now.
 
Your parameter is for reading only 20 items but you wrote « for 4 pages » :
so the question is why didn't you mod the items number
according to your need ?!
If a page contains 20 items, how many items for 4 pages ?
My child just solves this arithmetic !

So it seems parameters are rowlimit for the items number
update send parameters for the items number to read
(As reminders …)
 
Dear Marc L, Thanks for the answer. Before modifying my code I tried manipulating with rowlimits but after writing responsetext method I didn't notice that mistake. Now, It is working like magic. However, The issue is clearly solved.
 
Dear Marc L, here it is. The count parameter is just useless. No looping is required. The total data is 729 and it takes 5 seconds or less. Thanks again.
Code:
Sub ExtractingEmail()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim postdata As String, str As Variant, y As Long, i As Long
Dim x As Long

x = 2
postdata = "type=Name&rowlimit=729&count=1"
http.Open "Post", "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs%20and%20Associations&kid=860&key=England", False
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.send postdata

html.body.innerHTML = http.responseText
str = Split(http.responseText, "<dt><a")
y = UBound(str)
    For i = 1 To y
        On Error Resume Next
        Cells(x, 1) = Split(Split(Split(str(i), " href=""")(1), """>")(1), "<")(0)
        Cells(x, 2) = Split(Split(str(i), " href=""mailto:")(1), """")(0)
        x = x + 1
    Next i
End Sub
 
Line #3 : Dim html As New HTMLDocument
Useless as you do not use this object to parse data !
Waste of memory as unused and you forgot to release objects
from memory at procedure ending (see Nothing in VBA help).

And using With statement object variable could be useless too !
(To see in next demonstration …)
Line #13 : html.body.innerHTML = http.responseText
Same as previous !
Line #4 : Dim postdata As String, str As Variant, y As Long, i As Long
str variable is used to load Split result,
can be declared str() as String 'cause all you grab is only text …
Line #8 : postdata = "type=Name&rowlimit=729&count=1"
• A variable to only use it once may be useless,
text can directly be in send codeline …
Except for long constant like an URL for example …
• What occurs if a new item is created ? You will missed it !
Line #14 : str = Split(http.responseText, "<dt><a")
Well done for the flag !
Line #15 : y = UBound(str)
Like previous postdata variable, useless to just use it once !
UBound(str) can directly be in next codeline …
Line #10 : without this header the request fails !

Line #17 : On Error Resume Next
To avoid within a loop, you could place it just before …

But can be removed from code !
Just by checking items number within a Split result variable,
to see in next demonstration …
 
As reminders before writing any codeline,
well observe how webpage works,
if any request is used when clicking a button,
if the request has parameters,
if the request needs specific headers,
checking if data are in request response, …​

Respecting this will save your time !

Demonstration to read all items without On Error statement
using classic historic via romana way :​
Code:
Sub Demo1()
         Const URL = "http://www.sportfocus.com/comdir/morekeywords.cfm?cid=11&maj=cricket&ckid=102&min=clubs and Associations&kid=860&key=England"
    With New XMLHTTP60
        .Open "POST", URL, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .setRequestHeader "DNT", "1"
        .send "type=name&rowlimit=999&count=1"
         SPQ = Split(.responseText, "<dt><a href=")
    End With
    If UBound(SPQ) > 0 Then
       ReDim VA$(1 To UBound(SPQ), 1)
        For R& = 1 To UBound(SPQ)
                                       SP = Split(SPQ(R), "<a href=""mailto:")
                                 VA(R, 0) = Split(Split(SP(0), ">")(1), "<")(0)
          If UBound(SP) > 0 Then VA(R, 1) = Split(SP(1), """")(0)
        Next
            ActiveSheet.UsedRange.Offset(1).Clear
            [A2].Resize(UBound(VA), 2).Value = VA
    End If
End Sub
Note some webpage request may not accept 999 parameter value
as above the real items number or superior to a max limit …
 
OMG!!! It's really an explicit explanation. I will follow it from now on. The thing is that I'm not from an IT background and what I write I learned from different tutorials on the web. So, my basic is not that strong in such technical things. However, better late than never and I'm learning while mistaking. That's it.
Btw, should I set http=nothing in the end?
 
Back
Top