1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Trouble Parsing Image and Price

Discussion in 'VBA Macros' started by shahin, Mar 5, 2017.

  1. shahin

    shahin Member

    Messages:
    193
    Hi there! I've made a parser which is pulling 3 categories excluding Image and price out of five categories. I could not find any "img" tag there. Moreover the page contains lots of data but only 25 are being scraped. The page is loading slowly to display the full content though! Here is the code I've written. Any help would be a great help to me. Thanks in advance.

    Code (vb):

    Sub VariousWines()
    Const URL = "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim topics As Object, posts As Object, topic As HTMLHtmlElement, post As HTMLHtmlElement
    Dim i As Long, x As Long

    x = 2

    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.responseText

    Set topics = html.getElementsByClassName("card card-lg")

        For i = 0 To topics.Length - 1
        Set topic = topics(i)
            Cells(x, 1).Value = topic.getElementsByClassName("bold")(0).innerText
            Cells(x, 2).Value = topic.getElementsByClassName("text-block wine-card__region")(0).innerText
            Cells(x, 3).Value = topic.getElementsByClassName("text-inline-block light average__number")(0).innerText
            Cells(x, 4).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
            On Error Resume Next
            Cells(x, 5).Value = topic.getElementsByClassName("wine-card__image")(0).getElementsByTagName("img")(0).src
            x = x + 1
        Next i
     
    End Sub
     
    Last edited: Mar 5, 2017
  2. shahin

    shahin Member

    Messages:
    193
    No "img", "src" I could find in the element. FYC, html elements for the Image portion is given below:

    Code (vb):

    <div class="wine-card__image-wrapper">
    <a href="/wineries/tschida/wines/angerhof-eiswein-gruner-veltliner-2012">
    <figure class="wine-card__image" style="background-image: url(//images.vivino.com/thumbs/qlER3oggQVKh1FZn7YGxZg_375x500.jpg)">
    <div class="image-inner"></div>
    </figure>
    </a>
    </div>
     
  3. Marc L

    Marc L Excel Ninja

    Messages:
    2,911

    Hi !

    So just mod your code to use the correct class name …
  4. shahin

    shahin Member

    Messages:
    193
    Hi Marc L, nice to find you in the loop. If I could understand you properly then you meant this. It is not working either.

    Code (vb):

    Set topics = html.getElementsByClassName("card card-lg")

        For i = 0 To topics.Length - 1
        Set topic = topics(i)
            Cells(x, 1).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
            Cells(x, 2).Value = topic.getElementsByClassName("wine-card__image-wrapper")(0).getElementsByTagName("img")(0).src
            x = x + 1
        Next i
     
  5. Marc L

    Marc L Excel Ninja

    Messages:
    2,911

    So like it does not work on your side, you must change your strategy !

    As per example search for a tag name …
  6. shahin

    shahin Member

    Messages:
    193
    Thanks Marc L for your kind reply. In case of image I have clearly messed up with my code but what about "The Price" portion? Have I not rightly written it?

    Code (vb):

    Set topics = html.getElementsByClassName("wine-price-value")
    For Each topic In topics
        Cells(x, 1).Value = topic.innerText
        x = x + 1
    Next topic
     
  7. shahin

    shahin Member

    Messages:
    193
    FYC, Here is the html elements for the price:
    Code (vb):

    <div class="text-inline-block header-large light wine-price average__number">
    <span class="wine-price-prefix">$</span>
    <span class="wine-price-value">29.98</span>
    <span class="wine-price-suffix"></span>
    </div>
     
  8. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,012
    For image... since URL is within style element in Class, easiest way is to get innerHTML of the wrapper and do string manipulation.
    Ex. To extract entire string in Style="xxx"
    Code (vb):
    Cells(x, 5).Value = Split(Split(topic.getElementsByClassName("wine-card__image-wrapper")(0).innerHTML, "style=""")(1), """>")
    As for price. It's likely filled by script. When page is loaded in IE, values are updated via script. However, responseText will not fire script to fill those values. Thus they are empty.

    Quick and dirty workaround is to load responseText to IE and read values from there.
  9. Marc L

    Marc L Excel Ninja

    Messages:
    2,911


    As always observing how webpage works before to start any code …​
  10. Chihiro

    Chihiro Well-Known Member

    Messages:
    3,012
    Yep. That's usually key to scraping web sites. Not so much VBA skill.

    @shahin
    You can see in image how price info is brought in.
    upload_2017-3-6_11-27-42.png
    shahin likes this.
  11. shahin

    shahin Member

    Messages:
    193
    Thanks sir Chihiro for the demo. Btw, I have got the basic portion from some NATHAN and edited the code for working nicely to parse image. Here it is:

    Code (vb):

    Set topics = html.getElementsByClassName("wine-card__image-wrapper")
    For Each topic In topics
     Cells(x, 1) = "http:" & Replace(Replace(topic.Children(0).Children(0).Style.backgroundImage, "url(", ""), ")", "")
     x = x + 1
    Next topic
     
  12. shahin

    shahin Member

    Messages:
    193
    Another chunk of html elements out of which I can't sift through the Img and src to get the Image. Here is the portion:

    Code (vb):


    <div class="swipe" style="visibility: visible;"><div class="swipe-wrap" style="width: 1500px;"><div data-index="0" style="width: 300px; left: 0px; transition-duration: 0ms; transform: translate(0px, 0px);"><img alt=""class="" src="https://images.craigslist.org/00l0l_hW1X6OeifVZ_300x300.jpg"></div><div data-index="1" style="width: 300px; left: -300px; transition-duration: 0ms;
    transform: translate(300px, 0px);"
    ></div><div data-index="2" style="width: 300px; left: -600px; transition-duration: 0ms; transform: translate(300px, 0px);"></div><div data-index="3" style="width: 300px; left: -900px; transition-duration: 0ms; transform: translate(300px, 0px);"></div>'<div data-index="4" style="width: 300px; left: -1200px; transition-duration: 0ms; transform: translate(-300px, 0px);"></div></div></div>

     
  13. shahin

    shahin Member

    Messages:
    193
    Ain't it possible to parse images using REGEX in this case? Code and html elements are pasted below:
    Code:
    Code (vb):

    Set topics = html.getElementsByClassName("result-image gallery")
    For Each topic In topics
         Cells(x,1)= topic.getElementsByTagName("img")(0).src
         x = x +1
    Next topic
     
    Elements:
    Code (vb):

    <a href="/mnh/atq/6033903864.html"class="result-image gallery" data-ids="1:00l0l_auIVAPKuweh"><img alt=""class="" src="https://images.craigslist.org/00l0l_auIVAPKuweh_300x300.jpg"><span class="result-price">$120</span></a>
     
  14. shahin

    shahin Member

    Messages:
    193
    Dear Marc L, you must be right with what you said but the thing is that I am not good at understanding your point oftentimes because of my incapability. However, in this case even if I use the basic code to get images from a page, it fetches only two "gif" files but not a single "jpg". I tried with:

    PageLink: https://newyork.craigslist.org/search/ata

    Code (vb):

    Set topics = html.getElementsByTagName("img")
    For Each topic In topics
        Cells(x, 1) = topic.src
        x = x + 1
    Next topic
     
  15. Marc L

    Marc L Excel Ninja

    Messages:
    2,911

    I deleted my previous code thinking I was wrong 'cause of late binding
    for objects but trying now early binding I get same result :
    getElementsByClassName collection is always empty …

    Post a complete code, it's far easier to try to follow you !
    And indicating added references.

    As a reminder : a valid code for a webpage often is not on another one !
    At each new webpage, you must start from blank and study its html code …
    shahin likes this.
  16. shahin

    shahin Member

    Messages:
    193
    @Marc L,
    Hi!! Terribly sorry to say that I was not around my pc that is why I got late to respond. I might have mistaken choosing the right class. Anyways, here is the code :

    Code (vb):

    Sub CraigImg()
    Const URL = "https://newyork.craigslist.org/search/ata"
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
    Dim topics As Object, topic As Object
    Dim x As Long
    x = 2
    http.Open "GET", URL, False
    http.send
    html.body.innerHTML = http.responseText
    Set topics = html.getElementsByClassName("result-image gallery")
        For Each topic In topics
            Cells(x, 1) = topic.getElementsByTagName("img")(0).src
            x = x + 1
        Next topic
    End Sub
     
  17. Marc L

    Marc L Excel Ninja

    Messages:
    2,911


    Does this code work on your side ?​
  18. shahin

    shahin Member

    Messages:
    193
    Nope. Not at all.
  19. shahin

    shahin Member

    Messages:
    193
  20. Marc L

    Marc L Excel Ninja

    Messages:
    2,911

    So save responseText to a text file and check it over your code
    - or just use InStr function or Like operator -
    it will be easier to understand !

    Once understood, well read Chihiro's post #8 …
  21. shahin

    shahin Member

    Messages:
    193
    Last edited: Mar 20, 2017
  22. Marc L

    Marc L Excel Ninja

    Messages:
    2,911


    So now you can add images as hyperlinks on cells …​
  23. shahin

    shahin Member

    Messages:
    193
    Yep, successfully.
  24. shahin

    shahin Member

    Messages:
    193
    It was hard to scrape "price" from "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4=" this site until I find selenium webdriver. It scrapes the "Image" and "Price" very smoothly. Here is the code:
    Before running this code make sure "seleniumWrapper type library" is checked in the reference library.

    Code (vb):

    Sub ParsingVivino()
    Dim driver As SeleniumWrapper.WebDriver
    Dim posts As Object, post As Object

    Set driver = New SeleniumWrapper.WebDriver
    driver.Start "Phantomjs", "https://www.vivino.com"
    driver.get "/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
    Set posts = driver.findElementsByClassName("card-lg")

    For Each post In posts
        i = i + 1
        Cells(i, 1) = post.findElementByClassName("link-color-alt-grey").Text
        Cells(i, 2) = post.findElementByClassName("wine-price-value").Text
        Cells(i, 3) = Split(Split(post.findElementByClassName("wine-card__image").getAttribute("style"), "background-image: url(")(1), ");")(0)
    Next post
    Set driver = Nothing: Set posts = Nothing
    End Sub

     
    Last edited: Apr 2, 2017
  25. shahin

    shahin Member

    Messages:
    193
    It also solves the issue that I was facing while parsing Images from craigslist. Here is the code:
    Code (vb):

    Sub CraigslistImage()
    Dim driver As SeleniumWrapper.WebDriver
    Dim posts As Object, post As Object

    Set driver = New SeleniumWrapper.WebDriver
    driver.Start "Phantomjs", "https://newyork.craigslist.org/search"
    driver.get "/ata"
    Set posts = driver.findElementsByClassName("swipe-wrap")

    For Each post In posts
        i = i + 1
        Cells(i, 1) = post.findElementByTagName("img").getAttribute("src")
    Next post
    Set driver = Nothing: Set posts = Nothing
    End Sub
     

Share This Page