Web Scraping

Ramesh Deo · Jul 19, 2016

Hi All,

Hope you doing well and appreciate any help in web scrapping. I have to find out published and not published articles by their quotes which is given in attached example file on https://finance.yahoo.com/quote/DTE and https://beta.finance.yahoo.com/quote/DTE as well.

Here DTE is the quote which is in column B. and the method is to open this webpage for these quotes and then find their articles which is in column A and these articles is tagged in both webpage as hyperlink. for example the quote DTE given in above link and now i need to find the article published on https://finance.yahoo.com/quote/DTE and https://beta.finance.yahoo.com/quote/DTE
"DTE Energy Is Expected to Report Lower Sales in 2Q16"
here is screenshot attached for reference below.

same in Yahoo beta.

Now I want result in column C if these articles tagged on both as "Available" for their quotes which I have. if not tagged then result would be their quotes in column C which is in column B and so on so forth.

Note sometime the articles tagged under summary and sometimes its under News because its depend on quotes where its tagged.
here is screenshot for reference.

I tried to do this but I failed. Here is my code below.

Sub dataextract_yahoo()
'the_start:
On Error Resume Next
Dim ie As New InternetExplorer
Dim i As Long
Dim lr As Long
Dim quotes As String
Dim tst As Variant
'Dim ydate As String
Dim article As String

lr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Row + 1

i = 2
quotes = Sheet1.Range("B" & i).Value
article = Sheet1.Range("A" & i).Value

ie.Visible = True

'On Error Resume Next
ie.navigate "http://finance.yahoo.com/quote/" & quotes

Do
DoEvents
' If Err.Number <> 0 Then
' ie.Quit
' Set ie = Nothing
' GoTo the_start
' End If

Loop Until ie.readyState = READYSTATE_COMPLETE
'Application.StatusBar = " loading website.."

Dim doc As HTMLDocument
Dim output As Object
Dim tst As Variant
Set doc = ie.document

Set output = doc.getElementsByTagName("div")

For Each tst In output
If tst.innerText = article Then
Sheet1.Range("C" & lr).Value = "Available"
Else
Sheet1.Range("C" & lr).Value = quotes
End If
Next tst
ie.quite
i = i + 1
End Sub

Thanks for any help!!

Chihiro · Jul 19, 2016

Instead of navigating IE. Use XML to send "GET" request.
You can then search for string within responseText to find what you are looking for.

See below for tutorial. I find this site to be one of the best web scraping resource for Excel users.

http://analystcave.com/web-scraping-tutorial/

Chihiro · Jul 19, 2016

Not sure if understand your requirement completely. But to get at info you want, first, view-source of the page and locate the string you are interested in.

You should find following.

Code:

<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid=".owk0jbkxz4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.0.0.1.0.$newsStream.$stream.1.0:$list-0.$item-0.0.4.3.0.1">DTE Energy Is Expected to Report Lower Sales in 2Q16</div>

Now you will need to compare with few others like it and find a pattern you can use to extract the info you want.

See attached sample code. What I used to extract Prev Close:Rate for currency from Yahoo Finance.

Code is 2 parts. 1st part to send Get request to server and 2nd part to manipulate returned response to extract the info.

Ramesh Deo · Jul 19, 2016

Chihiro said:
Not sure if understand your requirement completely. But to get at info you want, first, view-source of the page and locate the string you are interested in.

You should find following.

Code:

<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid=".owk0jbkxz4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.0.0.1.0.$newsStream.$stream.1.0:$list-0.$item-0.0.4.3.0.1">DTE Energy Is Expected to Report Lower Sales in 2Q16</div>

Now you will need to compare with few others like it and find a pattern you can use to extract the info you want.

See attached sample code. What I used to extract Prev Close:Rate for currency from Yahoo Finance.

Code is 2 parts. 1st part to send Get request to server and 2nd part to manipulate returned response to extract the info.

Hi Chihiro, Thanks for responding! I need to find exactly as you mentioned in this thread. Thanks for sharing example file to compare with this code. Am trying to resolve.

Thanks a lot.

Marc L · Jul 20, 2016

Hi Chihiro !

Chihiro said:
See attached sample code.

Two points :

• Create object request variable once before the loop !
It may be faster than re-create it at each iteration …

• Free variable object after the loop …

Code:

Sub Demo()
        Const URL = "http://finance.yahoo.com/q?s=", DL = "Prev Close:</th><td class=""yfnc_tabledata1"">"
        Dim oReq As Object, R&
        Set oReq = CreateObject("MSXML2.serverXMLHTTP")
    With Sheet1.Cells(1).CurrentRegion.Rows
            .Range("C2", .Cells(.Count, 3)).ClearContents
        For R = 2 To .Count
            oReq.Open "GET", URL & .Cells(R, 1).Text & .Cells(R, 2).Text & "=X"
            oReq.setrequestHeader "DNT", "1"
            oReq.Send
            .Cells(R, 3).Value = Val(Split(oReq.responseText, DL)(1))
        Next
    End With
        Set oReq = Nothing
End Sub

Chihiro · Jul 20, 2016

Thanks Marc for pointing out improvement area.

Ramesh Deo · Jul 20, 2016

Hi Mark,

Thanks for responding!
This is an example to pull a fixed html text but in my requirement there is no fixed number where the articles are tagged.

Chihiro · Jul 20, 2016

Hmm? You should be able to search for div class.
For an example, below portion remains consistent for latest News (first item).

Code:

<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid="

If you need to go through different tabs (in a page) you may need to delve deeper into coding. See image below for example, when you have "Press Releases" active instead of "Latest News".

Marc L · Jul 20, 2016

It seems that needs to first load responseText in an HTML file,
then scan the innerText of a root ID like in this thread …

Ramesh Deo · Jul 21, 2016

Marc L said:
It seems that needs to first load responseText in an HTML file,
then scan the innerText of a root ID like in this thread …

Hi Mark,
Still I am able not to track the inner text where its tagged because there is no unique id to find the exact article for that ticker. I tried to do it but failed.
I would so much appreciate it if you give an example with my requirement.

Thank you

Marc L · Jul 21, 2016

I took a glance : there is always a parent ID
but only the 3 first articles are in the responseText !

So you may have to inspect deeper the HTMLfile object
to find out where others are hidden …

But its seems with a webbrowser there are called during the progression
of the vertical scrollbar by some JScript requests :
so it may be very difficult to trap / check out these articles !

Ramesh Deo · Jul 21, 2016

Marc L said:
I took a glance : there is always a parent ID
but only the 3 first articles are in the responseText !

So you may have to inspect deeper the HTMLfile object
to find out where others are hidden …

But its seems with a webbrowser there are called during the progression
of the vertical scrollbar by some JScript requests :
so it may be very difficult to trap / check out these articles !

Hi Mark, Thanks for looking into this! These articles are of 3 days old. I need to trap only 1 days back articles. for example I need to find out today articles of 19th July. so I think these articles will be in responseText.

Here I have attached the file of 1 days back articles with their tickers. please have look and appreciate if modify the code.

Thanks!

Chihiro · Jul 21, 2016

Try this code.

Code:

Sub TestCheck()
Dim oReq As Object
Dim url As String, wText As String
Dim i As Integer, lRow As Integer

Set oReq = CreateObject("MSXML2.XMLHttp")

lRow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 2 To lRow
url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value
        With oReq
            .Open "POST", url & "/news", False
            .setRequestHeader "Content-Type", "text/xml"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
            .send

            wText = CStr(.responseText)
            If InStr(1, wText, Cells(i, 1).Value) > 0 Then
                Cells(i, 4).Value = "Tagged"
            Else
                Cells(i, 4).Value = "Not Tagged"
            End If
        End With
Next
Set oReq = Nothing
End Sub

See attached demo.

FYI - A5 had wrong char. You had "'" (Char(39)) instead of "’" (Char(146)).

Ramesh Deo · Jul 21, 2016

Chihiro said:

Try this code.

Code:

Sub TestCheck()
Dim oReq As Object
Dim url As String, wText As String
Dim i As Integer, lRow As Integer

Set oReq = CreateObject("MSXML2.XMLHttp")

lRow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 2 To lRow
url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value
        With oReq
            .Open "POST", url & "/news", False
            .setRequestHeader "Content-Type", "text/xml"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
            .send

            wText = CStr(.responseText)
            If InStr(1, wText, Cells(i, 1).Value) > 0 Then
                Cells(i, 4).Value = "Tagged"
            Else
                Cells(i, 4).Value = "Not Tagged"
            End If
        End With
Next
Set oReq = Nothing
End Sub

See attached demo.

FYI - A5 had wrong char. You had "'" (Char(39)) instead of "’" (Char(146)).

Hi Chihiro,
Thank you so much!

Its working for only in case of the articles tagged under summary but if the articles tagged under news then the result brings as not tagged. For Example I need to find the article of ticker GXC. then url will be changed as http://finance.yahoo.com/quote/GXC/news, then we can find the article against that ticker.
I tried to modify in the code url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value &"/news" instead of url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value because I observed that if I do it mannualy the url will be http://finance.yahoo.com/quote/GXC/news for that articles which is tagged under news, thereafter if i have the article which is tagged under summury then I need to simply put MO in place of GXC in url, it will autometically brings summary page. however its brings error.

I have attached again for the reference. please have a look.

Chihiro · Jul 21, 2016

Actually, if you look at my code. It's hard coded with "/news".

What you can do is second IF statement within the existing IF statement.

Basically, if first check failed to find Instr()>0 you send another "POST" request with new URL string.

I'm bit busy at the moment. But if you need further help, let me know.

Chihiro · Jul 21, 2016

Ah, I see where the issue is. You are using Char(39) instead of Char(146) in each of those that "Not Tagged" was returned when it should be "Tagged".

What I noticed when looking at the responseText, is that it has no relevance whether "/News" is added or not.

I modified code to replace Char(39) with Char(146) automatically.

Also, I believe you have wrong ticker for row 6. I believe it should be AMTD, not SPY.

See attached.

Note: You may want to create some sort of verification or trapping that Ticker and name are what it's supposed to be.

Ramesh Deo · Jul 21, 2016

Chihiro said:
Ah, I see where the issue is. You are using Char(39) instead of Char(146) in each of those that "Not Tagged" was returned when it should be "Tagged".

What I noticed when looking at the responseText, is that it has no relevance whether "/News" is added or not.

I modified code to replace Char(39) with Char(146) automatically.

Also, I believe you have wrong ticker for row 6. I believe it should be AMTD, not SPY.

See attached.

Note: You may want to create some sort of verification or trapping that Ticker and name are what it's supposed to be.

Aha! It working now. Thank you so much for resolving this issue.

Web Scraping

Member

Attachments

Excel Ninja

Excel Ninja

Attachments

Member

Excel Ninja

Excel Ninja

Member

Excel Ninja

Excel Ninja

Member

Excel Ninja

Member

Attachments

Excel Ninja

Attachments

Member

Attachments

Excel Ninja

Excel Ninja

Attachments

Member