• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Web Scraping

Hi All,

Hope you doing well and appreciate any help in web scrapping. I have to find out published and not published articles by their quotes which is given in attached example file on https://finance.yahoo.com/quote/DTE and https://beta.finance.yahoo.com/quote/DTE as well.

Here DTE is the quote which is in column B. and the method is to open this webpage for these quotes and then find their articles which is in column A and these articles is tagged in both webpage as hyperlink. for example the quote
DTE given in above link and now i need to find the article published on https://finance.yahoo.com/quote/DTE and https://beta.finance.yahoo.com/quote/DTE
"DTE Energy Is Expected to Report Lower Sales in 2Q16"
here is screenshot attached for reference below.

upload_2016-7-19_5-27-0.png


same in Yahoo beta.

Now I want result in column C if these articles tagged on both as "Available" for their quotes which I have. if not tagged then result would be their quotes in column C which is in column B and so on so forth.

Note sometime the articles tagged under summary and sometimes its under News because its depend on quotes where its tagged.
here is screenshot for reference.

upload_2016-7-19_6-12-14.png


I tried to do this but I failed. Here is my code below.

Sub dataextract_yahoo()
'the_start:
On Error Resume Next
Dim ie As New InternetExplorer
Dim i As Long
Dim lr As Long
Dim quotes As String
Dim tst As Variant
'Dim ydate As String
Dim article As String



lr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Row + 1

i = 2
quotes = Sheet1.Range("B" & i).Value
article = Sheet1.Range("A" & i).Value




ie.Visible = True

'On Error Resume Next
ie.navigate "http://finance.yahoo.com/quote/" & quotes

Do
DoEvents
' If Err.Number <> 0 Then
' ie.Quit
' Set ie = Nothing
' GoTo the_start
' End If


Loop Until ie.readyState = READYSTATE_COMPLETE
'Application.StatusBar = " loading website.."


Dim doc As HTMLDocument
Dim output As Object
Dim tst As Variant
Set doc = ie.document

Set output = doc.getElementsByTagName("div")


For Each tst In output
If tst.innerText = article Then
Sheet1.Range("C" & lr).Value = "Available"
Else
Sheet1.Range("C" & lr).Value = quotes
End If
Next tst
ie.quite
i = i + 1
End Sub

Thanks for any help!!
 

Attachments

  • yahoo data test.xlsx
    9.9 KB · Views: 3
Not sure if understand your requirement completely. But to get at info you want, first, view-source of the page and locate the string you are interested in.

You should find following.

Code:
<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid=".owk0jbkxz4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.0.0.1.0.$newsStream.$stream.1.0:$list-0.$item-0.0.4.3.0.1">DTE Energy Is Expected to Report Lower Sales in 2Q16</div>

Now you will need to compare with few others like it and find a pattern you can use to extract the info you want.

See attached sample code. What I used to extract Prev Close:Rate for currency from Yahoo Finance.

Code is 2 parts. 1st part to send Get request to server and 2nd part to manipulate returned response to extract the info.
 

Attachments

  • Currency_YahooFinance.xlsm
    18 KB · Views: 10
Not sure if understand your requirement completely. But to get at info you want, first, view-source of the page and locate the string you are interested in.

You should find following.

Code:
<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid=".owk0jbkxz4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.0.0.1.0.$newsStream.$stream.1.0:$list-0.$item-0.0.4.3.0.1">DTE Energy Is Expected to Report Lower Sales in 2Q16</div>

Now you will need to compare with few others like it and find a pattern you can use to extract the info you want.

See attached sample code. What I used to extract Prev Close:Rate for currency from Yahoo Finance.

Code is 2 parts. 1st part to send Get request to server and 2nd part to manipulate returned response to extract the info.

Hi Chihiro, Thanks for responding! I need to find exactly as you mentioned in this thread. Thanks for sharing example file to compare with this code. Am trying to resolve.

Thanks a lot.
 
Hi Chihiro !​
See attached sample code.
Two points :

• Create object request variable once before the loop !
It may be faster than re-create it at each iteration …

• Free variable object after the loop …​
Code:
Sub Demo()
        Const URL = "http://finance.yahoo.com/q?s=", DL = "Prev Close:</th><td class=""yfnc_tabledata1"">"
        Dim oReq As Object, R&
        Set oReq = CreateObject("MSXML2.serverXMLHTTP")
    With Sheet1.Cells(1).CurrentRegion.Rows
            .Range("C2", .Cells(.Count, 3)).ClearContents
        For R = 2 To .Count
            oReq.Open "GET", URL & .Cells(R, 1).Text & .Cells(R, 2).Text & "=X"
            oReq.setrequestHeader "DNT", "1"
            oReq.Send
            .Cells(R, 3).Value = Val(Split(oReq.responseText, DL)(1))
        Next
    End With
        Set oReq = Nothing
End Sub
 
Hi Mark,

Thanks for responding!
This is an example to pull a fixed html text but in my requirement there is no fixed number where the articles are tagged.
 
Hmm? You should be able to search for div class.
For an example, below portion remains consistent for latest News (first item).
Code:
<div class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Pos(r) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 C(#000) C(#0078ff):h" data-reactid="

If you need to go through different tabs (in a page) you may need to delve deeper into coding. See image below for example, when you have "Press Releases" active instead of "Latest News".

upload_2016-7-20_10-49-34.png
 
It seems that needs to first load responseText in an HTML file,
then scan the innerText of a root ID like in this thread


Hi Mark,
Still I am able not to track the inner text where its tagged because there is no unique id to find the exact article for that ticker. I tried to do it but failed.
I would so much appreciate it if you give an example with my requirement.

Thank you
 
I took a glance : there is always a parent ID
but only the 3 first articles are in the responseText !

So you may have to inspect deeper the HTMLfile object
to find out where others are hidden …

But its seems with a webbrowser there are called during the progression
of the vertical scrollbar by some JScript requests :
so it may be very difficult to trap / check out these articles !
 
I took a glance : there is always a parent ID
but only the 3 first articles are in the responseText !

So you may have to inspect deeper the HTMLfile object
to find out where others are hidden …

But its seems with a webbrowser there are called during the progression
of the vertical scrollbar by some JScript requests :
so it may be very difficult to trap / check out these articles !

Hi Mark, Thanks for looking into this! These articles are of 3 days old. I need to trap only 1 days back articles. for example I need to find out today articles of 19th July. so I think these articles will be in responseText.

Here I have attached the file of 1 days back articles with their tickers. please have look and appreciate if modify the code.

Thanks!
 

Attachments

  • yahoo data.xlsm
    22.7 KB · Views: 5
Try this code.
Code:
Sub TestCheck()
Dim oReq As Object
Dim url As String, wText As String
Dim i As Integer, lRow As Integer

Set oReq = CreateObject("MSXML2.XMLHttp")

lRow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 2 To lRow
url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value
        With oReq
            .Open "POST", url & "/news", False
            .setRequestHeader "Content-Type", "text/xml"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
            .send

            wText = CStr(.responseText)
            If InStr(1, wText, Cells(i, 1).Value) > 0 Then
                Cells(i, 4).Value = "Tagged"
            Else
                Cells(i, 4).Value = "Not Tagged"
            End If
        End With
Next
Set oReq = Nothing
End Sub

See attached demo.

FYI - A5 had wrong char. You had "'" (Char(39)) instead of "’" (Char(146)).
 

Attachments

  • yahoo data_Sample.xlsm
    23.6 KB · Views: 8
Try this code.
Code:
Sub TestCheck()
Dim oReq As Object
Dim url As String, wText As String
Dim i As Integer, lRow As Integer

Set oReq = CreateObject("MSXML2.XMLHttp")

lRow = Cells(Rows.Count, 1).End(xlUp).Row
For i = 2 To lRow
url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value
        With oReq
            .Open "POST", url & "/news", False
            .setRequestHeader "Content-Type", "text/xml"
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
            .send

            wText = CStr(.responseText)
            If InStr(1, wText, Cells(i, 1).Value) > 0 Then
                Cells(i, 4).Value = "Tagged"
            Else
                Cells(i, 4).Value = "Not Tagged"
            End If
        End With
Next
Set oReq = Nothing
End Sub

See attached demo.

FYI - A5 had wrong char. You had "'" (Char(39)) instead of "’" (Char(146)).

Hi Chihiro,
Thank you so much!

Its working for only in case of the articles tagged under summary but if the articles tagged under news then the result brings as not tagged. For Example I need to find the article of ticker GXC. then url will be changed as http://finance.yahoo.com/quote/GXC/news, then we can find the article against that ticker.
I tried to modify in the code url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value &"/news" instead of url = "https://finance.yahoo.com/quote/" & Cells(i, 2).Value because I observed that if I do it mannualy the url will be http://finance.yahoo.com/quote/GXC/news for that articles which is tagged under news, thereafter if i have the article which is tagged under summury then I need to simply put MO in place of GXC in url, it will autometically brings summary page. however its brings error.

I have attached again for the reference. please have a look.
 

Attachments

  • yahoo data_Sample.xlsm
    20.5 KB · Views: 1
Actually, if you look at my code. It's hard coded with "/news".

What you can do is second IF statement within the existing IF statement.

Basically, if first check failed to find Instr()>0 you send another "POST" request with new URL string.

I'm bit busy at the moment. But if you need further help, let me know.
 
Ah, I see where the issue is. You are using Char(39) instead of Char(146) in each of those that "Not Tagged" was returned when it should be "Tagged".

What I noticed when looking at the responseText, is that it has no relevance whether "/News" is added or not.

I modified code to replace Char(39) with Char(146) automatically.

Also, I believe you have wrong ticker for row 6. I believe it should be AMTD, not SPY.

See attached.

Note: You may want to create some sort of verification or trapping that Ticker and name are what it's supposed to be.
 

Attachments

  • yahoo data_Sample (1).xlsm
    21.4 KB · Views: 9
Ah, I see where the issue is. You are using Char(39) instead of Char(146) in each of those that "Not Tagged" was returned when it should be "Tagged".

What I noticed when looking at the responseText, is that it has no relevance whether "/News" is added or not.

I modified code to replace Char(39) with Char(146) automatically.

Also, I believe you have wrong ticker for row 6. I believe it should be AMTD, not SPY.

See attached.

Note: You may want to create some sort of verification or trapping that Ticker and name are what it's supposed to be.

Aha! It working now. Thank you so much for resolving this issue.
 
Back
Top