Grabbing Tables in Webpages Using the XML Package
https://yihui.org/en/2010/10/grabbing-tables-in-webpages-using-the-xml-package/
Guest *Larry (IEOR Tools)* @ 2010-10-25 16:54:08 originally posted:
Great post Yihui. I too made a similar post about using XML to grab table data.
http://industrialengineertools.blogspot.com/2010/08/ieor-tools-tutorial-learning-xml-with-r.html
I wish to increase my knowledge of this package. XML with R is great!
That's pretty interesting. Thanks!
Originally posted on 2010-10-25 17:39:32
Guest *Tieming* @ 2010-10-30 02:47:09 originally posted:
Good post. I found it to be useful.
Guest *Susan* @ 2010-10-30 19:19:51 originally posted:
I've been having trouble getting the "names" rows out when there are spanning elements - it appears R reads in each cell without regard to how many columns it takes up, and then puts "NA" in for any remaining empty cells. Any ideas as to how to make that work? It gets particularly bad when "rowspan" and "colspan" in the html are both not equal to 1.
Tables with rowspan or colspan are certainly not straightforward to deal with, and I don't have an immediate solution either. What I can think of is going back to the "stone age" and manipulating the text directly. This is a rather bad idea and difficult to generalize (an approach that works for one page may not work for another). Sorry.
Originally posted on 2010-11-01 19:46:36
Guest *Susan* @ 2010-11-14 01:30:11 originally posted:
OK, but is there no way for `readHTMLTable()` to use rowspan and colspan when it builds its tables? I'm not sure how to get a moderately flexible method for doing that, as the tables I'm reading change quite a bit. It's really very frustrating. Do you know whether there might be a way to read the tables using regular expressions?
Regular expressions might help, and I am not sure if XML is a better way. I feel regular expressions are a "darker" way... Maybe you can try a hybrid: use XML to extract the table elements and use regular expressions to deal with `(row|col)span`. I also suggest you write to Duncan Temple Lang for his wisdom.
Originally posted on 2010-11-14 04:23:01
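The hybrid idea in the reply above can be sketched with the XML package alone, by reading the `rowspan` and `colspan` attributes of each cell and repeating the cell text across the slots it covers. This is only a minimal sketch, not a feature of `readHTMLTable()`: the function name `expand_span_table` and the sample HTML are made up for illustration, and it assumes a simple table with no nested tables.

```r
library(XML)

# Expand a <table> node into a character matrix, repeating the text of each
# cell across all rows and columns that the cell spans.
expand_span_table <- function(table_node) {
  rows <- getNodeSet(table_node, ".//tr")
  grid <- list()  # grid[[i]] holds row i; NA marks a slot that is still free

  for (i in seq_along(rows)) {
    cells <- getNodeSet(rows[[i]], "./td | ./th")
    if (length(grid) < i) grid[[i]] <- character(0)
    col <- 1
    for (j in seq_along(cells)) {
      cell <- cells[[j]]
      # skip slots already filled by a rowspan coming from an earlier row
      while (col <= length(grid[[i]]) && !is.na(grid[[i]][col])) col <- col + 1
      rs <- max(1L, as.integer(xmlGetAttr(cell, "rowspan", "1")), na.rm = TRUE)
      cs <- max(1L, as.integer(xmlGetAttr(cell, "colspan", "1")), na.rm = TRUE)
      txt <- trimws(xmlValue(cell))
      for (r in i + seq_len(rs) - 1) {
        if (length(grid) < r) grid[[r]] <- character(0)
        grid[[r]][col + seq_len(cs) - 1] <- txt
      }
      col <- col + cs
    }
  }

  # pad every row with NA to the same width and stack the rows into a matrix
  n_col <- max(vapply(grid, length, integer(1)))
  do.call(rbind, lapply(grid, function(x) { length(x) <- n_col; x }))
}

# Made-up example with both a rowspan and a colspan in the header
html <- '<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)
m <- expand_span_table(getNodeSet(doc, "//table")[[1]])
```

Here `m` is a 3 x 3 character matrix in which the spanned header text is repeated, so the first one or two rows can be pasted together into column names. Irregular tables may still need page-specific handling.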
Guest *Sam I Am* @ 2010-11-08 04:20:42 originally posted:
Very nice. I am faced with a related problem. Some of the columns in the table are in a foreign language. How can I skip those? Thanks.
I'm not sure what "foreign" means here. Regular expressions can deal with text in a very flexible way. For example, the 68th line above showed how to filter out the characters which are not digits. If you want to remove the characters which are not in, say, a-zA-Z0-9, you may use `gsub('[^a-zA-Z0-9]', '', your.string)`. See `?regexp` for details.
Originally posted on 2010-11-08 08:40:33
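As a small illustration of the reply above, here is a sketch that assumes "foreign" means "contains characters outside printable ASCII", which is only a guess at what the question meant. The data frame `tab` and its column names are made up to stand in for a scraped table.

```r
# A made-up stand-in for a scraped table
tab <- data.frame(
  id    = c("1", "2"),
  name  = c("Beijing", "Shanghai"),
  local = c("\u5317\u4eac", "\u4e0a\u6d77"),
  stringsAsFactors = FALSE
)

# TRUE for each column in which any value has a character outside
# the printable ASCII range (space to tilde)
has_non_ascii <- vapply(
  tab,
  function(x) any(grepl("[^\\x20-\\x7E]", x, perl = TRUE)),
  logical(1)
)

# Keep only the columns that are plain ASCII
ascii_only <- tab[, !has_non_ascii, drop = FALSE]

# The gsub() call from the reply, applied to a single string:
# everything except letters and digits is removed
cleaned <- gsub("[^a-zA-Z0-9]", "", "Price: 1,234 USD")
```

With this input, `ascii_only` keeps the `id` and `name` columns, and `cleaned` is `"Price1234USD"`.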
Guest *Janko* @ 2010-11-24 19:42:50 originally posted:
Has anyone found anything useful yet for scraping webpages where some information is "hidden" by JavaScript (that is, all pages involving AJAX)? I know that it's possible to scrape such pages with Ruby in combination with Watir, but I have been looking for clues on how to do it from R for quite a while now. Does anyone know if cURL (RCurl) is able to do it somehow?
Guest *Mitchell Wachtel* @ 2011-03-09 09:41:30 originally posted:
What if the tables are generated from values filled into a form?
This is the webpage:
http://www.tceq.texas.gov/cgi-bin/compliance/monops/yearly_summary.pl
You have to fill in two or three boxes to get to a table.
How do you do this using XML?
Sorry, I have no idea. I was asked a similar question in the 3rd comment above.
Originally posted on 2011-03-09 12:32:12
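One possible direction for form-driven pages like the one asked about above is to submit the form from R and then parse the response as usual. The following is only a sketch under several assumptions: the form accepts an ordinary POST request (it may use GET instead), the field names `year` and `site` are placeholders rather than the real form's fields (the real names would have to be read from the page's HTML source), and the helper names `tables_from_html` and `fetch_form_tables` are made up. It uses `postForm()` from RCurl and `readHTMLTable()` from XML.

```r
library(XML)

# Parse every <table> in an HTML string into a list of data frames.
tables_from_html <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Submit the form fields with a URL-encoded POST request and parse the
# tables in the response. Requires the RCurl package to be installed.
fetch_form_tables <- function(url, fields) {
  html <- RCurl::postForm(url, .params = fields, style = "POST")
  tables_from_html(html)
}

# Hypothetical usage; the field names below are placeholders only:
# tabs <- fetch_form_tables(
#   "http://www.tceq.texas.gov/cgi-bin/compliance/monops/yearly_summary.pl",
#   list(year = "2010", site = "some-site-id")
# )
```

Pages that build their tables with JavaScript after loading would not be covered by this approach.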
Guest *Danny* @ 2011-12-22 08:07:14 originally posted:
Hi, Yihui.
It seems that it is very easy to get data from static web pages. Could R get data from dynamic web pages just as easily?
Many thanks.
Guest *reretry* @ 2012-08-22 08:05:03 originally posted:
Thank you very much.
Your posting is very helpful to me.
Guest *Martin* @ 2014-06-18 23:42:25 originally posted:
Thanks! This helped me.
Guest *fatma* @ 2016-02-19 07:29:50 originally posted:
I am trying to extract data from the table on the website "http://www.atb.com.tn/devise", but nothing works. This is the code:
library(rvest)
library(httr)
sess <- html_session("http://www.atb.com.tn/", user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"))
pg <- jump_to(sess, "http://www.atb.com.tn/devise")
dat <- content(pg$response, as = "parsed", encoding = "UTF-8")
table <- html_table(html_nodes(dat, "table")[[2]], header = TRUE)
The last line produces an error:
no applicable method for 'html_table' applied to an object of class "xml_node"
Thanks.