Grabbing Tables in Webpages Using the XML Package | /en/2010/10/grabbing-tables-in-webpages-using-the-xml-package/

yihui 2022-12-16 19:33:26

https://yihui.org/en/2010/10/grabbing-tables-in-webpages-using-the-xml-package/

10 Comments

giscus-bot 2022-12-16 19:33:27

Guest *Larry (IEOR Tools)* @ 2010-10-25 16:54:08 originally posted:

Great post Yihui. I too made a similar post about using XML to grab table data.

http://industrialengineertools.blogspot.com/2010/08/ieor-tools-tutorial-learning-xml-with-r.html

I wish to increase my knowledge of this package. XML with R is great!

yihui 2022-12-16 19:33:36

That's pretty interesting. Thanks!

Originally posted on 2010-10-25 17:39:32

giscus-bot 2022-12-16 19:33:27

Guest *Tieming* @ 2010-10-30 02:47:09 originally posted:

Good post. I found it to be useful.

giscus-bot 2022-12-16 19:33:28

Guest *Susan* @ 2010-10-30 19:19:51 originally posted:

I've been having trouble getting the "names" rows out when there are spanning elements - it appears R reads in each cell without regard to how many columns it takes up, and then puts "NA" in for any remaining empty cells. Any ideas as to how to make that work? It gets particularly bad when "rowspan" and "colspan" in the html are both not equal to 1.

yihui 2022-12-16 19:33:37

Tables with rowspan or colspan are certainly not straightforward to deal with. I don't have an immediate solution either. What I can think of is that we can go back to the "stone age" -- manipulating the text directly. This is a rather bad idea and difficult to generalize (e.g. you can deal with this page in this way, but it does not work on that page). Sorry.

Originally posted on 2010-11-01 19:46:36

giscus-bot 2022-12-16 19:33:39

Guest *Susan* @ 2010-11-14 01:30:11 originally posted:

OK, but is there no way for readHTMLTable() to use rowspan and colspan to modify its tables? I'm not sure how to get a moderately flexible method for doing that, as the tables I'm reading change quite a bit. It's really very frustrating. Do you know whether there might be a method using regexp to read the tables?

yihui 2022-12-16 19:33:40

Regular expressions might help, and I am not sure if XML is a better way. I feel regexp is a "darker" way... Maybe you can try a hybrid: use XML to extract the table elements and use regexp to deal with (row|col)span. I also suggest you write to Duncan Temple Lang for his wisdom.

Originally posted on 2010-11-14 04:23:01
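
One way to sketch the hybrid idea from the reply above: let the XML package find the cells and read their `rowspan`/`colspan` attributes, then copy each cell's text into every slot it covers. The helper name `expand_spans` and the toy table are made up for illustration; this is not something the XML package provides, and it assumes a simple table with no nested tables and no overlapping spans.

```r
library(XML)

# Hypothetical helper (not part of the XML package): build a character
# matrix from a <table> node, repeating each cell over the rows and
# columns it spans.
expand_spans <- function(table_node) {
  rows <- getNodeSet(table_node, ".//tr")
  grid <- matrix(NA_character_, nrow = length(rows), ncol = 0)
  for (i in seq_along(rows)) {
    j <- 1
    for (cell in getNodeSet(rows[[i]], "./td | ./th")) {
      # skip slots already filled by a rowspan from an earlier row
      while (j <= ncol(grid) && !is.na(grid[i, j])) j <- j + 1
      rs <- as.integer(xmlGetAttr(cell, "rowspan", "1"))
      cs <- as.integer(xmlGetAttr(cell, "colspan", "1"))
      last_col <- j + cs - 1
      if (last_col > ncol(grid)) {
        # grow the matrix to the right when a row turns out to be wider
        extra <- matrix(NA_character_, nrow(grid), last_col - ncol(grid))
        grid <- cbind(grid, extra)
      }
      last_row <- min(i + rs - 1, nrow(grid))
      txt <- xmlValue(cell)
      if (length(txt) == 0) txt <- ""  # guard against empty cells
      grid[i:last_row, j:last_col] <- trimws(txt)
      j <- last_col + 1
    }
  }
  grid
}

# Toy table (made up) with one rowspan and one colspan in the header
html <- '<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tbl <- getNodeSet(doc, "//table")[[1]]
grid <- expand_spans(tbl)
```

Here the two header rows end up aligned with the data row, so they could be pasted together to form column names. Real pages will likely need more care (e.g. malformed span values), so treat this as a starting point only.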

giscus-bot 2022-12-16 19:33:29

Guest *Sam I Am* @ 2010-11-08 04:20:42 originally posted:

Very nice. I am faced with a related problem. Some of the columns in the table are in a foreign language. How can I skip those? Thanks.

yihui 2022-12-16 19:33:38

I'm not sure what "foreign" means here. Regular expressions can deal with text in a very flexible way. For example, the 68th line above showed how to filter out the characters which are not digits. If you want to remove the characters which are not in, say, a-zA-Z0-9, you may use `gsub('[^a-zA-Z0-9]', '', your.string)`. See `?regexp` for details.

Originally posted on 2010-11-08 08:40:33
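
If "foreign" means "contains non-ASCII characters", a rough sketch along the lines of the reply above might look like this. The data frame `tab` is made up to stand in for a scraped table, and the choice to treat any non-ASCII character as foreign is an assumption, not something from the original post.

```r
# Toy data standing in for a scraped table (made up for illustration)
tab <- data.frame(
  city    = c("Beijing", "Shanghai"),
  name_cn = c("\u5317\u4eac", "\u4e0a\u6d77"),
  pop     = c("21,540,000", "24,870,000"),
  stringsAsFactors = FALSE
)

# TRUE for columns in which no value contains a non-ASCII character
ascii_only <- vapply(
  tab,
  function(x) !any(grepl("[^\\x01-\\x7F]", x, perl = TRUE)),
  logical(1)
)

# keep only the ASCII columns
kept <- tab[, ascii_only, drop = FALSE]

# the gsub() call from the reply: strip everything except letters and digits
kept$pop <- as.numeric(gsub("[^a-zA-Z0-9]", "", kept$pop))
```

The same `grepl()` test could be applied to column names instead of values if the foreign text only appears in the header.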

giscus-bot 2022-12-16 19:33:30

Guest *Janko* @ 2010-11-24 19:42:50 originally posted:

Has any of you found anything useful yet on scraping webpages where some information is "hidden" by JavaScript (i.e. all pages involving AJAX)? I know that it's possible to scrape such pages with Ruby in combination with Watir, but I have been looking for clues on how to do it from R for quite a while now. Does any of you know if cURL (RCurl) is able to do it somehow?

giscus-bot 2022-12-16 19:33:31

Guest *Mitchell Wachtel* @ 2011-03-09 09:41:30 originally posted:

What if the tables are determined by what you fill into a box?

This is the webpage:

http://www.tceq.texas.gov/cgi-bin/compliance/monops/yearly_summary.pl

You have to fill in two or three boxes to get to a table.

How do you do this using XML?

yihui 2022-12-16 19:33:41

Sorry, I have no idea. I was asked a similar question in the 3rd comment above.

Originally posted on 2011-03-09 12:32:12

giscus-bot 2022-12-16 19:33:32

Guest *Danny* @ 2011-12-22 08:07:14 originally posted:

Hi, Yihui.
It seems that it is very easy to get data from static webpages. Could R get data from dynamic webpages easily, too?
Many thanks.

giscus-bot 2022-12-16 19:33:33

Guest *reretry* @ 2012-08-22 08:05:03 originally posted:

Thank you very much.
Your posting is very helpful to me.

giscus-bot 2022-12-16 19:33:34

Guest *Martin* @ 2014-06-18 23:42:25 originally posted:

Thanks! This helped me.

giscus-bot 2022-12-16 19:33:35

Guest *fatma* @ 2016-02-19 07:29:50 originally posted:

I am using the website "http://www.atb.com.tn/devise" to extract data from the table, but nothing works. This is the code:

sess <- html_session("http://www.atb.com.tn/", user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"))

pg <- jump_to(sess, "http://www.atb.com.tn/devise")

dat <- content(pg$response, as="parsed", encoding= "UTF-8")

table <- html_table(html_nodes(dat, "table")[[2]], header=TRUE)

The last line gives an error:
no method to 'html_table' applicable to a class object "xml_node"
Thanks.