My Biggest Regret in the knitr Package - UTF-8, and UTF-8 only, or we cannot be friends | /en/2018/11/biggest-regret-knitr/

yihui 2022-12-17 04:32:54

https://yihui.org/en/2018/11/biggest-regret-knitr/

3 Comments

giscus-bot 2022-12-17 04:32:55

Guest *Joshua Goldberg* @ 2018-11-11 05:07:57 originally posted:

Funny. As you write this, I am having a headache trying to read a basic CSV file with pandas in Python. I keep getting errors like this: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 3325: invalid start byte." The culprit is typically backslashes or characters that do not belong in the field being parsed. But in this case, line 3325 is: "278002";"0425105113";"0". So I am not sure what's happening. The CSV can be found here: http://www2.informatik.uni-freiburg.de/~cziegler/BX/.

I am not posting this to ask for help. Rather, and this is the ironic part, readr's read_delim() had no problem parsing this exact file! It seems the heuristics built into some of R's CSV readers are more robust than those in Python/pandas (or the default options are just better). However, in those same data files, there are country names with accent marks that cannot be decoded as UTF-8 (I think), so read_delim() gives them a funny value, but it still continues parsing.

Edit: encoding = "ISO-8859-1" seemed to work...I have no clue why.

Anyway, dealing with encoding bugs is quite annoying and a time sink with little value gained after you make it out alive. A common ground that everyone agrees on would be a huge win.
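A possible explanation for why `encoding = "ISO-8859-1"` works, sketched in Python with made-up bytes (the `RAW` sample and the helper `first_undecodable_offset` are illustrative assumptions, not taken from the dataset): the byte 0xBA cannot start a UTF-8 sequence, but ISO-8859-1 assigns a character to every byte value, so decoding with it cannot fail. Note also that the "position" in the error message is a byte offset, not a line number, which may be why line 3325 looked innocent.

```python
# Illustrative bytes only: a semicolon-delimited row containing the single
# byte 0xBA, which is the character "º" in ISO-8859-1 (Latin-1).
RAW = b'"278002";"0425105113";"0"\n"278003";"N\xba 5";"0"\n'


def first_undecodable_offset(raw, encoding="utf-8"):
    """Return the byte offset of the first decoding failure, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start  # a byte offset into `raw`, not a line number
    return None


# UTF-8 rejects 0xBA here because it is a continuation byte with no lead byte.
offset = first_undecodable_offset(RAW)
print(offset, hex(RAW[offset]))

# ISO-8859-1 maps each of the 256 byte values to a code point, so this
# decode always succeeds, whether or not the file is really Latin-1.
text = RAW.decode("iso-8859-1")
print(text.splitlines()[1])
```

Because an ISO-8859-1 decode never raises, success alone does not prove the file was written in that encoding; it could just as well be Windows-1252 or something similar, in which case some characters would come out wrong without any error. In pandas, the corresponding call would be along the lines of `pd.read_csv(path, sep=";", encoding="ISO-8859-1")`.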

giscus-bot 2022-12-17 04:35:00

Guest *Ljupcho Naumov* @ 2020-06-07 19:22:33 originally posted:

This was the solution in my case as well. "ISO-8859-1" seems to solve troubles with German letters.

giscus-bot 2022-12-17 04:32:55

Guest *Shrek, Tan* @ 2018-11-11 15:12:57 originally posted:

Would it be feasible to add a function that defaults to UTF-8, e.g., knitr::knit_utf8()?

@shrektan

yihui 2022-12-17 04:32:57

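That is not the crux of the problem. The crux is how to move all users smoothly into the UTF-8 world, that is, how to cause the least possible disruption to Windows users who have not been using UTF-8 encoding.

(Originally posted on 2018-11-11 17:12:48)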

giscus-bot 2022-12-17 04:34:58

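Guest *Shrek, Tan* @ 2018-11-12 14:23:47 originally posted:

What I mean is that the function knitr::knit_utf8() could be added now, with everyone advised to switch to it, and then knitr::knit() could gradually be deprecated and made defunct…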

@shrektan

yihui 2022-12-17 04:34:59

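knit is a simple verb that is easy to type and has been popular for years. Re-educating users to switch to another name that is harder to type would be, I think, both difficult and unnecessary. Rather than introducing a new function (which would set off a series of earthquakes, not only in the knitr package itself but also in downstream packages and editors), it would be better to gradually drop support for non-UTF-8 encodings inside knit (e.g., warn first, then raise an error, and finally remove the encoding argument).

(Originally posted on 2018-11-12 15:43:48)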
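The staged phase-out described in this reply (warn first, then raise an error, and finally remove the encoding argument) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not knitr's actual R implementation: the function `decode_source` and its `stage` switch are hypothetical names invented for the example.

```python
import warnings


def decode_source(raw, encoding="UTF-8", stage="warn"):
    """Decode source bytes, discouraging any encoding other than UTF-8.

    Hypothetical sketch: `stage` stands in for successive releases.
    "warn" is the first release of the phase-out, "error" the second;
    in the last step the `encoding` parameter would be removed entirely.
    """
    is_utf8 = encoding.lower().replace("_", "-") in ("utf-8", "utf8")
    if not is_utf8:
        msg = (
            f"encoding={encoding!r} is deprecated; "
            "please convert the file to UTF-8"
        )
        if stage == "error":
            raise ValueError(msg)
        warnings.warn(msg, DeprecationWarning, stacklevel=2)
    return raw.decode(encoding)


# Stage 1: a non-UTF-8 encoding still works but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_text = decode_source("é".encode("latin-1"), encoding="latin-1")
print(legacy_text, len(caught))
```

The point of keeping the same function name throughout is that existing callers who already use UTF-8 never notice the transition; only callers who pass a different encoding are nudged, and later forced, to convert their files.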

giscus-bot 2022-12-17 04:32:56

Guest *TC Zhang* @ 2018-11-13 00:11:39 originally posted:

I don't like it when a company wants to create some "unique" things, things that don't play well with the rest of the world. The encoding nightmare of Windows is one of them. The ports on all Apple products also fall into this category.