What Is the Difference Between NaN (Not a Number), NaT (Not a Time), None, and unknown Values in Python?


scienceportal.jst.go.jp

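Research is said to benefit from a playful spirit: by "fooling around in earnest," researchers can uncover the scientific principles behind things people take for granted. This study began the same way. A group at the University of Vienna including Stephan Reber, who had been studying birds, turned their attention to reptiles as a whole (the group of which birds are one lineage) and reasoned that "crocodilians vocalize just as birds do, so having one inhale helium should tell us something." By analyzing the calls made after inhaling helium, they showed that crocodilians, which are reptiles, produce sound through the resonance of air in the vocal tract, the same principle used by birds and mammals.

Ig Nobel Prize for Explaining How Crocodilians Vocalize: Playfulness and Personal Connections Lead to a New Discovery | Science Portal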

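⇧ I had heard that a Japanese researcher won the prize, but it turns out the person who started the project was from abroad.

Anyway, this post is about Python.

Let's try it.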

What is a NaN (Not a Number) value in Python?

According to the official documentation,

docs.python.org

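The IEEE 754 special values of NaN, inf, and -inf will be handled according to IEEE rules. Specifically, NaN is not considered close to any other value, including NaN. inf and -inf are only considered close to themselves.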

https://docs.python.org/ja/3/library/math.html#math.isclose

⇧ So NaN is one of the "IEEE 754 special values" and is handled according to the IEEE rules.
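The behavior described in the quote can be checked with a minimal sketch like the one below. It uses only the standard library; the variable names are made up for illustration.

```python
import math

nan = float("nan")
inf = float("inf")

# NaN is not equal to, or close to, anything, including itself.
nan_equals_itself = nan == nan
nan_close_to_itself = math.isclose(nan, nan)

# inf and -inf are only considered close to themselves.
inf_close_to_itself = math.isclose(inf, inf)
inf_close_to_neg_inf = math.isclose(inf, -inf)

print(nan_equals_itself, nan_close_to_itself)      # False False
print(inf_close_to_itself, inf_close_to_neg_inf)   # True False
```

Because `nan == nan` is False, an equality test cannot be used to detect NaN.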

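What is "IEEE"?

IEEE (pronounced "I triple E", Institute of Electrical and Electronics Engineers) is a professional association and technical standards body for the fields of electrical engineering and information technology, headquartered in the United States.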

IEEE - Wikipedia

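⇧ So IEEE is a technical standards body. That would suggest that any programming language that follows IEEE 754, not just Python, handles NaN (Not a Number) in the same way.

According to Wikipedia, IEEE 754 is described as follows:

IEEE 754 (read "I triple E seven five four" or "I triple E seven hundred fifty-four") is, as its title "IEEE Standard for Floating-Point Arithmetic" indicates, one of the IEEE standards, and it is a standard for floating-point arithmetic.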

IEEE 754 - Wikipedia

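As the GNU coreutils manual puts it, "Almost all modern systems use IEEE-754 floating point": it is the floating-point format (specification) used by nearly all modern systems, and it is adopted by much hardware, such as FPUs inside or separate from the processor, and by software such as floating-point arithmetic libraries.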

IEEE 754 - Wikipedia

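Note that the specifications of many programming languages and their implementations do not explicitly state that processing follows IEEE 754. This is out of consideration for the cost of implementing them on hardware that uses a different specification.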

IEEE 754 - Wikipedia

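Even with such languages and implementations, however, if the actual system conforms to IEEE 754, the result ends up conforming to IEEE 754. There are a few languages, such as Java and C#, whose language specifications explicitly mention IEEE 754 (though strictly implementing such a specification is not always easy).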

IEEE 754 - Wikipedia

⇧ So, surprisingly, the behavior can end up depending on the hardware...

That was a digression, but at least in Python's math module, NaN (Not a Number) values are handled according to the IEEE 754 rules.

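CPython implementation detail: The math module consists mostly of thin wrappers around the platform C math library functions. Behavior in exceptional cases follows Annex F of the C99 standard where appropriate. The current implementation will raise ValueError for invalid operations like sqrt(-1.0) or log(0.0) (where C99 Annex F recommends signaling invalid operation or divide-by-zero), and OverflowError for results that overflow (for example, exp(1000.0)). A NaN will not be returned from any of the functions above unless one or more of the input arguments was a NaN; in that case, most functions will return a NaN, but (again following C99 Annex F) there are some exceptions to this rule, for example pow(float('nan'), 0.0) or hypot(float('nan'), float('inf')). Translator's note: if a result is returned without an exception being raised, you may be slow to notice that a calculation went wrong because a complex number was passed in.

Note that Python makes no effort to distinguish signaling NaNs from quiet NaNs, and behavior for signaling NaNs remains unspecified. Typical behavior is to treat all NaNs as though they were quiet.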

https://docs.python.org/ja/3/library/math.html#math.nan

⇧ Internally it depends on the C implementation, so there are a few caveats...
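The cases named in the quoted note can be tried out directly. This is a small sketch assuming CPython; `outcome` is a hypothetical helper written only for this illustration.

```python
import math

nan = float("nan")
inf = float("inf")


def outcome(func, *args):
    """Return repr(func(*args)), or the exception class name if it raises."""
    try:
        return repr(func(*args))
    except (ValueError, OverflowError) as exc:
        return type(exc).__name__


results = {
    "sqrt(-1.0)": outcome(math.sqrt, -1.0),     # invalid operation
    "log(0.0)": outcome(math.log, 0.0),         # divide-by-zero style error
    "exp(1000.0)": outcome(math.exp, 1000.0),   # overflow
    "sqrt(nan)": outcome(math.sqrt, nan),       # NaN in, NaN out
    "pow(nan, 0.0)": outcome(math.pow, nan, 0.0),    # exception to the rule
    "hypot(nan, inf)": outcome(math.hypot, nan, inf),  # exception to the rule
}

for expression, result in results.items():
    print(f"math.{expression} -> {result}")
```

The last two entries are the exceptions mentioned in the documentation: a NaN argument does not always produce a NaN result.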

The math module provides

math.isnan(x)

Return True if x is a NaN (not a number), and False otherwise.

https://docs.python.org/ja/3/library/math.html#math.isnan

⇧ a function that checks whether a value is NaN (Not a Number).
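A short sketch of how `math.isnan` behaves, using only the standard library. Note that it accepts real numbers only, so passing `None` raises an exception rather than returning False.

```python
import math

values = [1.0, float("nan"), float("inf")]
flags = [math.isnan(v) for v in values]
print(flags)  # [False, True, False]

# None is not a number at all, so math.isnan rejects it.
try:
    math.isnan(None)
    none_error = None
except TypeError as exc:
    none_error = type(exc).__name__
print(none_error)  # TypeError
```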

Python's third-party libraries also provide functions and methods that check whether a value is NaN (Not a Number).

■Numpy

■Pandas

■Pandas.DataFrame

■Pandas.Series
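As a rough sketch of the four items above, the calls below are the ones I would expect each heading to refer to: `numpy.isnan`, `pandas.isna`, `DataFrame.isna`, and `Series.isna`. This assumes NumPy and pandas are installed; the sample data is made up.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise NaN check on an array of floats.
arr = np.array([1.0, np.nan, 3.0])
numpy_flags = np.isnan(arr).tolist()

# pandas (top-level function): also treats None and NaT as missing.
scalar_flags = [bool(pd.isna(v)) for v in (np.nan, None, pd.NaT, 1.0)]

# pandas.Series: element-wise missing-value check.
ser = pd.Series([1.0, np.nan, 3.0])
series_flags = ser.isna().tolist()

# pandas.DataFrame: one boolean per cell.
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
frame_flags = df.isna().to_numpy().tolist()

print(numpy_flags)   # [False, True, False]
print(scalar_flags)  # [True, True, True, False]
print(series_flags)  # [False, True, False]
print(frame_flags)   # [[False, False], [True, True]]
```

Unlike `math.isnan`, `pandas.isna` accepts `None` and `NaT` as well, which makes it convenient for mixed data.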

As for the difference between a pandas DataFrame and a Series,

qiita.com

⇧ the site above explains it in detail.

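What is a NaT (Not a Time) value in Python?

It does not seem to exist in standard Python. It appears to be the value that represents a missing or invalid datetime in the datetime64 type, which shows up when you use a third-party library such as NumPy.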

numpy.org

The most basic way to create datetimes is from strings in ISO 8601 date or datetime format. The unit for internal storage is automatically selected from the form of the string, and can be either a date unit or a time unit. The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’), and some additional SI-prefix seconds-based units. The datetime64 data type also accepts the string “NAT”, in any combination of lowercase/uppercase letters, for a “Not A Time” value.

https://numpy.org/doc/stable/reference/arrays.datetime.html#datetimes-and-timedeltas

⇧ So the documentation says that the datetime64 data type can hold a NaT (Not a Time) value. However,

when I checked the data types that standard Python provides,

docs.python.org

⇧ there is no data type called datetime64.

Starting in NumPy 1.7, there are core array data types which natively support datetime functionality. The data type is called “datetime64”, so named because “datetime” is already taken by the datetime library included in Python.

https://numpy.org/doc/stable/reference/arrays.datetime.html#datetimes-and-timedeltas

⇧ So datetime64 is a data type that was introduced in NumPy 1.7.
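A minimal sketch of creating and detecting NaT, assuming NumPy is installed. `numpy.isnat` is the check NumPy provides for datetime64 and timedelta64 values; the dates are arbitrary examples.

```python
import numpy as np

# "NaT" is accepted in any combination of upper and lower case.
nat = np.datetime64("NaT")
nat_lower = np.datetime64("nat")
day = np.datetime64("2021-01-14")

scalar_flags = [bool(np.isnat(v)) for v in (nat, nat_lower, day)]
print(scalar_flags)  # [True, True, False]

# NaT can also sit inside a datetime64 array next to real dates.
dates = np.array(["2021-01-14", "NaT"], dtype="datetime64[D]")
array_flags = np.isnat(dates).tolist()
print(array_flags)  # [False, True]
```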

What is a None value in Python?

From what I could find online,

stackoverflow.com

None is just a value that commonly is used to signify 'empty', or 'no value here'. It is a signal object; it only has meaning because the Python documentation says it has that meaning.

python - What is a None value? - Stack Overflow

This is what the Python documentation has got to say about None:

The sole value of types.NoneType. None is frequently used to represent the absence of a value, as when default arguments are not passed to a function.

Changed in version 2.4: Assignments to None are illegal and raise a SyntaxError.

Note: The names None and __debug__ cannot be reassigned (assignments to them, even as an attribute name, raise SyntaxError), so they can be considered “true” constants.

python - What is a None value? - Stack Overflow

⇧ Hmm...

The Python documentation quoted above is for Python 2, so I checked the Python 3 version.

docs.python.org

None

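The sole value of the type NoneType. None is frequently used to represent the absence of a value, as when default arguments are not passed to a function. Assignments to None are illegal and raise a SyntaxError.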

https://docs.python.org/ja/3/library/constants.html

⇧ Hmm... I still do not fully get it, but at least I understand that None is a constant built into Python.
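The documentation's description can be illustrated with a small sketch. The `greet` and `no_return` functions are hypothetical examples, not taken from any library.

```python
def greet(name=None):
    """Use None as the 'no argument given' marker for an optional parameter."""
    if name is None:  # identity check is the usual way to test for None
        return "Hello, stranger"
    return f"Hello, {name}"


def no_return():
    """A function without a return statement implicitly returns None."""


none_type_name = type(None).__name__
default_greeting = greet()
named_greeting = greet("Bob")
implicit_result = no_return()

print(none_type_name)           # NoneType
print(default_greeting)         # Hello, stranger
print(named_greeting)           # Hello, Bob
print(implicit_result is None)  # True
```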

What is the difference between NaN (Not a Number), NaT (Not a Time), and None in Python?

So, what is the difference between NaN (Not a Number), NaT (Not a Time), and None in Python?

I found information on NaN (Not a Number) versus None.

stackoverflow.com

Below are the differences:

nan belongs to the class float
None belongs to the class NoneType

I found the below article very helpful: https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31

python - What is a None value? - Stack Overflow

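⇧ So they belong to different classes, which means their types differ.

NaT (Not a Time) is a value of a data type provided by the third-party library NumPy, so to summarize:

NaN (Not a Number): the float type in standard Python
None: the NoneType type in standard Python
NaT (Not a Time): the datetime64 type in the third-party library NumPy

Those are the differences.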
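The summary above can be confirmed by asking each value for its type. This is a sketch assuming NumPy is installed; the variable names are illustrative.

```python
import numpy as np

nan = float("nan")
nat = np.datetime64("NaT")

type_names = {
    "NaN": type(nan).__name__,
    "None": type(None).__name__,
    "NaT": type(nat).__name__,
}

for label, type_name in type_names.items():
    print(f"{label}: {type_name}")
# NaN: float
# None: NoneType
# NaT: datetime64
```

Because the three values have different types, each one needs its own kind of check (for example `math.isnan`, `is None`, and `numpy.isnat` respectively).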

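Python is a dynamically typed language, so you can write code without thinking about types, but in the end it gets painful if you do not keep types in mind...

What is an unknown value in Python?

What does it mean for a value to be "unknown" in Python?

I searched online, but I am not good at finding things and no useful information turned up. At first glance, it looks like it might be something specific to the third-party library NumPy?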

numpy.org

In order to be able to develop an intuition about what computation will be done by various NumPy functions, a consistent conceptual model of what a missing element means must be applied. Ferreting out the behaviors people need or want when they are working with “missing data” seems to be tricky, but I believe that it boils down to two different ideas, each of which is internally self-consistent.

https://numpy.org/neps/nep-0012-missing-data.html

One of them, the “unknown yet existing data” interpretation, can be applied rigorously to all computations, while the other makes sense for some statistical operations like standard deviation but not for linear algebra operations like matrix product. Thus, making “unknown yet existing data” be the default interpretation is superior, providing a consistent model across all computations, and for those operations where the other interpretation makes sense, an optional parameter “skipna=” can be added.

https://numpy.org/neps/nep-0012-missing-data.html

⇧ The author of this NumPy enhancement proposal (NEP 12) says that there are two ways to interpret missing data, namely

Unknown Yet Existing Data (NA)
Data That Doesn’t Exist Or Is Being Skipped (IGNORE)

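Those are the two interpretations listed.

For a detailed explanation, please see the NumPy page above. In any case, "unknown" does not seem to be provided as a data type or value in standard Python either.

Also,

qiita.com

⇧ according to the site above, when replacing missing values in the data with the pandas DataFrame.fillna() method, the value "unknown" is passed as an argument. However,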

pandas.pydata.org

Parameters:

value: scalar, dict, Series, or DataFrame: Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

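⇧ looking at what can be passed to the relevant parameter, value, the options are:

a scalar ※
a dict (standard Python)
a Series (third-party library pandas)
a DataFrame (third-party library pandas)

※ The official documentation does not spell out what counts as a scalar, but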

jpt-pynotes.readthedocs.io

Python’s types are similar to what you’d find in other dynamic languages. This section will move pretty quickly, just showing off the major types and an example or two of their usage. It might be worth looking over Python’s built-in types documentation.

Scalar Types — Python Notes 0.1 documentation

We’ll start with the scalar types. A scalar is a type that can have a single value such as 5, 3.14, or ‘Bob’.

The commonly used scalar types in Python are:

int: Any integer.
float: Floating point number (64 bit precision)
complex: Numbers with an optional imaginary component.
bool: True, False
str: A sequence of characters (can contain unicode characters).
bytes: A sequence of unsigned 8-bit entities, used for manipulating binary data.
NoneType (None): Python’s null or nil equivalent, every instance of None is of NoneType.

Scalar Types — Python Notes 0.1 documentation

⇧ According to the site above, the seven types int, float, complex, bool, str, bytes, and NoneType are treated as scalar types. Since str is on that list, a string would count as a scalar too.
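pandas itself exposes a helper, `pandas.api.types.is_scalar`, that reports what it considers a scalar. The sketch below assumes pandas is installed and simply probes a few sample values.

```python
from pandas.api.types import is_scalar

candidates = {
    "int": 0,
    "float": 3.14,
    "str": "unknown",
    "None": None,
    "list": [1, 2],
    "dict": {"a": 1},
}

scalar_flags = {label: is_scalar(value) for label, value in candidates.items()}

for label, flag in scalar_flags.items():
    print(f"{label}: {flag}")
# int: True
# float: True
# str: True
# None: True
# list: False
# dict: False
```

So, as far as pandas is concerned, a plain string such as "unknown" is a scalar.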

That was a digression. The description of what can be passed to value in the pandas DataFrame.fillna() method is:

a value to use to fill holes (e.g. 0)
a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame)

※Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

That is all it says, so where did the idea of passing "unknown" come from?

That is where I got stuck...

Then I looked at Kaggle, which provides sample data for data science and more, and found this:

www.kaggle.com

Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":

https://www.kaggle.com/residentmario/data-types-and-missing-values

The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.

https://www.kaggle.com/residentmario/data-types-and-missing-values

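⇧ I do not fully understand it, but apparently any value that works as a "sentinel value" is fine. In other words, "Unknown" here appears to be just an ordinary string chosen as a placeholder, not a special value.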
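A small sketch of what the Kaggle passage describes, assuming NumPy and pandas are installed. The `regions` data and the "Undisclosed" sentinel are made up for illustration; they are not taken from the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one sentinel string.
regions = pd.Series(["Tuscany", np.nan, "Undisclosed"])

# fillna(): replace each NaN with the ordinary string "Unknown".
filled = regions.fillna("Unknown")

# replace(): normalize another sentinel string to the same placeholder.
normalized = filled.replace("Undisclosed", "Unknown")

print(filled.tolist())      # ['Tuscany', 'Unknown', 'Undisclosed']
print(normalized.tolist())  # ['Tuscany', 'Unknown', 'Unknown']
```

Any other string would work the same way; "Unknown" is only a label that readers of the data can recognize.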

What is a "sentinel value"?

In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically in a loop or recursive algorithm.