
Since I expect to write code in this blog from now on, I searched for "wordpress" and "code", found a plugin called Highlighting Code Block on one of the sites that came up, and installed it.

This time the goal is simply to look over the data, so I'll write down code and observations as I go. First, load the data.

import pandas as pd

# Load the required data. Put the path to the folder where you saved the files inside the parentheses.
train_df = pd.read_csv('/kaggle/input/sales_train.csv')
test_df = pd.read_csv('/kaggle/input/test.csv')
item = pd.read_csv('/kaggle/input/items.csv')
shops = pd.read_csv('/kaggle/input/shops.csv')
categories = pd.read_csv('/kaggle/input/item_categories.csv')

Now let's look at the contents of each dataset.

# Check the columns (features) and the number of missing values in each dataset
print("train_df")
train_df.info()
print("----------------")
print("test_df")
test_df.info()
print("----------------")
print("item")
item.info()
print("----------------")
print("shops")
shops.info()
print("----------------")
print("categories")
categories.info()
print("----------------")

Here is the output of the code above.
train_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
----------------
test_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB
----------------
item
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB
----------------
shops
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   shop_name  60 non-null     object
 1   shop_id    60 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
----------------
categories
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ KB
----------------
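
One thing to note in the output above: the train_df section has no Non-Null Count column, because pandas leaves the counts out of info() by default for very large frames. A minimal sketch of counting missing values directly with isna().sum() follows. The toy_df frame and its values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, for illustration only).
toy_df = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", None],
    "shop_id": [59, 25, 25],
    "item_price": [999.0, None, 899.0],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

On the real data the same call would be train_df.isna().sum(). Recent pandas versions also accept train_df.info(show_counts=True) to force the counts to be printed.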

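Looking at train_df with train_df.head() gives the structure below. Only the first 5 rows are printed, but the output above shows there are 2,935,849 rows in total. Shops and items are recorded as IDs, so shops and item will need to be merged in to attach their details. item_price is the unit price; item_cnt_day is not obvious at a glance, so I'll need to check the Kaggle data description later. (The date_block_num column is missing from the pasted table below. Sorry for the confusion.)

	date	shop_id	item_id	item_price	item_cnt_day
0	02.01.2013	59	22154	999.00	1.0
1	03.01.2013	25	2552	899.00	1.0
2	05.01.2013	25	2552	899.00	-1.0
3	06.01.2013	25	2554	1709.05	1.0
4	15.01.2013	25	2555	1099.00	1.0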
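
The info() output earlier lists date as object, meaning it is stored as a string, and the rows above show it in day.month.year form. A small sketch of converting such strings to real datetimes, assuming every value follows that same format (toy_train is a hypothetical frame holding a few of the dates shown above):

```python
import pandas as pd

# A few date strings in the same day.month.year form as train_df["date"].
toy_train = pd.DataFrame({"date": ["02.01.2013", "03.01.2013", "15.01.2013"]})

# An explicit format avoids day/month being guessed the wrong way round.
toy_train["date"] = pd.to_datetime(toy_train["date"], format="%d.%m.%Y")

print(toy_train["date"].dt.day.tolist())
print(toy_train["date"].dt.month.tolist())
```

Without the explicit format, a value like 02.01.2013 could be read as February 1st instead of January 2nd.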

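Looking at test_df the same way gives the form below. I still need to check exactly what form the output should take, but predicting sales for these rows is the goal of this competition. This one also makes little sense until it is merged with the shop and item tables.

	ID	shop_id	item_id
0	0	5	5037
1	1	5	5320
2	2	5	5233
3	3	5	5232
4	4	5	5268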

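Next is the item table. It has a category ID, so it will need to be merged with the category table as well. And item_name is, of all things, apparently in Russian... In Titanic I extracted things like the leading characters and regrouped them, but what to do here? I may need to work out an approach as the analysis progresses.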

	item_name	item_id	item_category_id
0	! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D	0	40
1	!ABBYY FineReader 12 Professional Edition Full…	1	76
2	***В ЛУЧАХ СЛАВЫ (UNV) D	2	40
3	***ГОЛУБАЯ ВОЛНА (Univ) D	3	40
4	***КОРОБКА (СТЕКЛО) D	4	40

Here is the shop table. Russian again... The only information is the name; will it be useful for predicting sales?

	shop_name	shop_id
0	!Якутск Орджоникидзе, 56 фран	0
1	!Якутск ТЦ “Центральный” фран	1
2	Адыгея ТЦ “Мега”	2
3	Балашиха ТРК “Октябрь-Киномир”	3
4	Волжский ТЦ “Волга Молл”	4

Here are the categories. Again nothing but Russian names... Even after merging everything I can't picture it helping the sales prediction, but we'll see how it goes.

	item_category_name	item_category_id
0	PC – Гарнитуры/Наушники	0
1	Аксессуары – PS2	1
2	Аксессуары – PS3	2
3	Аксессуары – PS4	3
4	Аксессуары – PSP	4
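
The merges mentioned above could look roughly like the sketch below. It assumes the key columns seen in the info() output (item_id, shop_id, item_category_id) and uses small made-up frames in place of the real tables; all toy_ names and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_df, item, shops and categories (made-up values).
toy_train = pd.DataFrame({
    "shop_id": [59, 25],
    "item_id": [22154, 2552],
    "item_cnt_day": [1.0, -1.0],
})
toy_items = pd.DataFrame({
    "item_name": ["item A", "item B"],
    "item_id": [22154, 2552],
    "item_category_id": [37, 58],
})
toy_shops = pd.DataFrame({"shop_name": ["shop X", "shop Y"], "shop_id": [25, 59]})
toy_categories = pd.DataFrame({
    "item_category_name": ["cat P", "cat Q"],
    "item_category_id": [37, 58],
})

# Left joins keep every sales row even if a lookup table has no match.
merged = (
    toy_train
    .merge(toy_items, on="item_id", how="left")
    .merge(toy_shops, on="shop_id", how="left")
    .merge(toy_categories, on="item_category_id", how="left")
)
print(merged)
```

The items table is merged first because it supplies item_category_id, which the category merge needs as its key.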

Having looked through the data, let's check the goal of this competition at the URL below.

https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation

The page states "Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into [0,20] range.", so submissions are scored by RMSE, with the true target values clipped to the range [0, 20].
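
As a rough illustration of that metric, the sketch below clips the true values to [0, 20] and then computes RMSE. The function name clipped_rmse and the sample numbers are my own and not part of the competition; Kaggle's actual scoring code is not shown on the page.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping the true targets into [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The true value 50 is clipped to 20, so only the middle prediction is off (by 2).
score = clipped_rmse([0, 3, 50], [0, 1, 20])
print(score)
```

Because the targets never exceed 20 after clipping, predicting more than 20 can only increase the error.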

It also says "For each id in the test set, you must predict a total number of sales. The file should contain a header and have the following format:", so the task is to predict the quantity sold. sample_submission.csv has 214,200 rows, the same as test_df, so there is one prediction per ID, that is, per shop_id and item_id pair.
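
A minimal sketch of building a file in that shape, assuming the two columns are named ID and item_cnt_month as in sample_submission.csv (the column names are not quoted in this post, so treat them as an assumption and check the file). toy_test stands in for test_df, and the constant 0.0 is only a placeholder prediction.

```python
import pandas as pd

# Toy stand-in for test_df, using the first rows shown earlier.
toy_test = pd.DataFrame({
    "ID": [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [5037, 5320, 5233],
})

# One row per test ID with a placeholder prediction of 0.0.
submission = pd.DataFrame({"ID": toy_test["ID"], "item_cnt_month": 0.0})

# index=False keeps the DataFrame index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv instead of leaving it empty would write the file to disk.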

While I was at it, I also looked at the Description:

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms – 1C Company.

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

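So the task is to predict next month's sales for every product and store, using data from a Russian software firm. train_df is recorded by day, so it will need some processing here as well, such as aggregating to monthly totals.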
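
One way the daily rows might be rolled up to monthly totals is sketched below. It assumes date_block_num is a consecutive month index, which I have not confirmed in this post, and it clips the totals to [0, 20] to match the evaluation note. toy_train and its values are made up.

```python
import pandas as pd

# Toy daily sales rows (hypothetical values).
toy_train = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [25, 25, 59, 25],
    "item_id": [2552, 2552, 22154, 2552],
    "item_cnt_day": [1.0, -1.0, 30.0, 2.0],
})

# Sum daily counts per month, shop and item.
monthly = (
    toy_train
    .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Clip to the same [0, 20] range used for the true targets.
monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
print(monthly)
```

Here the sale and the -1.0 row for the same item cancel out to 0, and the total of 30 is clipped to 20.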

So next time I'll merge item, shop, and category into train_df and work through things like the data type of each column.

Tags: kaggle, predict future sales, python

Complete Beginner's Kaggle 2 - Taking a Look at the Data -
