A summary of what I learned about natural language processing for English


The following NLP article (in English) was very instructive.

Text Analysis & Feature Engineering with NLP

As practice, I tried natural language processing on the dataset from the Kaggle NLP competition Real or Not? NLP with Disaster Tweets.

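This article covers the following for English NLP: language detection, preprocessing with regular expressions, tokenization, stop words, stemming, lemmatization, feature creation from word, character, and sentence counts, sentiment analysis, annotation, word clouds, word embeddings, and LDA.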

The code is available here.

Importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
from IPython.display import display

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

import warnings
warnings.filterwarnings('ignore')

Library for detecting the language a text is written in

import langdetect

NLTK (Natural Language Toolkit), a natural language processing library, used here for English

import nltk

TextBlob().sentiment.polarity turns the sentiment of a text into a number

from textblob import TextBlob

spaCy is a library for natural language processing

import spacy

Library for word clouds

import wordcloud

API for downloading, getting information about, and loading datasets and models

import gensim.downloader as gensim_api

t-SNE is a tool for visualizing high-dimensional data

from sklearn import manifold

gensim is a scalable machine learning library aimed mainly at text analysis; it provides Word2Vec and Doc2Vec through a simple API.

import gensim

Loading the data

Download train.csv and test.csv from here.

path = os.getcwd() + "/"

# Load the train and test data
train = pd.read_csv(path + "train.csv")
test  = pd.read_csv(path + "test.csv")

print("train")
display(train.head(3))
display(train.tail(3))
display(train.shape)

print("test")
display(test.head(3))
display(test.tail(3))
display(test.shape)

Contents of the data


Checking the number of distinct values in the keyword and location columns

# Count the distinct values in the keyword and location columns
print("train data  keyword :{0}, location :{1}".format(train["keyword"].unique().shape[0], train["location"].unique().shape[0]))
print("test data  keyword :{0}, location :{1}".format(test["keyword"].unique().shape[0], test["location"].unique().shape[0]))

train data keyword :222, location :3342
test data keyword :222, location :1603

location has too many distinct values, so we drop it.

train = train.drop("location", axis=1)
test = test.drop("location", axis=1)

Counting the missing values in keyword

print(train["keyword"].isnull().sum())
print(test["keyword"].isnull().sum())

61
26

Drop the rows with missing values.

train = train.dropna(axis=0)
test = test.dropna(axis=0)

Distribution of keyword

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 10))

train["keyword"].reset_index().groupby("keyword").count().sort_values(by="index")[:20].plot(ax=axes[0], kind="barh", title='Train', legend=False)
test["keyword"].reset_index().groupby("keyword").count().sort_values(by="index")[:20].plot(ax=axes[1], kind="barh", title='Test', legend=False)


Detecting the language

train['lang'] = train["text"].apply(lambda x: langdetect.detect(x) if 
                                 x.strip() != "" else "")

test['lang'] = test["text"].apply(lambda x: langdetect.detect(x) if 
                                 x.strip() != "" else "")

display(train.head())
display(test.head())


Checking the texts that contain symbols (', !, and so on)

print("train")
display(train[train["text"].str.contains(r'[^\s\w]')].head(3))
display(train[train["text"].str.contains(r'[^\s\w]')].tail(3))
display(train[train["text"].str.contains(r'[^\s\w]')]["text"].count().sum())

print("test")
display(test[test["text"].str.contains(r'[^\s\w]')].head(3))
display(test[test["text"].str.contains(r'[^\s\w]')].tail(3))
display(test[test["text"].str.contains(r'[^\s\w]')]["text"].count().sum())


Reference article

qiita.com

Remove the symbols, convert to lowercase, and strip surrounding whitespace.

train['text_clean'] = train["text"].apply(lambda x: re.sub(r'[^\w\s]',
                                                     '', x).lower().strip())

test['text_clean'] = test["text"].apply(lambda x: re.sub(r'[^\w\s]',
                                                     '', x).lower().strip())

display(train.head())
display(test.head())
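
To see what the regular expression does on a single string, here is a minimal sketch using only the standard library. The function name clean_text and the sample sentence are made up for illustration; the pattern itself is the one used above.

```python
import re

def clean_text(text: str) -> str:
    """Drop every character that is neither a word character nor whitespace,
    then lowercase the result and strip surrounding whitespace."""
    return re.sub(r"[^\w\s]", "", text).lower().strip()

# Hypothetical sample sentence (not taken from the dataset)
sample = "Huge fire downtown!!! Stay safe, everyone. #breaking"
print(clean_text(sample))
```

Note that \w also matches digits and the underscore, so those characters survive the cleaning.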


Tokenization

# Convert the text_clean column to a list
txt_train = train["text_clean"].values.tolist()
txt_test  = test["text_clean"].values.tolist()

# Split each text into words (tokens)
txt_train = [x.split() for x in txt_train]
txt_test = [x.split() for x in txt_test]

# Show the first element of each list
display(txt_train[0])
display(txt_test[0])

['our', 'deeds', 'are', 'the', 'reason', 'of', 'this', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']
['just', 'happened', 'a', 'terrible', 'car', 'crash']

Stop words

# Build the stop word list
nltk.download('stopwords')
lst_stopwords = nltk.corpus.stopwords.words("english")

# Peek at the first stop words
lst_stopwords[:20]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

Removing the stop words

for i in range(len(txt_train)):
    txt_train[i] = [word for word in txt_train[i] if word not in lst_stopwords]
    
for i in range(len(txt_test)):
    txt_test[i] = [word for word in txt_test[i] if word not in lst_stopwords]

display(txt_train[0])
display(txt_test[0])

['deeds', 'reason', 'earthquake', 'may', 'allah', 'forgive', 'us']
['happened', 'terrible', 'car', 'crash']

Before removal

['our', 'deeds', 'are', 'the', 'reason', 'of', 'this', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']
['just', 'happened', 'a', 'terrible', 'car', 'crash']

You can see that words carrying little meaning on their own, such as our, are, and the, have been removed.
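
The filtering step can be sketched on its own without NLTK. The stop word set below is a tiny, hypothetical list made up for illustration (the article uses NLTK's English list), and remove_stopwords is an assumed helper name.

```python
# Hypothetical, tiny stop word set for illustration only;
# the article itself uses nltk.corpus.stopwords.words("english").
STOPWORDS = {"our", "are", "the", "of", "this", "all", "just", "a"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = "our deeds are the reason of this earthquake".split()
print(remove_stopwords(tokens))
```

Storing the stop words in a set rather than a list keeps each membership test fast, which may matter when the corpus is large.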

Reference article

blog.livedoor.jp

Stemming

Example: extracting go from going, and unbelieve from unbelievable

ps = nltk.stem.porter.PorterStemmer()

# Extract the stem of each token
for i in range(len(txt_train)):
    txt_train[i] = [ps.stem(word) for word in txt_train[i]]
    
for i in range(len(txt_test)):
    txt_test[i] = [ps.stem(word) for word in txt_test[i]]

display(txt_train[0])
display(txt_test[0])

['deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'us']
['happen', 'terribl', 'car', 'crash']

Before stemming

['deeds', 'reason', 'earthquake', 'may', 'allah', 'forgive', 'us']
['happened', 'terrible', 'car', 'crash']

deeds has become deed, and happened has become happen.

That said, forgive has also become forgiv and earthquake has become earthquak: a stemmer strips suffixes by rule and does not guarantee that the result is a dictionary word.

Reference article

www.haya-programming.com

Lemmatization

Lemmatization converts a word to the form listed in the dictionary (its lemma).

nltk.download('wordnet')
lem = nltk.stem.wordnet.WordNetLemmatizer()

# Lemmatize each token
for i in range(len(txt_train)):
    txt_train[i] = [lem.lemmatize(word) for word in txt_train[i]]
    
for i in range(len(txt_test)):
    txt_test[i] = [lem.lemmatize(word) for word in txt_test[i]]

display(txt_train[0])
display(txt_test[0])

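['deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'u']
['happen', 'terribl', 'car', 'crash']

Before lemmatization

['deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'us']
['happen', 'terribl', 'car', 'crash']

us has changed to u, because the lemmatizer treats every word as a noun by default and so handles us like a plural.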

Reference article

yottagin.com

Creating features

# Word count
train['word_count'] = train["text"].apply(lambda x: len(str(x).split(" ")))

# Character count (excluding spaces)
train['char_count'] = train["text"].apply(lambda x: sum(len(word) for word in str(x).split(" ")))

# Sentence count
train['sentence_count'] = train["text"].apply(lambda x: len(str(x).split(".")))

# Average word length in characters
train['avg_word_length'] = train['char_count'] / train['word_count']

# Average sentence length in words
train['avg_sentence_length'] = train['word_count'] / train['sentence_count']


# Word count
test['word_count'] = test["text"].apply(lambda x: len(str(x).split(" ")))

# Character count (excluding spaces)
test['char_count'] = test["text"].apply(lambda x: sum(len(word) for word in str(x).split(" ")))

# Sentence count
test['sentence_count'] = test["text"].apply(lambda x: len(str(x).split(".")))

# Average word length in characters
test['avg_word_length'] = test['char_count'] / test['word_count']

# Average sentence length in words
test['avg_sentence_length'] = test['word_count'] / test['sentence_count']

display(train["text"][0])
display(train.head(1))
display(test["text"][0])
display(test.head(1))
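
The same counts can be checked by hand on a single string. The following is a minimal sketch that mirrors the lambdas above in one plain function; text_features and the sample sentence are assumptions made up for illustration.

```python
def text_features(text: str) -> dict:
    """Compute the same count-based features as the column-wise code above."""
    words = text.split(" ")
    word_count = len(words)
    char_count = sum(len(word) for word in words)   # characters, spaces excluded
    sentence_count = len(text.split("."))           # naive: splits on every period
    return {
        "word_count": word_count,
        "char_count": char_count,
        "sentence_count": sentence_count,
        "avg_word_length": char_count / word_count,
        "avg_sentence_length": word_count / sentence_count,
    }

# Hypothetical sample sentence
features = text_features("Just happened. A terrible car crash")
print(features)
```

Because sentences are counted by splitting on ".", a text that ends with a period yields one extra, empty segment, and periods inside URLs or abbreviations are counted as sentence boundaries too. The feature is therefore only a rough proxy.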


Sentiment analysis

TextBlob().sentiment.polarity turns the sentiment of a text into a number between -1.0 (negative) and 1.0 (positive).

textblob.readthedocs.io

train["sentiment"] = train["text"].apply(lambda x: TextBlob(x).sentiment.polarity)
test["sentiment"] = test["text"].apply(lambda x: TextBlob(x).sentiment.polarity)

display(train.head(3))
display(test.head(3))


Annotation

Beforehand, run

python -m spacy download en_core_web_lg

on the command line to download the model. After installing it, restart your development environment.

Reference article

stackoverflow.com

nlp = spacy.load("en_core_web_lg")

# Named entity recognition
train["tags"] = train["text"].apply(lambda x: [(tag.text, tag.label_) 
                                for tag in nlp(x).ents])

test["tags"] = test["text"].apply(lambda x: [(tag.text, tag.label_) 
                                for tag in nlp(x).ents])

display(train.head(3))
display(test.head(3))

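Named entity recognition is a technique for identifying proper nouns such as person and place names, and numerical expressions such as dates and times, that appear in a text.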

hironsan.hatenablog.com

Reference article

qiita.com

Word cloud

corpus = train["text_clean"]

wc = wordcloud.WordCloud(background_color='black', max_words=100, 
                         max_font_size=35)
# Join all cleaned texts into a single string for the word cloud
wc = wc.generate(" ".join(corpus))

fig = plt.figure(figsize=(12.0, 8.0), num=1)
plt.axis('off')
plt.imshow(wc, cmap=None)
plt.show()


Reference article

amueller.github.io
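
A word cloud draws frequent words larger, so it can help to look at the raw word frequencies as well. The following is a rough sketch of that counting step with the standard library only. It is not how the wordcloud package is implemented, and top_words and the sample texts are made up for illustration.

```python
from collections import Counter

def top_words(texts, n=3):
    """Count whitespace-separated words over all texts and return the n most common."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)

# Hypothetical cleaned texts
texts = ["fire near the forest", "forest fire spreading", "flood warning"]
print(top_words(texts, n=2))
```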

Word vectorization (word embeddings)

Word embeddings give an efficient, dense representation in which similar words have similar encodings.

www.tensorflow.org

nlp = gensim_api.load("glove-wiki-gigaword-300")

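Running this starts a 376.1 MB download.

glove-wiki-gigaword-300

Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

GloVe is an unsupervised learning algorithm for obtaining vector representations of words; this model provides vectors of size 300.

The loaded object can be used to map words to vectors.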

github.com

### Choose a word
word = "dance"

labels, X, x, y = [], [], [], []

for t in nlp.most_similar(word, topn=20):
    X.append(nlp[t[0]])
    labels.append(t[0])

gensim.models.Word2Vec.most_similar() finds the top N most similar words.

tedboy.github.io
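
To make "most similar" concrete: similarity between word vectors is typically measured with cosine similarity. The following is a minimal standard-library sketch of that idea, not gensim's implementation. The three-dimensional toy vectors and the helper names are made up for illustration, whereas the real model uses 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings
toy_vectors = {
    "dance":  [0.9, 0.1, 0.0],
    "ballet": [0.8, 0.2, 0.1],
    "music":  [0.6, 0.5, 0.1],
    "tax":    [0.0, 0.1, 0.9],
}
print(most_similar("dance", toy_vectors))
```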

# perplexity must be smaller than the number of samples (20 words here)
pca = manifold.TSNE(perplexity=min(40, len(X) - 1), n_components=2, init='pca')

new_values = pca.fit_transform(X)
for value in new_values:
    x.append(value[0])
    y.append(value[1])

t-SNE is a tool for visualizing high-dimensional data; here it projects the 300-dimensional word vectors onto two dimensions.

scikit-learn.org

bunseki-train.com

## Plot
fig = plt.figure()
for i in range(len(x)):
    plt.scatter(x[i], y[i], c="black")
    plt.annotate(labels[i], xy=(x[i],y[i]), xytext=(5,2), 
               textcoords='offset points', ha='right', va='bottom')


plt.scatter(x=0, y=0, c="red")
plt.annotate(word, xy=(0,0), xytext=(5,2), textcoords='offset points', ha='right', va='bottom')


.annotate() adds a text label to a point, optionally with an arrow.

Reference article

qiita.com

LDA model

LDA is a kind of language model that assumes a single document is made up of multiple topics.

corpus = train["text_clean"]

## Preprocessing: split each text into non-overlapping two-word chunks
lst_corpus = []

for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i + 2]) for i in range(0, 
                     len(lst_words), 2)]
    lst_corpus.append(lst_grams)

## Map each term to an ID
id2word = gensim.corpora.Dictionary(lst_corpus)

## Build the bag-of-words (term frequency) representation of each document
dic_corpus = [id2word.doc2bow(word) for word in lst_corpus] 
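
The idea behind the dictionary and the bag-of-words conversion can be sketched in plain Python. This is an illustration of the concept only, not gensim's implementation: the helper names build_vocab and doc_to_bow and the sample documents are made up, and the IDs gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer ID to each distinct term, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for term in doc:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab

def doc_to_bow(doc, vocab):
    """Turn one document into a sorted list of (term ID, count) pairs."""
    counts = Counter(vocab[term] for term in doc if term in vocab)
    return sorted(counts.items())

# Hypothetical documents, already split into two-word chunks
docs = [["car crash", "car crash", "on highway"],
        ["forest fire", "car crash"]]
vocab = build_vocab(docs)
bow = [doc_to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)
```

Each document ends up as a sparse list of (ID, count) pairs, which is the kind of input the LDA model is trained on.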

lda_model = gensim.models.ldamodel.LdaModel(corpus=dic_corpus, id2word=id2word, num_topics=3, 
                                            random_state=42, update_every=1, chunksize=100, 
                                            passes=10, alpha='auto', per_word_topics=True)

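get_topic_terms(i) returns the most relevant terms of topic i as (term ID, probability) pairs, whereas get_topics() returns the full "number of topics × vocabulary size" matrix.

lst_dics = []
for i in range(0,3):
    # get_topic_terms(i) returns (term ID, probability) pairs for topic i.
    # get_topics() would instead return the "number of topics × vocabulary size" matrix.
    # http://kento1109.hatenablog.com/entry/2017/12/27/114811
    lst_tuples = lda_model.get_topic_terms(i)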
    
    for tupla in lst_tuples:
        lst_dics.append({"topic":i, "id":tupla[0], 
                         "word":id2word[tupla[0]], 
                         "weight":tupla[1]})
        
dtf_topics = pd.DataFrame(lst_dics, 
                         columns=['topic','id','word','weight'])

# Plot
fig, ax = plt.subplots(figsize=(5, 10))
sns.barplot(y="word", x="weight", hue="topic", data=dtf_topics, dodge=False, ax=ax).set_title('Main Topics')
ax.set(ylabel="", xlabel="Word Importance")
plt.show()