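Renting pitfalls? A step-by-step guide to skimming reviews with R and choosing a place scientifically
Produced by Big Data Digest (大數據文摘)
Compiled by: Hope, 臻臻, CoolBoy
Lately, renting has become a real headache for people who have moved to Beijing for work. To land a place you are happy with, you need to act fast and also read what previous tenants have said so that you avoid the worst traps. One programmer applied programming tools to a similar problem, choosing a hotel for a trip: first scraping several thousand reviews from the travel review site TripAdvisor with Python, then running text analysis and sentiment analysis on them in R. The approach is systematic, efficient, and well worth borrowing.
Here is the full, detailed tutorial.
The information on TripAdvisor matters a great deal to travelers making decisions. However, understanding the nuances between TripAdvisor's bubble ratings and thousands of review texts is challenging.
To understand more thoroughly whether guests' reviews influence a hotel's subsequent service, I scraped all English reviews of one hotel on TripAdvisor, the Hilton Hawaiian Village. I will not go into the details of the scraper here.
Python source code: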
https://github.com/susanli2016/NLP-with-Python/blob/master/Web%20scraping%20Hilton%20Hawaiian%20Village%20TripAdvisor%20Reviews.py
Load the packages
- library(dplyr)
- library(readr)
- library(lubridate)
- library(ggplot2)
- library(tidytext)
- library(tidyverse)
- library(stringr)
- library(tidyr)
- library(scales)
- library(broom)
- library(purrr)
- library(widyr)
- library(igraph)
- library(ggraph)
- library(SnowballC)
- library(wordcloud)
- library(reshape2)
- theme_set(theme_minimal())
The dataset
- df <- read_csv("Hilton_Hawaiian_Village_Waikiki_Beach_Resort-Honolulu_Oahu_Hawaii__en.csv")
- df <- df[complete.cases(df), ]
- df$review_date <- as.Date(df$review_date, format = "%d-%B-%y")
- dim(df); min(df$review_date); max(df$review_date)
Figure 1
We obtained 13,701 English reviews of the Hilton Hawaiian Village on TripAdvisor, with review dates ranging from 2002-03-21 to 2018-08-02.
- df %>%
- count(Week = round_date(review_date, "week")) %>%
- ggplot(aes(Week, n)) +
- geom_line() +
- ggtitle('The Number of Reviews Per Week')
Figure 2
Weekly review counts peaked at the end of 2014, when the hotel received 70 reviews in a single week.
Text mining the review text
- df <- tibble::rowid_to_column(df, "ID")
- df <- df %>%
- mutate(review_date = as.POSIXct(review_date, origin = "1970-01-01"),month = round_date(review_date, "month"))
- review_words <- df %>%
- distinct(review_body, .keep_all = TRUE) %>%
- unnest_tokens(word, review_body, drop = FALSE) %>%
- distinct(ID, word, .keep_all = TRUE) %>%
- anti_join(stop_words, by = "word") %>%
- filter(str_detect(word, "[^\\d]")) %>%
- group_by(word) %>%
- mutate(word_total = n()) %>%
- ungroup()
- word_counts <- review_words %>%
- count(word, sort = TRUE)
- word_counts %>%
- head(25) %>%
- mutate(word = reorder(word, n)) %>%
- ggplot(aes(word, n)) +
- geom_col(fill = "lightblue") +
- scale_y_continuous(labels = comma_format()) +
- coord_flip() +
- labs(title = "Most common words in review text 2002 to date",
- subtitle = "Among 13,701 reviews; stop words removed",
- y = "# of uses")
Figure 3
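We can go a step further and merge words with closely related forms, such as "stay" and "stayed" or "pool" and "pools". This step is called stemming: the process of reducing inflected (or derived) words to their stem, base, or root form.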
- word_counts %>%
- head(25) %>%
- mutate(word = wordStem(word)) %>%
- mutate(word = reorder(word, n)) %>%
- ggplot(aes(word, n)) +
- geom_col(fill = "lightblue") +
- scale_y_continuous(labels = comma_format()) +
- coord_flip() +
- labs(title = "Most common words in review text 2002 to date",
- subtitle = "Among 13,701 reviews; stop words removed and stemmed",
- y = "# of uses")
Figure 4
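The stemming step can also be followed by re-aggregation, so that words sharing a stem end up in a single row. The block below is a minimal sketch on a toy table, not the author's code: `toy_counts` and its numbers are made up for illustration, and the default Porter algorithm of `wordStem()` is assumed.

```r
library(dplyr)
library(SnowballC)

# Hypothetical word counts, invented for this illustration
toy_counts <- tibble(
  word = c("stay", "stayed", "pool", "pools", "beach"),
  n    = c(50, 40, 30, 20, 10)
)

# Stem each word, then sum the counts of words that share a stem
stemmed_counts <- toy_counts %>%
  mutate(stem = wordStem(word)) %>%
  group_by(stem) %>%
  summarize(n = sum(n)) %>%
  arrange(desc(n))

stemmed_counts
```

Summing after stemming means "stay" and "stayed" contribute to one bar instead of two rows with the same label.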
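Bigrams
We often want to understand how the words in a review relate to one another. Which word sequences are common in the review text? Given a sequence of words, which word is most likely to follow? Which words are most strongly associated? Many interesting text analyses are built on these relationships. When we examine pairs of two consecutive words, we call them "bigrams".
So, which bigrams are the most common in the Hilton Hawaiian Village reviews?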
- review_bigrams <- df %>%
- unnest_tokens(bigram, review_body, token = "ngrams", n = 2)
- bigrams_separated <- review_bigrams %>%
- separate(bigram, c("word1", "word2"), sep = " ")
- bigrams_filtered <- bigrams_separated %>%
- filter(!word1 %in% stop_words$word) %>%
- filter(!word2 %in% stop_words$word)
- bigram_counts <- bigrams_filtered %>%
- count(word1, word2, sort = TRUE)
- bigrams_united <- bigrams_filtered %>%
- unite(bigram, word1, word2, sep = " ")
- bigrams_united %>%
- count(bigram, sort = TRUE)
Figure 5
The most common bigram is "rainbow tower", followed by "hawaiian village".
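To see the bigram pipeline at a small scale, here is a minimal sketch on two made-up reviews. The `toy` data frame is an assumption for illustration only; the calls mirror the ones used on the real data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two invented reviews, not taken from the dataset
toy <- tibble(
  ID = 1:2,
  review_body = c("The ocean view was great",
                  "Great ocean view from the rainbow tower")
)

toy_bigrams <- toy %>%
  # Split each review into overlapping pairs of consecutive words
  unnest_tokens(bigram, review_body, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop a bigram if either of its words is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

toy_bigrams
```

Because "ocean view" appears in both toy reviews, it sorts to the top with a count of 2.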
We can use a network visualization to display these word pairs:
- review_subject <- df %>%
- unnest_tokens(word, review_body) %>%
- anti_join(stop_words)
- my_stopwords <- tibble(word = as.character(1:10))  # tibble() replaces the deprecated data_frame()
- review_subject <- review_subject %>%
- anti_join(my_stopwords)
- title_word_pairs <- review_subject %>%
- pairwise_count(word, ID, sort = TRUE, upper = FALSE)
- set.seed(1234)
- title_word_pairs %>%
- filter(n >= 1000) %>%
- graph_from_data_frame() %>%
- ggraph(layout = "fr") +
- geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
- geom_node_point(size = 5) +
- geom_node_text(aes(label = name), repel = TRUE,
- point.padding = unit(0.2, "lines")) +
- ggtitle('Word network in TripAdvisor reviews') +
- theme_void()
Figure 6
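The figure above visualizes common word pairs in the TripAdvisor reviews. Each pair shown occurs at least 1,000 times and contains no stop words. Note that `pairwise_count` counts two words appearing in the same review, so these pairs are co-occurrences within a review rather than strictly adjacent bigrams.
In the network graph, the most frequent words ("hawaiian", "village", "ocean", and "view") are strongly connected to one another, but we do not see any clear clustering structure.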
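What `pairwise_count` measures is easiest to see on a tiny example. The sketch below uses a made-up `toy_words` table (an assumption for illustration) with one row per word per review.

```r
library(dplyr)
library(widyr)

# Invented tokens: one row per (review ID, word)
toy_words <- tibble(
  ID   = c(1, 1, 1, 2, 2, 3, 3),
  word = c("ocean", "view", "tower", "ocean", "view", "ocean", "tower")
)

# Count in how many reviews each pair of words appears together;
# upper = FALSE keeps only one ordering of each pair
toy_pairs <- toy_words %>%
  pairwise_count(word, ID, sort = TRUE, upper = FALSE)

toy_pairs
```

Here "ocean" and "view" share two reviews, so their pair count is 2 even though nothing requires the two words to be adjacent in the text.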
Trigrams
Bigrams are sometimes not enough, so let's look at the most common trigrams in the TripAdvisor reviews of the Hilton Hawaiian Village.
- review_trigrams <- df %>%
- unnest_tokens(trigram, review_body, token = "ngrams", n = 3)
- trigrams_separated <- review_trigrams %>%
- separate(trigram, c("word1", "word2", "word3"), sep = " ")
- trigrams_filtered <- trigrams_separated %>%
- filter(!word1 %in% stop_words$word) %>%
- filter(!word2 %in% stop_words$word) %>%
- filter(!word3 %in% stop_words$word)
- trigram_counts <- trigrams_filtered %>%
- count(word1, word2, word3, sort = TRUE)
- trigrams_united <- trigrams_filtered %>%
- unite(trigram, word1, word2, word3, sep = " ")
- trigrams_united %>%
- count(trigram, sort = TRUE)
Figure 7
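The most common trigram is "hilton hawaiian village", followed by "diamond head tower", and so on.
Trends of key words in the reviews
Which words or topics have become more common, or rarer, over time? This information can hint at changes the hotel has made, for example in service, renovations, or problem solving. We can also anticipate which topics will be mentioned more often.
We want to answer questions like this: which words have appeared more and more frequently in TripAdvisor reviews over time?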
- reviews_per_month <- df %>%
- group_by(month) %>%
- summarize(month_total = n())
- word_month_counts <- review_words %>%
- filter(word_total >= 1000) %>%
- count(word, month) %>%
- complete(word, month, fill = list(n = 0)) %>%
- inner_join(reviews_per_month, by = "month") %>%
- mutate(percent = n / month_total) %>%
- mutate(year = year(month) + yday(month) / 365)
- mod <- ~ glm(cbind(n, month_total - n) ~ year, ., family = "binomial")
- slopes <- word_month_counts %>%
- nest(data = -word) %>%
- mutate(model = map(data, mod)) %>%
- mutate(tidied = map(model, tidy)) %>%
- unnest(tidied) %>%
- filter(term == "year") %>%
- arrange(desc(estimate))
- slopes %>%
- head(9) %>%
- inner_join(word_month_counts, by = "word") %>%
- mutate(word = reorder(word, -estimate)) %>%
- ggplot(aes(month, n / month_total, color = word)) +
- geom_line(show.legend = FALSE) +
- scale_y_continuous(labels = percent_format()) +
- facet_wrap(~ word, scales = "free_y") +
- expand_limits(y = 0) +
- labs(x = "Year",
- y = "Percentage of reviews containing this word",
- title = "9 fastest growing words in TripAdvisor reviews",
- subtitle = "Judged by growth rate over 15 years")
Figure 8
Before 2010, discussion peaked around "friday fireworks" and "lagoon". Before 2005, words such as "resort fee" and "busy" grew the fastest.
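The ranking behind these plots comes from the `year` coefficient of a binomial regression fitted per word. The following is a minimal base R sketch of that idea on made-up monthly counts; `toy` and `fit_slope` are hypothetical names introduced only for this illustration.

```r
# Invented data: out of 200 reviews per period, how many mention a word
toy <- data.frame(
  year        = 2010:2017,
  month_total = rep(200, 8),
  growing     = c(10, 14, 19, 25, 32, 40, 49, 60),
  shrinking   = c(60, 49, 40, 32, 25, 19, 14, 10)
)

# Fit successes/failures against time and return the slope on `year`
# (the change in log-odds of a review containing the word, per year)
fit_slope <- function(n, total, year) {
  fit <- glm(cbind(n, total - n) ~ year, family = "binomial")
  unname(coef(fit)["year"])
}

slope_growing   <- fit_slope(toy$growing, toy$month_total, toy$year)
slope_shrinking <- fit_slope(toy$shrinking, toy$month_total, toy$year)

c(growing = slope_growing, shrinking = slope_shrinking)
```

A positive slope marks a word whose share of reviews rises over time and a negative slope marks one that fades, which is why sorting by `estimate` separates the fastest growing words from the fastest shrinking ones.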
Which words are becoming less frequent in the reviews?
- slopes %>%
- tail(9) %>%
- inner_join(word_month_counts, by = "word") %>%
- mutate(word = reorder(word, estimate)) %>%
- ggplot(aes(month, n / month_total, color = word)) +
- geom_line(show.legend = FALSE) +
- scale_y_continuous(labels = percent_format()) +
- facet_wrap(~ word, scales = "free_y") +
- expand_limits(y = 0) +
- labs(x = "Year",
- y = "Percentage of reviews containing this term",
- title = "9 fastest shrinking words in TripAdvisor reviews",
- subtitle = "Judged by growth rate over 15 years")
Figure 9
This figure shows topics that have faded since 2010. These words include "hhv" (which I take to be short for Hilton Hawaiian Village), "breakfast", "upgraded", "prices", and "free".
Let's compare a few words.
- word_month_counts %>%
- filter(word %in% c("service", "food")) %>%
- ggplot(aes(month, n / month_total, color = word)) +
- geom_line(size = 1, alpha = .8) +
- scale_y_continuous(labels = percent_format()) +
- expand_limits(y = 0) +
- labs(x = "Year",
- y = "Percentage of reviews containing this term", title = "service vs food in terms of reviewers interest")
Figure 10
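Before 2010, both service and food were hot topics. Discussion of service and food peaked in 2003 and has been trending down since 2005, with only occasional rebounds.
Sentiment analysis
Sentiment analysis is widely applied to reviews, surveys, and web and social media text to capture how customers feel, in areas ranging from marketing and customer service to clinical medicine.
In this case, our goal is to analyze the attitude of the reviewers (that is, hotel guests) toward the hotel after their stay. That attitude may be a judgment or an evaluation.
First, let's look at the positive and negative words that appear most frequently in the reviews.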
- reviews <- df %>%
- filter(!is.na(review_body)) %>%
- select(ID, review_body) %>%
- group_by(row_number()) %>%
- ungroup()
- tidy_reviews <- reviews %>%
- unnest_tokens(word, review_body)
- tidy_reviews <- tidy_reviews %>%
- anti_join(stop_words)
- bing_word_counts <- tidy_reviews %>%
- inner_join(get_sentiments("bing")) %>%
- count(word, sentiment, sort = TRUE) %>%
- ungroup()
- bing_word_counts %>%
- group_by(sentiment) %>%
- top_n(10) %>%
- ungroup() %>%
- mutate(word = reorder(word, n)) %>%
- ggplot(aes(word, n, fill = sentiment)) +
- geom_col(show.legend = FALSE) +
- facet_wrap(~sentiment, scales = "free") +
- labs(y = "Contribution to sentiment", x = NULL) +
- coord_flip() +
- ggtitle('Words that contribute to positive and negative sentiment in the reviews')
Figure 11
Let's switch to a different sentiment lexicon and see whether the results are the same.
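- # Newer tidytext/textdata releases name the AFINN score column `value`;
- # setNames() keeps the `score` name used below in either case.
- contributions <- tidy_reviews %>%
- inner_join(setNames(get_sentiments("afinn"), c("word", "score")), by = "word") %>%
- group_by(word) %>%
- summarize(occurrences = n(),
- contribution = sum(score))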
- contributions %>%
- top_n(25, abs(contribution)) %>%
- mutate(word = reorder(word, contribution)) %>%
- ggplot(aes(word, contribution, fill = contribution > 0)) +
- ggtitle('Words with the greatest contributions to positive/negative\nsentiment in reviews') +
- geom_col(show.legend = FALSE) +
- coord_flip()
Figure 12
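Interestingly, "diamond" (as in "Diamond Head") is classified as a positive word.
There is a potential problem here: whether a word such as "clean" is positive depends on its context. If it is preceded by "not", it expresses a negative sentiment. In fact, unigrams run into this problem whenever a negation word such as "not" is present, which brings us to the next topic:
Using bigrams to provide context in sentiment analysis
We want to know which words are most often preceded by "not".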
- bigrams_separated %>%
- filter(word1 == "not") %>%
- count(word1, word2, sort = TRUE)
Figure 13
"not a" occurs 850 times and "not the" occurs 698 times. These results are not very informative, though.
- AFINN <- setNames(get_sentiments("afinn"), c("word", "score"))  # newer releases name the score column `value`
- not_words <- bigrams_separated %>%
- filter(word1 == "not") %>%
- inner_join(AFINN, by = c(word2 = "word")) %>%
- count(word2, score, sort = TRUE) %>%
- ungroup()
- not_words
Figure 14
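The analysis above tells us that the most common sentiment word following "not" is "worth", followed by "recommend". Both are treated as positive words, each with a score of 2.
So in our data, which words are most likely to be misread as the opposite sentiment?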
- not_words %>%
- mutate(contribution = n * score) %>%
- arrange(desc(abs(contribution))) %>%
- head(20) %>%
- mutate(word2 = reorder(word2, contribution)) %>%
- ggplot(aes(word2, n * score, fill = n * score > 0)) +
- geom_col(show.legend = FALSE) +
- xlab("Words preceded by \"not\"") +
- ylab("Sentiment score * number of occurrences") +
- ggtitle('The 20 words preceded by "not" that had the greatest contribution to\nsentiment scores, positive or negative direction') +
- coord_flip()
Figure 15
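The bigrams "not worth", "not great", "not good", "not recommend", and "not like" are the largest sources of misidentification, making the reviews look far more positive than they really are.
Besides "not", other negation words also reverse the sentiment of what follows, such as "no", "never", and "without". Let's take a closer look.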
- negation_words <- c("not", "no", "never", "without")
- negated_words <- bigrams_separated %>%
- filter(word1 %in% negation_words) %>%
- inner_join(AFINN, by = c(word2 = "word")) %>%
- count(word1, word2, score, sort = TRUE) %>%
- ungroup()
- negated_words %>%
- mutate(contribution = n * score,
- word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
- group_by(word1) %>%
- top_n(12, abs(contribution)) %>%
- ggplot(aes(word2, contribution, fill = n * score > 0)) +
- geom_col(show.legend = FALSE) +
- facet_wrap(~ word1, scales = "free") +
- scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
- xlab("Words preceded by negation term") +
- ylab("Sentiment score * # of occurrences") +
- ggtitle('The most common positive or negative words to follow negations\nsuch as "no", "not", "never" and "without"') +
- coord_flip()
Figure 16
The largest sources of words wrongly counted as positive are "not worth/great/good/recommend", while the largest sources of words wrongly counted as negative are "not bad" and "no problem".
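One simple way to act on this finding is to flip the sign of a word's score whenever the preceding word is a negation. This is a sketch of that idea and not a step from the original analysis: `toy_lexicon` is a small made-up lexicon standing in for AFINN, and `toy_reviews` is invented.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical mini-lexicon and reviews, for illustration only
toy_lexicon <- tibble(
  word  = c("good", "worth", "bad", "problem"),
  score = c(3, 2, -3, -2)
)
negation_words <- c("not", "no", "never", "without")
toy_reviews <- tibble(
  ID = 1:2,
  review_body = c("not worth the price and not good",
                  "no problem at all, not bad")
)

adjusted <- toy_reviews %>%
  unnest_tokens(bigram, review_body, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Score the second word of each bigram; note that the very first
  # word of a review is never a second word, so it is not scored here
  inner_join(toy_lexicon, by = c(word2 = "word")) %>%
  # Reverse the score when the first word is a negation
  mutate(adjusted_score = if_else(word1 %in% negation_words, -score, score)) %>%
  group_by(ID) %>%
  summarize(naive = sum(score), adjusted = sum(adjusted_score))

adjusted
```

In this toy case the naive score rates the first review as positive and the second as negative, while the negation-aware score reverses both, matching how a person would read them.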
Finally, let's look at the most positive and the most negative reviews.
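- # setNames() keeps the `score` column name even where AFINN ships it as `value`
- sentiment_messages <- tidy_reviews %>%
- inner_join(setNames(get_sentiments("afinn"), c("word", "score")), by = "word") %>%
- group_by(ID) %>%
- summarize(sentiment = mean(score),
- words = n()) %>%
- ungroup() %>%
- filter(words >= 5)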
- sentiment_messages %>%
- arrange(desc(sentiment))
Figure 17
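The most positive review is the record with ID 2363, which says, in essence: "Wow, wow, wow, this place is wonderful! We had a beautiful view from our room and really enjoyed our stay. Hilton hotels are simply great! Whether you are a kid or an adult, this hotel has everything you could want."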
- df[ which(df$ID==2363), ]$review_body[1]
Figure 18
- sentiment_messages %>%
- arrange(sentiment)
Figure 19
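The most negative review is the record with ID 3748, which says, in essence: "Stayed 5 nights (May 12 to May 17, 2016). On the first night we found a broken floor tile, and the kids were playing with it with their fingers. On the second night we saw small cockroaches crawling on the children's food. The front desk offered us another room, but told us we had to move within one hour or we could not change rooms... It was already 11 p.m., we were exhausted, and the kids were asleep. We turned the offer down. At checkout, the front desk clerk told me that cockroaches are very common in their hotel, and even asked me whether I never see cockroaches in California. I did not expect to run into this at a Hilton."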
- df[ which(df$ID==3748), ]$review_body[1]
Figure 20
GitHub source code:
https://github.com/susanli2016/Data-Analysis-with-R/blob/master/Text%20Mining%20Hilton%20Hawaiian%20Village%20TripAdvisor%20Reviews.Rmd
Related article:
https://towardsdatascience.com/scraping-tripadvisor-text-mining-and-sentiment-analysis-for-hotel-reviews-cc4e20aef333
[This is an original article by Big Data Digest (大數據文摘), a contributing organization on 51CTO. WeChat official account: "大數據文摘 (id: BigDataDigest)"]