We analyzed every line of dialogue in the Avengers series to see what the heroes love to say. No spoilers!

Author: 文摘菌 2019-04-28 06:27:25



Produced by 大數據文摘 (Big Data Digest)

Authors: 蔣寶尚, 魏子敏

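Avengers 4 has finally hit theaters. According to friends who have already seen it, this film built on ten years of memories runs three hours without a single dull moment and is packed with classic callbacks from start to finish.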

Too busy with work to have seen the movie yet, and terrified of spoilers: that has been 文摘菌's state for the past two days.

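An anxious 文摘菌 also spent this week revisiting the series. The film is the finale of the Avengers series, and its success owes a great deal to standalone films such as Iron Man, Captain America, Thor and Hulk, which built its grand universe and planted the same love of heroes in audiences around the world.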

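The worldwide popularity of the Avengers series is no accident. Although the films feature a large cast, each hero is so distinctively drawn that audiences remember them after a single viewing.

Dialogue is arguably what shapes a character's personality most. So 文摘菌 set out to use data analysis to visualize the words that the heroes of the Marvel universe like to use, read their character traits from the results, and pay tribute to this great series.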

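For this analysis we mainly used R, with the goal of finding the words that best represent each hero. The data comes from three representative scripts in which many Marvel heroes appear together: The Avengers (the one where they fight Loki), Avengers: Age of Ultron, and Captain America: Civil War.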

Before the code, let's look at the results.

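1. Captain America: calling you by your name, Iron Man!

As the big brother of the team, Captain America loves to call people by name. The name he says most often is Iron Man's, and he also frequently names Sam and Strucker.

Captain America and Iron Man are the two leads of the Avengers series who are as close as they are at odds. Both are leaders in the films, but they pursue very different things. In Captain America: Civil War, the Avengers fall apart completely and split into two camps, one behind Captain America and one behind Iron Man.

Captain America fights for his good friend the Winter Soldier, while Iron Man fights to maintain world order and to avenge his parents. The two are both friends and evenly matched rivals, which may explain why Captain America keeps saying Iron Man's name.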

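2. Black Panther: the noble who loves grandiose topics

According to the results, Black Panther's favorite words are father, friend, king and other rather grandiose-sounding terms.

Black Panther's father T'Chaka, the previous Black Panther, king of Wakanda and guardian of its vibranium, was his son's idol, but he dies in a conspiracy in the film. As the young heir to the Wakandan throne, Black Panther takes up his father's legacy as his own dream and protects Wakanda. A king's status and the pursuit of an ideal: that is why Black Panther likes to talk about these noble topics.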

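3. Spider-Man: I'm still just a "baby".

As the "kid" of the team, Spider-Man has consistently childlike lines in the Avengers films. The words he says most are "hey", "uh" and "um".

In these three films Spider-Man is only a teenager, and if he did not bounce around in front of so many big names he would have even less of a presence ☺.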

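4. Hulk and Hawkeye: everybody loves Black Widow.

The visualization shows that both Hulk and Hawkeye mention Black Widow a lot.

The reason Hulk likes to talk with Black Widow is simple. When Hulk goes on a rampage, Black Widow looks at his huge frame with caring eyes and says, "Hey, big guy, the sun's getting real low." Then she slowly raises her hand, reaches her fingers toward Hulk's arm, and gently strokes down it. That is when Hulk's rage subsides!

In the films Black Widow and Hawkeye are not a couple; their relationship stays ambiguous, more than friends but less than lovers. Still, the two share a history that predates the Avengers. After so many battles, as both old friends and comrades in arms, perhaps they have long harbored feelings for each other.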

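5. Vision and Scarlet Witch: kindred spirits, serving up sweetness on screen!

The visualization shows that Vision and Scarlet Witch are definitely soulmates: they talk about the same things, and both especially like topics of fear and worry!

For Scarlet Witch, the reason can be found in her childhood experiences and the inhuman treatment she endured. Vision, as a super artificial intelligence, can see "pictures" that other heroes cannot, and his worry about the future may be what troubles him.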

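6. Thor: with great power comes great responsibility, and a longer view

As the god of thunder, Thor has many magical abilities: controlling storms, releasing or directing lightning, turning lightning energy into armor, changing the weather in an instant, flying with his hammer, and summoning the hammer back to his hand. He also fights his enemies with lightning.

Beyond his formidable fighting strength, Thor has interstellar-level knowledge, for example of the stones, of piloting all kinds of starships, of the Groot language (the language of Groot's species), and of wormholes.

Perhaps because great power brings great responsibility, he sees further than the other heroes. In the films he pays more attention to the objects that drive the plot, such as Loki's scepter and the Mind Stone.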

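7. Loki: chasing power.

Loki grew up alongside Thor. He has always coveted the throne of the king of the gods and does not believe Thor would make a worthy heir. Ambitious and eager to rule, he schemes against his brother, defies his parents, treats all life as worthless, and will do anything to get what he wants.

In short, he really wants power!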

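8. Ultron: more in love with "poetry and faraway places".

Ultron was created to keep the peace, but something went wrong the moment he was born: he concluded that peace requires eliminating humanity and the Avengers. So he steals Loki's scepter (the Mind Stone), obtains a large amount of vibranium from Ulysses Klaue, and manipulates Helen Cho into using the Regeneration Cradle to build Vision's body, all in order to evolve into something stronger.

In other words, Ultron was branded with the mission of keeping the peace from birth. He sees the problem from a different angle, but he shares a mission with the Avengers. That is why he longs for poetry and faraway places!

In each character's chart, the length of a bar corresponds to how strongly that hero tends to use the word.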

Visualization process

Finally, having gone through the characters, let's walk through the whole visualization process.

Load the R packages:

library(dplyr) 
library(grid) 
library(gridExtra) 
library(ggplot2) 
library(reshape2) 
library(cowplot) 
library(jpeg) 
library(extrafont)

Clear everything from the R workspace:

rm(list = ls())

Set the folder that contains all the images (modify the path to suit your own setup):

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers" 
setwd(dir_images)

Set the font (windowsFonts() is only available on Windows):

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))

Simplified versions of the hero names:

character_names <- c("black_panther","black_widow","bucky","captain_america", 
                    "falcon","hawkeye","hulk","iron_man", 
                    "loki","nick_fury","rhodey","scarlet_witch", 
                    "spiderman","thor","ultron","vision") 
image_filenames <- paste0(character_names, ".jpg")

Read all the images into a list; readJPEG() comes from the jpeg package loaded above.

all_images <- lapply(image_filenames, readJPEG)

Assign the character names to the image list so that it can be indexed by name.

names(all_images) <- character_names

For example:

# clear the plot window 
grid.newpage() 
# draw to the plot window 
grid.draw(rasterGrob(all_images[['vision']]))
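The double brackets in all_images[['vision']] matter here. As a small base-R sketch, with two made-up matrices standing in for the image data (toy_images and its contents are hypothetical), single brackets return a one-element list, while double brackets return the element itself, which is the form passed to rasterGrob() above:

```r
# Two made-up matrices stand in for the image data
toy_images <- list(matrix(0, nrow = 2, ncol = 2),
                   matrix(1, nrow = 3, ncol = 3))
names(toy_images) <- c("thor", "vision")

# Single brackets keep the list wrapper
as_sublist <- toy_images["vision"]
# Double brackets extract the element itself
as_element <- toy_images[["vision"]]

print(class(as_sublist))
print(dim(as_element))
```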


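Getting the text data

The data was collected by computer scientist Elle O'Brien, who used text-mining techniques to analyze the movie scripts. The code below assumes it has already been loaded into a data frame named word_data with the columns Speaker, word and amount; the article does not show that step.

Correct the capitalization of proper nouns: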

capitalize <- Vectorize(function(string){ 
  substr(string,1,1) <- toupper(substr(string,1,1)) 
  return(string) 
}) 
proper_noun_list <- c("clint","hydra","steve","tony", 
                      "sam","stark","strucker","nat","natasha", 
                      "hulk","tesseract", "vision", 
                      "loki","avengers","rogers", "cap", "hill") 
# Run the capitalization function 
word_data <- word_data %>% 
  mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>% 
  mutate(word = ifelse(word == "jarvis", "JARVIS", word))
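To see what the capitalization step does, here is a minimal sketch on a made-up data frame. The toy_words data and the toy_proper_nouns list are invented for illustration; only the capitalize() helper and the ifelse() pattern come from the code above. It uses base R instead of dplyr so that it runs on its own.

```r
# Same helper as above: upper-case the first letter of each string
capitalize <- Vectorize(function(string){
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  return(string)
})

# Hypothetical toy data, not the real word_data
toy_words <- data.frame(word = c("tony", "shield", "jarvis", "sam"),
                        stringsAsFactors = FALSE)
toy_proper_nouns <- c("tony", "sam")

# Capitalize only the words found in the proper-noun list
toy_words$word <- ifelse(toy_words$word %in% toy_proper_nouns,
                         capitalize(toy_words$word),
                         toy_words$word)
# Write the acronym fully in upper case
toy_words$word <- ifelse(toy_words$word == "jarvis", "JARVIS", toy_words$word)

print(toy_words$word)
```

Words that are not in the list, such as "shield" here, pass through unchanged, so the list has to name every proper noun that should be capitalized.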

Note that the simplified character names defined earlier do not match the speaker names stored in the text data frame.

unique(word_data$Speaker) 
##  [1] "Black Panther"   "Black Widow"     "Bucky" 
##  [4] "Captain America" "Falcon"          "Hawkeye" 
##  [7] "Hulk"            "Iron Man"        "Loki" 
## [10] "Nick Fury"       "Rhodey"          "Scarlet Witch" 
## [13] "Spiderman"       "Thor"            "Ultron" 
## [16] "Vision"

Create a lookup table that converts the simplified file names into character names.

character_labeler <- c(`black_panther` = "Black Panther", 
                       `black_widow` = "Black Widow", 
                       `bucky` = "Bucky", 
                       `captain_america` = "Captain America", 
                       `falcon` = "Falcon", `hawkeye` = "Hawkeye", 
                       `hulk` = "Hulk", `iron_man` = "Iron Man", 
                       `loki` = "Loki", `nick_fury` = "Nick Fury", 
                       `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch", 
                       `spiderman`="Spiderman", `thor`="Thor", 
                       `ultron` ="Ultron", `vision` ="Vision")

There are two versions of each character name: one for display (pretty) and one for indexing (simple).

convert_pretty_to_simple <- Vectorize(function(pretty_name){ 
  # pretty_name = "Vision" 
  simple_name <- names(character_labeler)[character_labeler==pretty_name] 
  # simple_name <- as.vector(simple_name) 
  return(simple_name) 
 
}) 
# convert_pretty_to_simple(c("Vision","Thor")) 
# just for fun, the inverse of that function 
convert_simple_to_pretty <- function(simple_name){ 
  # simple_name = "vision" 
  pretty_name <- character_labeler[simple_name] %>% as.vector() 
  return(pretty_name) 
} 
# example 
convert_simple_to_pretty(c("vision","black_panther"))

## [1] "Vision"        "Black Panther"

Add the simplified character names to the text data frame.

word_data$character <- convert_pretty_to_simple(word_data$Speaker)
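The lookup idea can be tried in isolation. The sketch below is an alternative formulation, not the article's code: toy_labeler is a hypothetical three-entry table, and the reverse lookup uses match() rather than a Vectorize() wrapper.

```r
# Hypothetical three-entry lookup table (simple name -> display name)
toy_labeler <- c(iron_man = "Iron Man", thor = "Thor", vision = "Vision")

# simple -> pretty: index the named vector by name, then drop the names
simple_to_pretty <- function(simple_name) {
  unname(toy_labeler[simple_name])
}

# pretty -> simple: find each display name's position with match()
pretty_to_simple <- function(pretty_name) {
  names(toy_labeler)[match(pretty_name, toy_labeler)]
}

print(simple_to_pretty(c("vision", "iron_man")))
print(pretty_to_simple(c("Thor", "Iron Man", "Thor")))
```

Because match() is already vectorized, this variant returns a plain, unnamed character vector, and a display name that is missing from the table comes back as NA.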

Assign a primary color to each character:

character_palette <- c(`black_panther` = "#51473E", 
                       `black_widow` = "#89B9CD", 
                       `bucky` = "#6F7279", 
                       `captain_america` = "#475D6A", 
                       `falcon` = "#863C43", `hawkeye` = "#84707F", 
                       `hulk` = "#5F5F3F", `iron_man` = "#9C2728", 
                       `loki` = "#3D5C25", `nick_fury` = "#838E86", 
                       `rhodey` = "#38454E",`scarlet_witch` ="#620E1B", 
                       `spiderman`="#A23A37", `thor`="#323D41", 
                       `ultron` ="#64727D", `vision` ="#81414F" )

Draw the bar chart:

avengers_bar_plot <- word_data %>% 
  group_by(Speaker) %>% 
  top_n(5, amount) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, amount)) %>% 
  ggplot(aes(x = word, y = amount, fill = character))+ 
  geom_bar(stat = "identity", show.legend = FALSE)+ 
  scale_fill_manual(values = character_palette)+ 
  scale_y_continuous(name ="Log Odds of Word", 
                     breaks = c(0,1,2)) + 
  theme(text = element_text(family = "Franklin"), 
        # axis.title.x = element_text(size = rel(1.5)), 
        panel.grid = element_line(colour = NULL), 
        panel.grid.major.y = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill = "white", 
                                    colour = "white"))+ 
  # theme(strip.text.x = element_text(size = rel(1.5)))+ 
  xlab("")+ 
  coord_flip()+ 
  facet_wrap(~Speaker, scales = "free_y") 
avengers_bar_plot

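This already looks good, but it can look even better, for example by letting each character's picture show through the bars. The approach is to cover the picture completely with transparent bars and then draw white bars inward from the far end; note that bars can mask the image.

In the data frame, pad each value with the remainder needed to reach the overall maximum, so that a value and its remainder always add up to a bar of the same total length.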

max_amount <- max(word_data$amount) 
word_data$remainder <- (max_amount - word_data$amount) + 0.2
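A quick worked example of that arithmetic, on three invented log-odds values (toy_amount is hypothetical, not taken from word_data):

```r
# Hypothetical log-odds values for three words
toy_amount <- c(2.0, 1.5, 0.5)

# Same arithmetic as above: pad each value up to the maximum,
# plus the 0.2 offset used in the article
toy_max <- max(toy_amount)
toy_remainder <- (toy_max - toy_amount) + 0.2

# Each stacked bar (amount + remainder) ends up with the same total length
toy_total <- toy_amount + toy_remainder
print(data.frame(amount = toy_amount,
                 remainder = toy_remainder,
                 total = toy_total))
```

Every total comes out as the maximum plus 0.2, so even the longest bar keeps a short remainder segment at its end.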

Keep only the top five words for each hero.

word_data_top5 <- word_data %>% 
  group_by(character) %>% 
  arrange(desc(amount)) %>% 
  slice(1:5) %>% 
  ungroup()

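Reshape "amount" and "remainder" into long format:

Each word for each character needs two bar segments: one that shows the amount, and one that fills the rest of the bar up to its end.

This collapses "amount" and "remainder" into one column named "variable", which indicates which of the two a row holds, and a second column, "value", which holds the number.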

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))
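To make the reshaping concrete, here is a minimal sketch on a made-up two-row data frame, assuming the reshape2 package is installed. toy_wide and its numbers are invented; only the melt() call mirrors the one above.

```r
library(reshape2)

# Hypothetical two-row data frame in "wide" form
toy_wide <- data.frame(word = c("king", "father"),
                       amount = c(2.0, 1.5),
                       remainder = c(0.2, 0.7),
                       stringsAsFactors = FALSE)

# Collapse the two numeric columns into variable/value pairs;
# every other column (here: word) is kept as an identifier
toy_long <- melt(toy_wide, measure.vars = c("amount", "remainder"))
print(toy_long)
```

Two rows with two measured columns become four rows, one per word and segment, which is the shape a stacked bar chart needs.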

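Put these bar segments in an ordered factor whose level order is the reverse of the one produced by the melt. Otherwise "amount" and "remainder" would appear in the wrong order on the plot.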

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable, 
 
                                     levels = rev(levels(word_data_top5_m$variable)))
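The level reversal can be checked in isolation with base R. toy_variable below is a made-up factor with the same two levels that melt() produces; re-declaring it with reversed levels changes only the level order, not the values stored in each row:

```r
# Made-up factor with the level order that melt() produces
toy_variable <- factor(c("amount", "remainder", "amount", "remainder"),
                       levels = c("amount", "remainder"))

# Re-declare the factor with the levels reversed
toy_variable2 <- factor(toy_variable, levels = rev(levels(toy_variable)))

print(levels(toy_variable))
print(levels(toy_variable2))
# The row values themselves are unchanged
print(as.character(toy_variable2))
```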

Show only five words per character

Mind which version of the character name is used: pass "black_panther", not "Black Panther".

plot_char <- function(character_name){ 
  # example: character_name = "black_panther" 
  # plot details that we might want to fiddle with 
  # thickness of lines between bars 
  bar_outline_size <- 0.5 
  # transparency of lines between bars 
  bar_outline_alpha <- 0.25 
  # 
  # The function takes the simple character name, 
  # but here, we convert it to the pretty name, 
  # because we'll want to use that on the plot. 
  pretty_character_name <- convert_simple_to_pretty(character_name) 
 
  # Get the image for this character, 
  # from the list of all images. 
  temp_image <- all_images[character_name] 
  # Make a data frame for only this character 
  temp_data <- word_data_top5_m %>% 
    dplyr::filter(character == character_name) %>% 
    mutate(character = character_name) 
  # order the words by frequency 
  # First, make an ordered vector of the most common words 
  # for this character 
    ordered_words <- temp_data %>% 
      mutate(word = as.character(word)) %>% 
      dplyr::filter(variable == "amount") %>% 
      arrange(value) %>% 
      `[[`(., "word") 
 
    # order the words in a factor, 
    # so that they plot in this order, 
    # rather than alphabetical order 
    temp_data$word = factor(temp_data$word, levels = ordered_words) 
 
  # Get the max value, 
  # so that the image scales out to the end of the longest bar 
  max_value <- max(temp_data$value) 
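  # "amount" stays transparent so the image shows through; 
  # "remainder" is white so it masks the rest of the image 
  fill_colors <- c(`amount` = "transparent", `remainder` = "white") 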
  # Make a grid object out of the character's image 
  character_image <- rasterGrob(all_images[[character_name]], 
                                width = unit(1,"npc"), 
                                height = unit(1,"npc")) 
 
  # make the plot for this character 
  output_plot <- ggplot(temp_data)+ 
    aes(x = word, y = value, fill = variable2)+ 
    # add image 
    # draw it completely bottom to top (x), 
    # and completely from left to the the maximum log-odds value (y) 
 
    # note that x and y are flipped here, 
    # in prep for the coord_flip() 
    annotation_custom(character_image, 
                      xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) + 
    geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha), 
             size = bar_outline_size, width = 1)+ 
 
    scale_fill_manual(values = fill_colors)+ 
    theme_classic()+ 
    coord_flip(expand = FALSE)+ 
    # use a facet strip, 
    # to serve as a title, but with color 
    facet_grid(. ~ character, labeller = labeller(character = character_labeler))+ 
    # figure out color swatch for the facet strip fill 
    # using character name to index the color palette 
    # color= NA means there's no outline color. 
    theme(strip.background = element_rect(fill = character_palette[character_name], 
                                          color = NA))+ 
    # other theme elements 
    theme(strip.text.x = element_text(size = rel(1.15), color = "white"), 
          text = element_text(family = "Franklin"), 
          legend.position = "none", 
          panel.grid = element_blank(), 
          axis.text.x = element_text(size = rel(0.8)))+ 
    # omit the axis title for the individual plot, 
    # because we'll have one for the entire ensemble 
    theme(axis.title = element_blank()) 
  return(output_plot) 
}

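What does the plot for a single character look like?

# The article never defines plot_x_axis_text; as an assumption, 
# reuse the axis name from the first bar chart 
plot_x_axis_text <- "Log Odds of Word" 
sample_plot <- plot_char("black_panther")+ 
  theme(axis.title = element_text())+ 
  # x lab is still declared as y lab 
  # because of coord_flip() 
  ylab(plot_x_axis_text) 
sample_plot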

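Why is the horizontal axis so unusual? Its unit is log odds, which is not a linear scale: as the value grows, each further step corresponds to a smaller gain in probability, so the scale has to be converted to be read properly.

The conversion looks like this: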

logit2prob <- function(logit){ 
  odds <- exp(logit) 
  prob <- odds / (1 + odds) 
  return(prob) 
}

...and this is what the axis values look like as probabilities:

logit2prob(seq(0, 2.5, 0.5))

## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418

Note how the differences between consecutive items in that list shrink:

diff(logit2prob(seq(0, 2.5, 0.5)))

## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474
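As a cross-check, and not part of the original analysis, base R already ships this pair of conversions: plogis() is the logistic function and qlogis() is its inverse. The sketch below compares them with logit2prob() on the same sequence:

```r
# Same conversion as above
logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}

logits <- seq(0, 2.5, 0.5)

# plogis() should agree with logit2prob()
same_as_plogis <- all.equal(logit2prob(logits), plogis(logits))
# qlogis() goes the other way, from probability back to log odds
round_trip <- all.equal(qlogis(logit2prob(logits)), logits)

print(same_as_plogis)
print(round_trip)
```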

With the details settled, on to the next step: apply the function defined above to every character in the list and collect all the results in a list.

all_plots <- lapply(character_names, plot_char)

Extract the axis title from a plot

get_axis_grob <- function(plot_to_pick, which_axis){ 
  # plot_to_pick <- sample_plot 
  tmp <- ggplot_gtable(ggplot_build(plot_to_pick)) 
  # tmp$grobs 
  # find the grob that looks like 
  # it would be the x axis 
  axis_x_index <- which(sapply(tmp$grobs, function(x){ 
    # for all the grobs, 
    # return the index of the one 
    # where you can find the text 
    # "axis.title.x" or "axis.title.y" 
    # based on input argument `which_axis` 
    grepl(paste0("axis.title.",which_axis), x)} 
  )) 
  axis_grob <- tmp$grobs[[axis_x_index]] 
  return(axis_grob) 
}

Extract the axis titles:

px_axis_x <- get_axis_grob(sample_plot, "x") 
px_axis_y <- get_axis_grob(sample_plot, "y")

Here is how to use the extracted axis title:

grid.newpage() 
grid.draw(px_axis_x)

# grid.draw(px_axis_y)

Combine all the heroes:

big_plot <- arrangeGrob(grobs = all_plots)

Add the axis title, paying attention to the height ratio between the plots and the title:

big_plot_w_x_axis_title <- arrangeGrob(big_plot, 
                                       px_axis_x, 
                                       heights = c(10,1)) 
grid.newpage() 
grid.draw(big_plot_w_x_axis_title)

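Because the words differ in length, the panels take up slightly different amounts of space on the page.

So this looks a bit messy.

Normally facet_grid() or facet_wrap() would keep the panels tidy and aligned, but that does not work for this project because each panel has its own custom background image.

Use cowplot instead of arrangeGrob so that the axes of the panels line up vertically: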

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)
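A minimal sketch of what align = 'v' is for, assuming ggplot2 and cowplot are installed. The two toy data frames and their labels are invented; one panel has a much longer axis label than the other, which is the situation described above.

```r
library(ggplot2)
library(cowplot)

# Hypothetical data: the second panel has a much longer word label
toy_short <- data.frame(word = c("a", "b"), value = c(1, 2))
toy_long  <- data.frame(word = c("a much longer label", "b"), value = c(2, 1))

p_short <- ggplot(toy_short, aes(x = word, y = value)) +
  geom_col() +
  coord_flip()
p_long <- ggplot(toy_long, aes(x = word, y = value)) +
  geom_col() +
  coord_flip()

# Stack the two panels and align their plotting areas vertically
toy_grid <- plot_grid(plotlist = list(p_short, p_long), align = "v", nrow = 2)
print(toy_grid)
```

With vertical alignment the plotting areas of panels stacked in the same column share their left and right edges, so the bars start at the same horizontal position regardless of label length.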

Add the x-axis title as before, again minding the alignment:

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned, 
                                               px_axis_x, 
                                               heights = c(10,1))

And with that, the figure is done.

Finally, save it:

ggsave(filename = "Avengers_Word_Usage.png", plot = big_plot_w_x_axis_title_aligned, width = 12, height = 6.3)

Related coverage:

https://towardsdatascience.com/i-analyzed-marvel-movie-scripts-to-learn-what-each-avenger-says-most-2e5e7b6105bf
http://mattwinn.com/Plot_Avengers_data.html

[This is an original article by 大數據文摘 (Big Data Digest), a columnist organization of 51CTO; WeChat public account "大數據文摘 (id: BigDataDigest)".]


Editor: 趙寧寧  Source: 51CTO專欄
