Webscrape

Data Mining on Forums

In this post I mine the posts on a public aviation forum to see whether the forum's posters, who are mostly pilots and aviation enthusiasts, had an idea of the issue that caused the crash of the Boeing 737 MAX. To do so, I retrieve all the threads on the forum, mine all the posts and replies in those threads, and analyze the occurrence of different keywords.

In [15]:
import requests
import urllib.request
import time
import re
from bs4 import BeautifulSoup
import collections
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
%matplotlib inline

First, we decide which forum to focus on. For this project, we focus on the Safety forum.

In [16]:
url = "https://www.airlinepilotforums.com/safety/index"

Actual Mining

We then collect the URLs of all the threads available on this forum by iterating over the paginated index pages (index1.html, index2.html, and so on).

In [17]:
allMainUrls = []
for urlPage in range(1, 55):
    # Index pages are named index1.html, index2.html, ...
    response = requests.get(url + str(urlPage) + ".html")
    soup = BeautifulSoup(response.text, "html.parser")
    # Thread links carry an id of the form "thread_title_<number>"
    links = soup.findAll("a", attrs={"id": re.compile("thread_title_[0-9]+")}, href=True)
    allMainUrls += [link["href"] for link in links]
allMainUrls[:10]
Out[17]:
['https://www.airlinepilotforums.com/safety/123205-boeing-reportedly-kept-faa-dark.html',
 'https://www.airlinepilotforums.com/safety/123174-roots-boeing-s-737-max-crisis.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash.html',
 'https://www.airlinepilotforums.com/safety/123017-boeing-737-max-crisis-leadership.html',
 'https://www.airlinepilotforums.com/safety/122991-survey-circulating.html',
 'https://www.airlinepilotforums.com/safety/122384-757-won-t-return-service.html',
 'https://www.airlinepilotforums.com/safety/122875-faa-urging-upping-braking-performance-margins.html',
 'https://www.airlinepilotforums.com/safety/122814-boeing-737-max-s-autopilot-has-problem.html',
 'https://www.airlinepilotforums.com/safety/122841-dl-engine-fail-hub-cone-vs-fan.html',
 'https://www.airlinepilotforums.com/safety/122778-vehicle-safety.html']

For every single one of these threads, we retrieve the number of pages available in the thread.

In [18]:
# Get the number of pages in each thread (1 when there is no pagination control)
pages = []
for url in allMainUrls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    maxPage = soup.findAll("td", {"class": "vbmenu_control", "style": "font-weight:normal"})
    if maxPage:
        # The pagination control reads "Page 1 of N"; take the trailing number
        pages.append(int(maxPage[0].text.split()[-1]))
    else:
        pages.append(1)
pages[:10]
Out[18]:
[1, 1, 87, 1, 1, 9, 1, 1, 1, 1]

Using the page counts we found for each thread, we collect all the URLs where a post or comment is available.

In [19]:
# Build the URL of every page of every thread; page 1 keeps the original URL,
# later pages append "-<page>" before the ".html" extension
allUrls = []
for i, url in enumerate(allMainUrls):
    for page in range(1, pages[i] + 1):
        if page != 1:
            allUrls.append(url[:-5] + "-" + str(page) + ".html")
        else:
            allUrls.append(url)
allUrls[:10]
Out[19]:
['https://www.airlinepilotforums.com/safety/123205-boeing-reportedly-kept-faa-dark.html',
 'https://www.airlinepilotforums.com/safety/123174-roots-boeing-s-737-max-crisis.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-2.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-3.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-4.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-5.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-6.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-7.html',
 'https://www.airlinepilotforums.com/safety/120514-ethiopian-737-max-8-crash-8.html']

We define a function that collects the comments on each page along with their respective dates.

In [20]:
# Mine the post text and post dates from every page of every thread
def pullThreadData(urls):
    allPosts = []
    allDates = []
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Each post body lives in a div with an id of the form "post_message_<number>"
        text = soup.findAll("div", attrs={"id": re.compile("post_message_[0-9]+")})
        posts = [t.text for t in text][:-1]  # drop the last message block on the page
        posts = [re.sub("\\t|\\n|\\r", " ", post).strip() for post in posts]
        allPosts += posts
        # Post dates appear as "MM-DD-201Y, HH:MM"; keep only the date part
        dates = [date[:10] for date in re.findall("[0-9]{2}-[0-9]{2}-201[0-9], [0-9]{2}:[0-9]{2}", str(soup))]
        allDates += dates

    # Group the posts by day and join each day's posts into one string
    d = defaultdict(list)
    for date, post in zip(allDates, allPosts):
        d[date].append(post)

    for key, value in d.items():
        d[key] = " ".join(value)

    return d
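
Since mining every page takes a while, it helps to sanity-check the function on a single thread page before launching the full scrape. The cell below is only an illustrative, unexecuted sketch; the sample name is just for demonstration.

In [ ]:
# Quick sanity check on one page before running the full scrape
sample = pullThreadData(allUrls[:1])
list(sample.items())[:1]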

We store the data we find in the dictionary below, keyed by date, where each value is the concatenation of that day's posts.

In [21]:
threadsByDay = pullThreadData(allUrls[1:])
In [24]:
list(threadsByDay.keys())[:10]
Out[24]:
['03-10-2019',
 '03-11-2019',
 '03-12-2019',
 '03-13-2019',
 '03-14-2019',
 '03-15-2019',
 '03-16-2019',
 '03-17-2019',
 '03-18-2019',
 '03-19-2019']
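
To get a feel for the grouped text, we can peek at the first few hundred characters of one day's concatenated posts (an illustrative, unexecuted cell; the date key comes from the list above):

In [ ]:
# Peek at the start of one day's concatenated posts
threadsByDay["03-10-2019"][:300]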

We define a set of stop words that should not be counted when the number of occurrences of each word is analyzed.

In [25]:
stopwords = [ "a", "about", " ","above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "posted" ,"not", ",", "." ,"no", "-","would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself","quote" , "yourselves" ]

We write a short algorithm that counts how many times each word occurred and then ranks the words by number of occurrences. Note that the running word counts are never reset between dates, so the top 100 recorded for each date reflects cumulative counts over all the days processed up to that point.

In [379]:
# Count word occurrences and record the most common words per date

# wordcount accumulates across all dates; for every word, add it to the
# dictionary if it doesn't exist, otherwise increase its count
wordcount = {}

# Number of most common words to record per date
n_print = 100
words = []
counts = []
dates = []

for date, post in threadsByDay.items():
    for word in post.lower().split():
        # Strip punctuation so "mcas," and "mcas" count as the same word
        word = word.replace(".", "")
        word = word.replace(",", "")
        word = word.replace(":", "")
        word = word.replace("\"", "")
        word = word.replace("!", "")
        word = word.replace("“", "")
        word = word.replace("‘", "")
        word = word.replace("*", "")
        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

    # Record the n_print most common words (cumulative counts) for this date
    word_counter = collections.Counter(wordcount)
    top = word_counter.most_common(n_print)
    for word, count in top:
        words.append(word)
        counts.append(count)
    dates += [date] * len(top)

We then place the data we obtained in a pandas DataFrame.

In [380]:
data = pd.DataFrame({"date" : dates, "word" : words, "count" : counts})
In [381]:
data
Out[381]:
date word count
0 03-10-2019 originally 70
1 03-10-2019 mcas 60
2 03-10-2019 system 37
3 03-10-2019 boeing 36
4 03-10-2019 aircraft 35
5 03-10-2019 trim 34
6 03-10-2019 off 28
7 03-10-2019 problem 24
8 03-10-2019 know 23
9 03-10-2019 max 21
10 03-10-2019 crashed 19
11 03-10-2019 pilots 19
12 03-10-2019 just 18
13 03-10-2019 now 18
14 03-10-2019 said 16
15 03-10-2019 right 16
16 03-10-2019 nose 16
17 03-10-2019 like 16
18 03-10-2019 accident 16
19 03-10-2019 time 15
20 03-10-2019 think 15
21 03-10-2019 want 15
22 03-10-2019 can 15
23 03-10-2019 multiple 15
24 03-10-2019 200 14
25 03-10-2019 stab 14
26 03-10-2019 adlerdriver 14
27 03-10-2019 many 14
28 03-10-2019 much 14
29 03-10-2019 cnn 14
... ... ... ...
229870 01-21-2011 report 1600
229871 01-21-2011 day 1593
229872 01-21-2011 probably 1563
229873 01-21-2011 guys 1559
229874 01-21-2011 ntsb 1556
229875 01-21-2011 fire 1532
229876 01-21-2011 system 1492
229877 01-21-2011 want 1483
229878 01-21-2011 lot 1481
229879 01-21-2011 engine 1481
229880 01-21-2011 airline 1480
229881 01-21-2011 last 1478
229882 01-21-2011 another 1466
229883 01-21-2011 can't 1462
229884 01-21-2011 thing 1462
229885 01-21-2011 speed 1440
229886 01-21-2011 maybe 1435
229887 01-21-2011 hours 1430
229888 01-21-2011 every 1410
229889 01-21-2011 made 1408
229890 01-21-2011 doesn't 1404
229891 01-21-2011 used 1402
229892 01-21-2011 aviation 1398
229893 01-21-2011 long 1397
229894 01-21-2011 always 1392
229895 01-21-2011 things 1377
229896 01-21-2011 high 1375
229897 01-21-2011 ground 1371
229898 01-21-2011 airlines 1366
229899 01-21-2011 wrong 1326

229900 rows × 3 columns

Finally, we write the data to a CSV file in the notebook's directory.

In [382]:
# Note: tab-separated output, despite the .csv extension
data.to_csv("word_frequency.csv", sep='\t')
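
Since matplotlib was imported at the top of the notebook, a natural follow-up is to chart how often a keyword such as "mcas" shows up per day. The cell below is only a sketch assuming the data frame built above is still in memory; the keyword and figure size are arbitrary choices, and keep in mind that the recorded counts are cumulative, as noted earlier.

In [ ]:
# Sketch: plot the recorded count of a single keyword over time
keyword = "mcas"
subset = data[data["word"] == keyword].copy()
subset["date"] = pd.to_datetime(subset["date"], format="%m-%d-%Y")
subset = subset.sort_values("date")

plt.figure(figsize=(10, 4))
plt.plot(subset["date"], subset["count"])
plt.title("Occurrences of '%s' by day" % keyword)
plt.xlabel("date")
plt.ylabel("count")
plt.show()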