HW3


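This assignment analyzes tweets from several prominent Twitter users, covering the distribution of the devices the tweets were posted from and the distribution of posting times. It then analyzes the sentiment of the tweets.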

For the device distribution, a regular expression was used to extract the posting device from the tweet's source text.

For the posting-time distribution, .tz_convert was used to convert the timestamps to the desired time zone.
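A minimal sketch of the time zone step, assuming the timestamps are stored as UTC strings in a column named created_at and that the target zone is US Eastern time. The column name, sample data, and target zone are all assumptions for illustration, not taken from the assignment.

```python
import pandas as pd

# Hypothetical sample: tweet timestamps stored as UTC strings.
tweets = pd.DataFrame(
    {"created_at": ["2020-01-01 03:30:00", "2020-01-01 15:00:00", "2020-01-02 03:10:00"]}
)

# Parse as timezone-aware UTC, then convert to an assumed target zone.
utc_times = pd.to_datetime(tweets["created_at"], utc=True)
tweets["local_time"] = utc_times.dt.tz_convert("America/New_York")

# Count tweets per local hour of day.
hour_counts = tweets["local_time"].dt.hour.value_counts().sort_index()
print(hour_counts)
```

On a Series the method is reached through the .dt accessor, as here; on a DatetimeIndex it can be called directly as .tz_convert.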

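Sentiment analysis used the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon. VADER is a lexicon- and rule-based sentiment analysis tool designed specifically for sentiment expressed on social media. The approach: first strip the non-word characters (punctuation) from each tweet, then look up the sentiment score of each word, and finally sum the scores. Calling .str.split() followed by .explode() on a text column splits it into words with one row per word (not one column per word), which makes the sentiment score easy to compute.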



def to_tidy_format(df):
    tidy = (
        df["clean_text"]
        .str.split()
        .explode()
        .to_frame()
        .rename(columns={"clean_text": "word"})
    )
    return tidy
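The rest of the sentiment computation might look roughly like the sketch below: clean the text, put it in the tidy one-word-per-row format, look up each word's score, and sum per tweet. The lexicon here is a toy table with made-up scores, not the real VADER values, and the column names text and polarity are assumptions.

```python
import pandas as pd

# Toy lexicon with made-up polarity scores (NOT the real VADER values).
lexicon = pd.DataFrame({"polarity": [2.0, -1.5, 1.0]}, index=["great", "bad", "good"])

# Hypothetical tweets, indexed by a tweet id.
tweets = pd.DataFrame(
    {"text": ["Great day, good news!", "Bad traffic... bad mood"]},
    index=[101, 102],
)

# Lowercase, then replace anything that is not a word character or whitespace with a space.
tweets["clean_text"] = tweets["text"].str.lower().str.replace(r"[^\w\s]", " ", regex=True)

# One row per word; the tweet id is repeated in the index.
tidy = tweets["clean_text"].str.split().explode().to_frame(name="word")

# Look up each word's score (words missing from the lexicon count as 0), then sum per tweet.
tidy["polarity"] = tidy["word"].map(lexicon["polarity"]).fillna(0.0)
sentiment = tidy.groupby(level=0)["polarity"].sum()
print(sentiment)
```

Because explode() repeats the original index for every word, grouping on the index is enough to get back one total per tweet.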

Matching the content of HTML tags:

Delete/replace the tags and their attributes


import re
q2a_pattern = r"<[^>]+>"
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.sub(q2a_pattern, "", test_str)  # re.sub finds and replaces matches; here every tag is replaced with "", leaving 'Twitter for iPhone'

Extract the content of interest directly


q2b_pattern = r">(.*)<"
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.findall(q2b_pattern, test_str)

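In pandas, these two approaches correspond to ser.str.replace(pattern, "", regex=True) and ser.str.extract(pattern), respectively. Passing regex=True matters in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.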
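As a quick illustration, both pandas calls can be applied to a whole column of source strings at once. The Series below is sample data (the second entry is added only for illustration), and the variable names are hypothetical.

```python
import pandas as pd

# Sample source strings; the Android entry is illustrative only.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
])

# Approach 1: remove every tag, keeping only the text between them.
stripped = source.str.replace(r"<[^>]+>", "", regex=True)

# Approach 2: capture the text between '>' and '<'.
# expand=False returns a Series when the pattern has a single capture group.
extracted = source.str.extract(r">(.*)<", expand=False)

# Device distribution: number of tweets per device name.
device_counts = extracted.value_counts()
print(device_counts)
```

Both approaches give the same device names here; value_counts() then yields the device distribution directly.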