Regex

Type

Lecture

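The lecture first covers Python's string-handling methods, then introduces regular expressions.

Introduction
Regex basics
Regex extensions
Convenient regex
Regex in Python and Pandas (Regex groups)
Comparison

Introduction

For text data, we usually need to canonicalize it and extract information.

Canonicalization: convert data that can appear in several forms into one standard form. Plain string methods such as replace, split, and lower are enough for this.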


def canonicalize_county(county_name):
    return (
        county_name
        .lower()               # lower case
        .replace(' ', '')      # remove spaces
        .replace('&', 'and')   # replace &
        .replace('.', '')      # remove dot
        .replace('county', '') # remove county
        .replace('parish', '') # remove parish
    )
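A quick usage sketch of the function above. The county spellings are hypothetical inputs chosen for illustration; the point is that different spellings of the same place collapse to one canonical key.

```python
def canonicalize_county(county_name):
    return (
        county_name
        .lower()               # lower case
        .replace(' ', '')      # remove spaces
        .replace('&', 'and')   # replace &
        .replace('.', '')      # remove dot
        .replace('county', '') # remove county
        .replace('parish', '') # remove parish
    )

# Hypothetical spellings of the same two places
names = ["De Witt County", "DeWitt", "St. John the Baptist Parish", "St John the Baptist"]
canonical = [canonicalize_county(n) for n in names]
print(canonical)
```

After canonicalization the two spellings in each pair are identical, so they can serve as a join key.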

Extract information into a new feature.


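# `first` is assumed to be one web-server log line with a bracketed timestamp;
# the value below is a hypothetical sample, not data from the lecture.
first = '127.0.0.1 - - [26/Jan/2014:10:47:58 -0800] "GET / HTTP/1.1" 200 2585'
pertinent = first.split("[")[1].split(']')[0]   # text between [ and ]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')
day, month, year, hour, minute, seconds, time_zone

Some string operations in Python and pandas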

operation	Python	pandas (Series)
transformation	s.lower(), s.upper()	ser.str.lower(), ser.str.upper()
replacement/deletion	s.replace(…)	ser.str.replace(…)
split	s.split(…)	ser.str.split(…)
substring	s[1:4]	ser.str[1:4]
membership	'ab' in s	ser.str.contains(…)
length	len(s)	ser.str.len()

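These methods lack flexibility, which is why we need regular expressions.

Regex basics

A formal language is a set of strings, usually described implicitly. Example: "the set of all strings of length less than 10 that contain 'data'".

A regular language is a formal language that can be described by a regular expression.

A regular expression ("regex") is a sequence of characters that specifies a search pattern.

For example, [0-9]{3}-[0-9]{2}-[0-9]{4} means: 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.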
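A minimal sketch of trying this pattern in Python. The test strings are made up for illustration; re.fullmatch requires the whole string to fit the pattern, while re.findall looks for the pattern anywhere inside a longer text.

```python
import re

pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Hypothetical candidates: only the first has the 3-2-4 digit layout
candidates = ["123-45-6789", "123-456-789", "12a-45-6789"]
results = {s: re.fullmatch(pattern, s) is not None for s in candidates}
print(results)

# The same pattern can also be located inside a longer string
found = re.findall(pattern, "id on file: 123-45-6789 (verified)")
print(found)
```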

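There is no need to memorize the syntax; it is enough to understand it and be able to write a pattern with a reference table at hand. You can practice at https://regex101.com/r/1SREie/1.

The table below lists some common operations: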

operation	order	example	matches	doesn’t match
concatenation	3	AABAAB	AABAAB	every other string
or	4	AA|BAAB	AA, BAAB	every other string
closure (zero or more)	2	AB*A	AA, ABBBBBBA	AB, ABABA
group (parenthesis)	1	A(A|B)AAB	AAAAB, ABAAB	every other string
group (parenthesis)	1	(AB)*A	A, ABABABABA	AA, ABBA

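Grouping simply means using () to group part of a pattern. |, *, and () are metacharacters: they operate on the characters next to them. * matches zero or more repetitions of the preceding character (or group); | means "or".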
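The rows of the operations table can be checked mechanically. The sketch below uses re.fullmatch so that a string counts as a match only when the entire string fits the pattern; the extra non-matching strings (such as "AAB" or "ACAAB") are added here for illustration and are not from the table.

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r"AABAAB",    ["AABAAB"],          ["AAB", "AABAABA"]),
    (r"AA|BAAB",   ["AA", "BAAB"],      ["AAB", "AABAAB"]),
    (r"AB*A",      ["AA", "ABBBBBBA"],  ["AB", "ABABA"]),
    (r"A(A|B)AAB", ["AAAAB", "ABAAB"],  ["AAB", "ACAAB"]),
    (r"(AB)*A",    ["A", "ABABABABA"],  ["AA", "ABBA"]),
]

def check(pattern, should_match, should_not_match):
    """Return True if every expected match and every expected rejection holds."""
    accepts = all(re.fullmatch(pattern, s) for s in should_match)
    rejects = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return accepts and rejects

table_ok = all(check(*case) for case in cases)
print(table_ok)
```

Note how AA|BAAB reads as "AA or BAAB", not "A, then A or B, then AAB": concatenation binds tighter than |, which is why parentheses are needed in A(A|B)AAB.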

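For example, to match a year we can write (1|2)[0-9]{3}: the first character is 1 or 2, followed by three characters, each in the range 0-9. Note: {} specifies how many times the preceding character must occur, so * is a special case of {} (it is equivalent to {0,}).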
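A short sketch of the year pattern, plus a check that * and {0,} accept the same strings. The sample strings are illustrative only.

```python
import re

year = r"(1|2)[0-9]{3}"
is_year = {s: bool(re.fullmatch(year, s)) for s in ["1999", "2024", "3024", "199"]}
print(is_year)

# * is the {0,} case of the counting syntax: both patterns agree on every sample
star, braces = r"AB*A", r"AB{0,}A"
samples = ["AA", "ABA", "ABBBA", "AB", "ABAB"]
same = all(
    bool(re.fullmatch(star, s)) == bool(re.fullmatch(braces, s))
    for s in samples
)
print(same)
```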

For example: give a regex that matches muun, muuuun, moon, moooon, etc. Your expression should match any even number of u's or o's except zero (i.e. don't match mn).

m(uu(uu)*|oo(oo)*)n
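One pattern that satisfies this exercise is m(uu(uu)*|oo(oo)*)n: a mandatory pair of vowels followed by zero or more further pairs, so the count is always even and never zero. The sketch below checks it against the strings from the exercise plus a few hypothetical odd-length and mixed-vowel strings.

```python
import re

even_vowels = r"m(uu(uu)*|oo(oo)*)n"

should_match = ["muun", "muuuun", "moon", "moooon"]
should_not_match = ["mn", "mun", "muuun", "mooon", "muon", "muoon"]

accepted = [s for s in should_match + should_not_match
            if re.fullmatch(even_vowels, s)]
print(accepted)
```

The star must apply to the whole pair, as in (uu)*; writing uu* instead would allow any count of two or more, including odd counts.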

Regex extensions

Examples:

Give a regular expression for any lowercase string that has a repeated vowel (noon, peel, festoon, loop, oodles, etc).

[a-z]*(aa|ee|ii|oo|uu)[a-z]*

Give a regular expression for any string that contains both a lowercase letter and a number.


(.*([0-9]).*([a-z]).*)|(.*([a-z]).*([0-9]).*)
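Both answers can be checked with re.fullmatch. This is a sketch; the rejected strings ("cat", "Noon", "ABC1", and so on) are hypothetical counterexamples added for illustration.

```python
import re

repeated_vowel = r"[a-z]*(aa|ee|ii|oo|uu)[a-z]*"
letter_and_digit = r"(.*([0-9]).*([a-z]).*)|(.*([a-z]).*([0-9]).*)"

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

vowel_hits = accepted(
    repeated_vowel,
    ["noon", "peel", "festoon", "loop", "oodles", "cat", "Noon"],
)
mixed_hits = accepted(
    letter_and_digit,
    ["a1", "1a", "abc 123", "abc", "123", "ABC1"],
)
print(vowel_hits)
print(mixed_hits)
```

The second pattern needs two alternatives because the digit may come before or after the lowercase letter.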

Convenient regex

Regex in Python and Pandas (Regex groups)

In Python, use the re library.


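import re

text = '<div><td valign="top">Moo</td></div>'  # single quotes outside, since the HTML contains double quotes
pattern = r"<[^>]+>"  # matches each <...> tag
re.sub(pattern, '', text)  # replaces every tag with the empty string; returns 'Moo'

In pandas, use ser.str.replace(...) or ser.str.findall(pattern).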


ser.str.replace(pattern, repl, regex=True)
Returns a Series with all instances of pattern in Series ser replaced by repl.


df["Html"].str.replace(pattern, '', regex=True)  # regex=True is required for pattern to be treated as a regex in recent pandas
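A self-contained sketch of the pandas version. The DataFrame contents and the new "Text" column name are hypothetical; only the "Html" column name and the tag pattern come from the notes.

```python
import pandas as pd

# Hypothetical data: one HTML fragment per row
df = pd.DataFrame({
    "Html": ['<div><td valign="top">Moo</td></div>', "<b>Oink</b>", "plain"],
})
pattern = r"<[^>]+>"

# Strip every tag; regex=True makes pandas treat pattern as a regex
df["Text"] = df["Html"].str.replace(pattern, "", regex=True)

# List the tags found in each row
tags = df["Html"].str.findall(pattern)

print(df["Text"].tolist())
print(tags.tolist())
```

Rows without any tag are left unchanged by str.replace and produce an empty list from str.findall.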


text = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""       
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"  # raw string, so \d reaches the regex engine unchanged
matches = re.findall(pattern, text)

# Output
[('03', '04', '53', 'Horse awakens.'),
 ('03', '05', '14', 'Horse goes back to sleep.')]
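Because each pair of parentheses is a capture group, the pieces can be used directly. The sketch below reads groups from a single re.search match and then turns each re.findall tuple into a (seconds since midnight, description) pair; the seconds conversion is an illustrative use of the groups, not something from the lecture.

```python
import re

text = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"

# re.search returns only the first match; group(0) is the whole match,
# group(1)..group(4) are the four capture groups
first_match = re.search(pattern, text)
whole = first_match.group(0)
hour = first_match.group(1)
event = first_match.group(4)

# Each findall result is a tuple with one entry per group
events = [
    (int(h) * 3600 + int(m) * 60 + int(s), what)
    for h, m, s, what in re.findall(pattern, text)
]
print(whole, hour, event)
print(events)
```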

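Cheat sheet


\d is [0-9]
\w is [a-zA-Z0-9_]
\s is a single whitespace character; \S is its complement
. matches any character (except a newline by default)
* means the preceding character occurs zero or more times
+ means the preceding character occurs at least once
? means the preceding character occurs zero or one time
{m,n} means the preceding character occurs between m and n times (inclusive)
() creates a subexpression (group)
^ and $ anchor the match to the start and the end of the string, respectively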
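Each entry of the cheat sheet can be tried out in a few lines. This is a sketch with made-up strings; the first part uses re.fullmatch, and the anchors are shown separately with re.search, which otherwise matches anywhere in the string.

```python
import re

# (pattern, string, expected result of re.fullmatch)
cases = [
    (r"\d\d",    "42",          True),
    (r"\w+",     "snake_case9", True),
    (r"\w+",     "two words",   False),  # a space is not a \w character
    (r"a\sb",    "a b",         True),
    (r"a\sb",    "a  b",        False),  # \s is exactly one whitespace character
    (r"a\s+b",   "a \t b",      True),
    (r"a\Sb",    "a-b",         True),
    (r"a.c",     "abc",         True),
    (r"a.c",     "a\nc",        False),  # . does not match a newline by default
    (r"ab*",     "a",           True),
    (r"ab+",     "a",           False),
    (r"ab?",     "abb",         False),
    (r"a{2,3}",  "aaa",         True),
    (r"a{2,3}",  "aaaa",        False),
    (r"(ab)+",   "ababab",      True),
]
all_ok = all(bool(re.fullmatch(p, s)) == expected for p, s, expected in cases)

# Without anchors, search finds "cat" inside a longer word; with ^ and $ it does not
unanchored = re.search(r"cat", "concatenate") is not None
anchored = re.search(r"^cat$", "concatenate") is not None
print(all_ok, unanchored, anchored)
```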

Comparison

Python String	re	pandas Series
s.lower(), s.upper()		ser.str.lower(), ser.str.upper()
s.replace(…)	re.sub(…)	ser.str.replace(…)
s.split(…)	re.split(…)	ser.str.split(…)
s[1:4]		ser.str[1:4]
	re.findall(…)	ser.str.findall(…), ser.str.extractall(…), ser.str.extract(…)
'ab' in s	re.search(…)	ser.str.contains(…)
len(s)		ser.str.len()
s.strip()		ser.str.strip()
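The sketch below puts several rows of the comparison table side by side on the same input. The Series contents are hypothetical; the calls shown are the ones named in the table.

```python
import re
import pandas as pd

s = "03:04:53 - Horse awakens."
ser = pd.Series([s, "no timestamp here"])

time_plain = r"\d\d:\d\d:\d\d"         # no groups: for replace / search / contains
time_groups = r"(\d\d):(\d\d):(\d\d)"  # groups: for findall / extract

# Replacement
re_replaced = re.sub(time_plain, "TIME", s)
pd_replaced = ser.str.replace(time_plain, "TIME", regex=True).tolist()

# Split on a dash with optional surrounding whitespace
re_parts = re.split(r"\s*-\s*", s)

# Membership
re_has_time = re.search(time_plain, s) is not None
pd_has_time = ser.str.contains(time_plain).tolist()

# Group extraction: findall keeps tuples per row, extract makes one column per group
pd_found = ser.str.findall(time_groups).tolist()
pd_extracted = ser.str.extract(time_groups)

print(re_replaced, pd_replaced, re_parts)
print(re_has_time, pd_has_time, pd_found)
print(pd_extracted)
```

ser.str.extract returns a DataFrame with one column per capture group and missing values for rows that do not match, which makes it convenient for building new features.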