Hw09 Solutions (UCB CS61A@2021 Fall)
Q2: Roman Numerals
Write a regular expression that finds any string of letters that resemble a Roman numeral and aren’t part of another word. A Roman numeral is made up of the letters I, V, X, L, C, D, M and is at least one letter long.
For the purposes of this problem, don’t worry about whether or not a Roman numeral is valid. For example, “VIIIII” is not a Roman numeral, but it is fine if your regex matches it.
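We need to write a regular expression. The key points from the problem statement:
- The possible characters are I, V, X, L, C, D, M. "Any one of these" can be expressed with a character class, []
- At least one character appears. "One or more" can be expressed in a regular expression with +
- The letters must not be part of another word. For this we can use \b. According to the documentation, it matches the empty string, but only at the beginning or end of a word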
import re

def roman_numerals(text):
    """
    Finds any string of letters that could be a Roman numeral
    (made up of the letters I, V, X, L, C, D, M).
    """
    return re.findall(r'\b[IVXLCDM]+\b', text)
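To see what \b contributes, the small sketch below compares the pattern with and without word boundaries. The helper names `with_boundary` and `without_boundary` and the sample sentence are made up for illustration and are not part of the homework.

```python
import re

def with_boundary(text):
    # Same pattern as roman_numerals: the run of letters must be a whole word.
    return re.findall(r'\b[IVXLCDM]+\b', text)

def without_boundary(text):
    # Without \b, Roman-numeral letters inside ordinary words also match.
    return re.findall(r'[IVXLCDM]+', text)

sample = "Chapter XIV of MIX tapes, by David"
print(with_boundary(sample))     # ['XIV', 'MIX']
print(without_boundary(sample))  # ['C', 'XIV', 'MIX', 'D']
```

The stray 'C' and 'D' come from "Chapter" and "David", which is exactly what the word boundaries rule out.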
Q3: CS Classes
On reddit.com, there is an /r/berkeley subreddit for discussions about everything UC Berkeley. However, there is such a large amount of CS-related posts that those posts are auto-tagged so that readers can choose to ignore them or read only them.
Write a regular expression that finds strings that resemble a CS class- starting with “CS”, followed by a number, and then optionally followed by “A”, “B”, or “C”. Your search should be case insensitive, so both “CS61A” and “cs61a” would match.
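The key points for the regular expression:
- Both the leading CS and the trailing A, B, or C are case insensitive. Listing both cases inside [] handles that
- There may be a space between CS and the number that follows. An optional space, written with ?, handles that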
def cs_classes(post):
    """
    Returns strings that look like a Berkeley CS class,
    starting with "CS", followed by a number, optionally ending with A, B, or C
    and potentially with a space between "CS" and the number.
    Case insensitive.
    """
    return bool(re.search(r'[Cc][Ss] ?\d+[ABCabc]?', post))
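As an alternative to listing both cases by hand, the `re.IGNORECASE` flag can do the same job. The sketch below is a hypothetical variant named `cs_classes_flag`, not the submitted solution. It should behave the same way on plain ASCII text.

```python
import re

def cs_classes_flag(post):
    # Hypothetical variant of cs_classes: re.IGNORECASE lets "cs" and the
    # optional letter suffix match in either case without spelling out both.
    return bool(re.search(r'cs ?\d+[abc]?', post, re.IGNORECASE))

print(cs_classes_flag("Is CS61A hard?"))             # True
print(cs_classes_flag("I love cs 170"))              # True
print(cs_classes_flag("No classes mentioned here"))  # False
```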
Q4: Time for Times
You’re given a body of text and told that within it are some times. Write a regular expression which, for a few examples, would match the following:
['05:24', '7:23', '23:59', '12:22', '00:00']
but would not match these invalid “times”
['05:64', '70:23']
You may find non-capturing groups helpful to use for this question.
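In other words, we need to match the time format, and the time must be valid (00:00 ~ 23:59). The four digits constrain one another:
- First digit: it can be a leading 0 or be omitted, and 1 is also valid. The tricky case is 2, because a time of the form 2?:?? can only go up to 23:59
- Second digit: when the first digit is not 2, it can be anything in 0 ~ 9
- Third and fourth digits: 00 ~ 59
Put this way, the problem turns out to be fairly complicated :(
Then I looked at the hint and found a handy syntax, (?:...), which groups part of the pattern without capturing it, so the contents of the group are not what gets returned. The optional trailing AM can be handled with (?:AM)?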
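def match_time(text):
    # The leading \b stops a match from starting in the middle of a number,
    # e.g. the "0:23" inside the invalid "70:23".
    return re.findall(r'\b(?:[01]?\d|2[0-3]):[0-5][0-9](?:AM)?', text)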
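The reason the non-capturing group matters is that `re.findall` returns the captured group instead of the whole match whenever the pattern contains exactly one capturing group. A minimal sketch of the difference, using a made-up sample sentence and simplified patterns:

```python
import re

text = "At 05:24AM, I had bagels before my coffee at 7:23."

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r'([01]?\d|2[0-3]):[0-5][0-9]', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'(?:[01]?\d|2[0-3]):[0-5][0-9]', text)

print(capturing)      # ['05', '7']
print(non_capturing)  # ['05:24', '7:23']
```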
Q5: Most Common Area Code
Write a function which takes in a body of text and finds the most common area code. Area codes must be part of a valid phone number.
To solve this problem, we will first write a regular expression which finds valid phone numbers and captures the area code. See the docstring of
area_codes
for specifics on what qualifies as a valid phone number.
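Everyone knows that a phone number starts with an area code. This question asks for the area code that appears most often: first match the area code of every phone number (area_codes), then return the one with the highest count.
The features the regular expression has to cover:
- A phone number has 10 digits
- The area code is the first three digits, and it may or may not be wrapped in parentheses ()
- Between the area code and the rest of the number, and between the next three digits and the last four, there may be a space or a -
How do we express the optional ()? The (?:...)? form from the previous question does the job 🤗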
def area_codes(text):
    """
    Finds all phone numbers in text and captures the area code. Phone numbers
    have 10 digits total and may have parentheses around the area code, and
    hyphens or spaces after the third and sixth digits.
    """
    # \b before the area code stops a match from starting inside a longer
    # run of digits, such as a 12-digit number.
    return re.findall(r'(?:\()?\b(\d{3})(?:\))?(?: |-)?\d{3}(?: |-)?\d{4}\b', text)
def most_common_code(text):
    """
    Takes in an input string which contains at least one phone number (and
    may contain more) and returns the most common area code among all phone
    numbers in the input. If there are multiple area codes with the same
    frequency, return the first one that appears in the input text.
    """
    area_codes_list = area_codes(text)
    cnts = [area_codes_list.count(e) for e in area_codes_list]  # count every area_code
    max_cnt_idx = cnts.index(max(cnts))  # get the index of the max value
    return area_codes_list[max_cnt_idx]
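The counting step above calls `list.count` once per element. A possible alternative, sketched here with a hypothetical helper `most_common_item` that takes the list of area codes directly, is `collections.Counter`. On Python 3.7+, `most_common` lists elements with equal counts in the order they were first seen, so the tie-breaking rule from the docstring should still hold.

```python
from collections import Counter

def most_common_item(codes):
    # Hypothetical helper: codes is a list of area-code strings, such as the
    # list returned by area_codes. Counter tallies them in a single pass, and
    # most_common keeps first-seen order among equal counts (Python 3.7+).
    return Counter(codes).most_common(1)[0][0]

# '123' and '456' both appear twice; '123' is seen first.
print(most_common_item(['123', '456', '456', '123', '789']))  # 123
```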