What is a good tool for cleaning text data at big-data scale?
In this article we introduce a new keyword search-and-replace algorithm: FlashText. FlashText is an efficient algorithm for searching and replacing characters in text, and its time complexity does not depend on the number of keywords being searched or replaced: for a document of N characters and a dictionary of M keywords, the time complexity is O(N). That makes it much faster than ordinary regular-expression matching, whose time complexity is O(M * N). The algorithm also differs slightly from the Aho-Corasick algorithm, because it does not match substrings.
FlashText is designed to match only complete words. For example, given the keyword {Apple}, the algorithm will not match the "apple" inside "I like Pineapple". It is also designed to match the longest string first. For example, given the keyword set {Machine, Learning, Machine Learning} and the document "I like Machine Learning", the algorithm matches only "Machine Learning", because that is the longest match.
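To make these two rules concrete, here is a minimal, self-contained sketch of whole-word, longest-match extraction. It is an illustration only, not the library's implementation: this version is quadratic, while the real algorithm uses a trie to achieve O(N).

```python
def extract_longest(text, keywords):
    """Whole-word, longest-match-first keyword extraction (illustrative sketch)."""
    words = text.split()
    found, i = [], 0
    while i < len(words):
        # try the longest candidate phrase starting at words[i] first
        for j in range(len(words), i, -1):
            phrase = ' '.join(words[i:j]).strip('.,')
            if phrase in keywords:
                found.append(phrase)
                i = j - 1  # resume after the matched phrase
                break
        i += 1
    return found

keywords = {'Machine', 'Learning', 'Machine Learning', 'Apple'}
print(extract_longest('I like Machine Learning', keywords))  # ['Machine Learning']
print(extract_longest('I like Pineapple', keywords))         # [] -- no match inside "Pineapple"
```

Note that "Machine Learning" wins over the shorter candidates "Machine" and "Learning", and that "Apple" does not fire inside "Pineapple" because matching is done on whole words only.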
An implementation of this algorithm, written in Python, is available on GitHub.
1. Introduction
In information retrieval, keyword search and replacement are very common problems. We often want to search a given text for specific keywords, or to replace specific keywords in the text.

For example:

Keyword search: suppose we have a software engineer's résumé (D), and a dictionary of 20k programming skills, corpus = {Java, python, javascript, machine learning, ...}. We want to find which terms from corpus appear in the résumé, i.e. corpus ∩ D.
Keyword replacement: another common use case is when we have a corpus of synonyms (different spellings representing the same term), e.g. corpus = {javascript: ['javascript', 'javascripting', 'java script'], ...}. In software engineers' résumés we want to use the standardized term, so we need to replace the synonyms in each résumé with a single standardized term.
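As a concrete illustration of this normalization task, a regex-based sketch might look as follows; the corpus below is an assumed toy example, and the helper name is hypothetical:

```python
import re

# hypothetical synonym corpus: standardized term -> known variants
corpus = {'javascript': ['javascript', 'javascripting', 'java script']}

def normalize(resume_text, corpus):
    """Replace every known variant with its standardized term (regex sketch)."""
    for standard, variants in corpus.items():
        # \b keeps matches anchored to whole words
        pattern = '|'.join(r'\b' + re.escape(v) + r'\b' for v in variants)
        resume_text = re.sub(pattern, standard, resume_text, flags=re.IGNORECASE)
    return resume_text

print(normalize('Skilled in Java Script and javascripting.', corpus))
# Skilled in javascript and javascript.
```

This works, but as the next paragraph explains, compiling and running one large alternation per standardized term is exactly the approach that becomes slow as the corpus and document volume grow.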
Regular expressions are the most common technique for handling these problems. Although regex handles them well, it becomes very slow once the data grows large: with millions of documents, the run time can stretch into days. As Figure 1 below shows, finding 15k keywords in a document of 10k terms takes a regex about 0.165 seconds, while FlashText needs only 0.002 seconds; on this problem FlashText is therefore roughly 82 times faster than regex.

As the number of characters to process grows, regex run time grows almost linearly, while FlashText's stays almost constant. In this article we focus on the performance difference between regular expressions and FlashText, describe the FlashText algorithm and how it works in detail, and present some benchmarks.
1.1 Regular expressions for keyword search
Regular expressions are a very flexible and effective pattern-matching technique. For example, searching a text for the pattern '\d{4}' matches any 4-digit number, such as 2017. In Python this looks as follows:
import re

compiled_regex = re.compile(r'\b2017\b|\b\d{4}\b')
compiled_regex.findall('In 2017 2311 is my birthday.')
# output ['2017', '2311']
Here '\b' denotes a word boundary; it matches against special characters such as 'space', 'period', 'new line', and so on.
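A quick demonstration that the same '\b'-anchored pattern matches regardless of which boundary character follows the keyword:

```python
import re

# the word boundary \b matches between a word character and a
# non-word character, so the same pattern finds the year whether
# it is followed by a space, a period, or a newline
for text in ['year 2017 ended', 'in 2017.', 'line\n2017\nbreak']:
    print(re.findall(r'\b\d{4}\b', text))
# ['2017'] each time
```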
1.2 Regular expressions for keyword replacement
We can also use regular expressions to build a replacement script that standardizes terms; for example, a Python script that replaces "java script" with "javascript":
import re

re.sub(r"\bjava script\b", 'javascript', 'java script is awesome.')
# output 'javascript is awesome.'
2. Flashtext
FlashText is an algorithm based on the Trie dictionary data structure and inspired by Aho-Corasick. It works as follows: first, it takes all the relevant keywords as input and builds a trie dictionary from them, as shown in Figure 3:
start and eot are two special characters that mark word boundaries, just like the regex word boundaries discussed above. This trie dictionary is the data structure we will use for both searching and replacing.
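In Python, such a trie is naturally represented as nested dicts. The helper below is an illustrative sketch; the key '_keyword_' stands in for the eot marker from Figure 3:

```python
def build_trie(keywords, eot='_keyword_'):
    """Build a character trie from a list of keywords using nested dicts."""
    trie = {}
    for word in keywords:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
        node[eot] = word  # store the standardized name at the end of the term
    return trie

trie = build_trie(['java', 'j2ee'])
print(trie)
# {'j': {'a': {'v': {'a': {'_keyword_': 'java'}}}, '2': {'e': {'e': {'_keyword_': 'j2ee'}}}}}
```

Both keywords share the 'j' node and then branch, which is what keeps lookups independent of the number of keywords in the dictionary.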
2.1 Searching with FlashText
For an input string (the document), we walk over the characters one by one. When a character sequence <\b>word<\b> in the document matches <start>word<eot> in the dictionary (start and eot being the start and end tags of the sequence), we consider it a complete match, and we output the standardized keyword corresponding to the matched sequence, as follows:
2.2 Replacing with FlashText
For an input string (the document), we again walk over the characters one by one, while building up an initially empty return string. Whenever a character sequence <\b>word<\b> finds no match in the trie dictionary, we simply copy the original characters into the return string. When a match is found in the trie dictionary, we instead copy the matched sequence's standardized form into the return string. The return string is therefore a copy of the input string in which only the matched sequences have been replaced, as follows:
2.3 The FlashText algorithm
The FlashText algorithm has three main parts, which we analyze one by one below:

building the trie dictionary;

keyword search;

keyword replacement.
2.3.1 Building the trie dictionary
To build the trie dictionary, we start from an empty node pointing to an empty dictionary; this node is the starting point for all words. To insert a word, we use its next character as a key in the current dictionary, pointing again to an empty dictionary, and repeat this until we reach the last character of the word. When we reach the end of the word, we insert the special character (eot) to mark the end of the term.
Input
Keyword w = c1c2c3...cn, where ci is an input character. The standardized term is denoted s.
Code: initializing FlashText and adding keywords to the dictionary
class FlashText(object):

    def __init__(self, case_sensitive=False):
        # end of term (eot) marker, also the key that stores the standardized name
        self._keyword = '_keyword_'
        self._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
        self.keyword_trie_dict = dict()
        self.case_sensitive = case_sensitive

    def add_keyword(self, keyword, clean_name=None):
        if not clean_name and keyword:
            clean_name = keyword
        if keyword and clean_name:
            # if both keyword and clean_name are not empty
            if not self.case_sensitive:
                # if not case_sensitive then lowercase the keyword
                keyword = keyword.lower()
            current_dict = self.keyword_trie_dict
            for letter in keyword:
                current_dict = current_dict.setdefault(letter, {})
            current_dict[self._keyword] = clean_name
Output
The code above creates a dictionary like the one shown in Figure 3.
2.3.2 Keyword search
Once all the keywords have been added to the trie dictionary, we can find the keywords occurring in an input string.
Input
String x = a1a2...an, where ai is the i-th character of the input string x.
Code: Python code to extract the keywords found in the input string.
def extract_keywords(self, sentence):
    keywords_extracted = []
    if not self.case_sensitive:
        # if not case_sensitive then lowercase the sentence
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:
            # if eot is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    keywords_extracted.append(longest_sequence_found)
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
        elif char in current_dict:
            # char is present in current dictionary position
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
        # if we are at the end of the sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                keywords_extracted.append(sequence_found)
        idx += 1
    return keywords_extracted
Output
Returns a list of all the standardized terms found in the input string x, as shown in Figure 4.
2.3.3 Keyword replacement
We use the same dictionary to replace the keywords in the input string with their standardized terms.
Input
Input string x = a1a2...an, where ai is the i-th character.
Code: Python code to replace keywords in the input string with their standardized terms.
def replace_keywords(self, sentence):
    new_sentence = ''
    orig_sentence = sentence
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_word = ''
    current_dict = self.keyword_trie_dict
    current_white_space = ''
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        current_word += orig_sentence[idx]
        # when we reach whitespace
        if char not in self.non_word_boundaries:
            current_white_space = char
            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    current_word_continued = current_word
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        current_word_continued += orig_sentence[idy]
                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = inner_char
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = ''
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                        current_word = current_word_continued
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    new_sentence += longest_sequence_found + current_white_space
                    current_word = ''
                    current_white_space = ''
                else:
                    new_sentence += current_word
                    current_word = ''
                    current_white_space = ''
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
                new_sentence += current_word
                current_word = ''
                current_white_space = ''
        elif char in current_dict:
            # we can continue from this char
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                current_word += orig_sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
            new_sentence += current_word
            current_word = ''
            current_white_space = ''
        # if we are at the end of the sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                new_sentence += sequence_found
        idx += 1
    return new_sentence
Output
Finds the terms in string x that need replacing, replaces them with the standardized terms, and outputs the result, as shown in Figure 5.
3. Benchmarking FlashText against regular expressions
As Figures 1 and 2 show, FlashText processes text much faster than regular expressions. Let us now run some benchmarks to demonstrate this in more detail.
3.1 Keyword search
We benchmark keyword search with the following Python code. First we randomly create a corpus of 100K words; then we select 1K words from this list to build a document.

We then pick k terms from the corpus, where k ∈ {0, 1000, 2000, ..., 20000}, search the document for these keywords with both regex and FlashText, and compare the results. The Python code is as follows:
from flashtext.keyword import KeywordProcessor
import random
import re
import string
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  | FlashText | Regex    ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
    # choose 1000 terms and create a string to search in
    all_words_chosen = random.sample(all_words, 1000)
    story = ' '.join(all_words_chosen)
    # get unique keywords from the list of words generated
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
    # compile Regex
    compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))
    # add keywords to Flashtext
    keyword_processor = KeywordProcessor()
    keyword_processor.add_keywords_from_list(unique_keywords_sublist)
    # time the modules
    start = time.time()
    _ = keyword_processor.extract_keywords(story)
    mid = time.time()
    _ = compiled_re.findall(story)
    end = time.time()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|')
# output: Data for Figure 1
3.2 Keyword replacement
The following Python code benchmarks keyword replacement.
from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  | FlashText | Regex    ')
print('-------------------------------')
for keywords_length in range(1, 20002, 1000):
    # choose 5000 terms and create a string to search in
    all_words_chosen = random.sample(all_words, 5000)
    story = ' '.join(all_words_chosen)
    # get unique keywords from the list of words generated
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
    # compile regex
    # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
    compiled_re = re.compile("|".join(rep.keys()))
    # add keywords to flashtext
    keyword_processor = KeywordProcessor()
    for keyword in unique_keywords_sublist:
        keyword_processor.add_keyword(keyword, '_keyword_')
    # time the modules
    start = time.time()
    _ = keyword_processor.replace_keywords(story)
    mid = time.time()
    _ = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
    end = time.time()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|')
# output: Data for Figure 2
3.3 Conclusion
As the comparisons above show, FlashText is much faster than regular expressions for both keyword search and keyword replacement, and this advantage becomes even more pronounced when processing large-scale data.
4. FlashText usage
4.1 Installation
pip install flashtext
4.2 Usage examples
4.2.1 Keyword extraction
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']
4.2.2 Keyword replacement
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
'I love New York and NCR region.'
4.2.3 Case sensitivity
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
['Bay Area']
4.2.4 Keywords without a standardized name
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
['Big Apple', 'Bay Area']
4.2.5 Adding multiple keywords at once
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
['product management', 'java']
4.2.6 Removing keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list / dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
['product management']
4.2.7 Adding characters as part of a word

Sometimes special characters, such as a space or '/', are treated as word boundaries. To redefine the word boundaries, we tell the algorithm which additional characters count as part of a word:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
[]
4.3 API documentation
For detailed API documentation, click here.
References:
Stack overflow
Flashtext: github
Flashtext: paper
Flashtext: medium
Flashtext: API doc