python - Fastest way to replace space for underscore for a list of words in text -
given 10,000,000,000 lines of around 20-50 words each line, e.g.:
anarchism defined political philosophy holds state undesirable , unnecessary , or harmful . , others argue while anti-statism central , inadequate define anarchism . therefore , argue instead anarchism entails opposing authority or hierarchical organization in conduct of human relations , including , not limited , state system . proponents of anarchism , known " anarchists " , advocate stateless societies based on non - hierarchical free association s. subtle , anti-dogmatic philosophy , anarchism draws on many currents of thought , strategy . anarchism not offer fixed body of doctrine single particular world view , instead fluxing , flowing philosophy . there many types , traditions of anarchism , not of mutually exclusive . anarchist schools of thought can differ fundamentally , supporting extreme individualism complete collectivism . strains of anarchism have been divided categories of social , individualist anarchism or similar dual classifications . anarchism considered radical left-wing ideology , , of anarchist economics , anarchist legal philosophy reflect anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics . anarchism mass social movement has regularly endured fluctuations in popularity . central tendency of anarchism social movement has been represented anarcho-communism , anarcho-syndicalism , individualist anarchism being literary phenomenon nevertheless did have impact on bigger currents , individualists have participated in large anarchist organizations . many anarchists oppose forms of aggression , supporting self-defense or non-violence ( anarcho-pacifism ) , while others have supported use of coercive measures , including violent revolution , propaganda of deed , on path anarchist society . etymology , terminology term derives ancient greek ἄναρχος , anarchos , meaning " without rulers " , prefix ἀν - ( - , " without " ) + ἀρχός ( arkhos , " leader " , ἀρχή arkhē , " authority , sovereignty , realm , magistracy " ) + - ισμός ( - ismos , suffix - ιζειν , - izein " - izing " ) . " anarchists " term adopted maximilien de robespierre attack on left whom had used own ends during french revolution determined rid of , though among these " anarchists " there few exhibited social revolt characteristics of later anarchists . there many revolutionaries of nineteenth century contributed anarchist doctrines of next generation , such william godwin , wilhelm weitling , did not use word " anarchist " or " anarchism " in describing or beliefs . pierre-joseph proudhon first political philosopher call himself anarchist , making formal birth of anarchism mid-nineteenth century . since 1890s france , term " libertarianism " has been used synonym anarchism , used exclusively in sense until 1950s in united states ; use synonym still common outside united states . on other hand , use " libertarianism " refer individualistic free-market philosophy , referring free-market anarchism " libertarian anarchism " .
and let's have list of dictionary terms made of 1 or more words, e.g:
clinical anatomy clinical psychology cognitive neuroscience cognitive psychology cognitive science comparative anatomy comparative psychology compound morphology computational linguistics correlation cosmetic dentistry cosmography cosmology craniology craniometry criminology cryobiology cryogenics cryonics cryptanalysis crystallography curvilinear correlation cybernetics cytogenetics cytology deixis demography dental anatomy dental surgery dentistry philosophy political philosophy
and need find sentences contains of these terms , replace spaces between words within terms underscores.
for example, there sentence in text:
anarchism defined political philosophy holds state undesirable , unnecessary , or harmful .
and there's dictionary term political philosophy
in text. output sentence needs be:
anarchism defined political_philosophy holds state undesirable , unnecessary , or harmful .
i this:
dictionary = sort(dictionary, key=len) # replace longest terms first. line in text: term in dictionary: if term in line: line = line.replace(term, term.replace(' ', '_'))
assuming have 10,000 dictionary terms (d) , 10,000,000,000 sentences (s), complexity of using simple method o(d*s)
, right? is there faster , less brute-forcey way achieve same results?
is there way replace of terms spaces terms underscore each line? in avoiding inner loop.
would more efficient if index text using whoosh
first query index , replace terms? still need o(1*s)
replacements, right?
the solution doesn't need in python, if it's unix command tricks grep/sed/awk, it's fine, long subprocess.popen
-able.
please correct complexity assumptions if i'm wrong, pardon noobiness.
given sentence:
this sentence contains multiple phrases need replace phrases underscores, e.g. social political philosophy political philosophy under branch of philosophy , computational linguistics cognitive linguistics , psycho cognitive linguistics appears linguistics
and let's have dictionary:
cognitive linguistics psycho cognitive linguistics socio political philosophy political philosophy computational linguistics linguistics philosophy social political philosophy
the output should like:
this sentence contains multiple phrases need replace phrases underscores, e.g. social_political_philosophy political_philosophy under branch of philosophy , computational_linguistics cognitive_linguistics , psycho_cognitive_linguistics appears linguistics
and aim text file of 10 billion lines , dictionary of 10-100k phrases.
it may better split words, mapping words start of phrase full phrase, if need largest, instead of checking every item in dict need sort phrases appear length:
from collections import defaultdict def get_phrases(fle): phrase_dict = defaultdict(list) open(fle) ph: line in map(str.rstrip, ph): k, _, phr = line.partition(" ") phrase_dict[k].append(line) return phrase_dict itertools import chain def replace(fle, dct): open(fle) f: line in f: phrases = sorted(chain.from_iterable(dct[word] word in line.split() if word in dct) ,reverse=1, key=len) phr in phrases: line = line.replace(phr, phr.replace(" ", "_")) yield line
output:
in [10]: cat out.txt sentence contains multiple phrases need replace phrases underscores, e.g. social political philosophy political philosophy under branch of philosophy , computational linguistics cognitive linguistics , psycho cognitive linguistics appears linguistics in [11]: cat phrases.txt cognitive linguistics psycho cognitive linguistics socio political philosophy political philosophy computational linguistics linguistics philosophy social political philosophy in [12]: list(replace("out.txt",get_phrases("phrases.txt"))) out[12]: ['this sentence contains multiple phrases need replace phrases underscores, e.g. social_political_philosophy political_philosophy under branch of philosophy , computational_linguistics cognitive_linguistics , psycho_cognitive_linguistics appears linguistics']
a few other versions:
def repl(x): if x: return x.group().replace(" ", "_") return x def replace_re(fle, dct): open(fle) f: line in f: spl = set(line.split()) phrases = chain.from_iterable(dct[word] word in spl if word in dct) line = re.sub("|".join(phrases), repl, line) yield line def replace_re2(fle, dct): cached = {} open(fle) f: line in f: phrases = tuple(chain.from_iterable(dct[word] word in set(line.split()) if word in dct)) if phrases not in cached: r = re.compile("|".join(phrases)) cached[phrases] = r line = r.sub(repl, line) else: line = cached[phrases].sub(repl, line) yield line
Comments
Post a Comment