I have a table T1:
Col ------- 1 THE APPLE THE APPLE THE APPLE 123 THE APPLE 12/16 BEST THE APPLE
I want T2:
Result -------- THE APPLE
I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.Answer1:
This question requires a fair degree of complication to solve and it's running time will drastically increase with increasing string lengths and number of records. However, given that your table T1 is rather small, you might just manage with the below PL/pgSQL function.<h2>Algorithm</h2> <ol><li>Find the shortest value in T1(col). This is the longest possible match among all records. This is the candidate string.</li> <li>See if the candidate is present in all other rows of T1. If so, return the present candidate to the result set.</li> <li>Move the candidate over one position in the shortest value and back to step 2 until the candidate reaches the end of the shortest string.</li> <li>If matching candidates have been found, return from the function. Otherwise, shorten the candidate by 1 and start over from the beginning of the shortest string and move to step 2. If no more candidates can be extracted from the shortest string, then return
The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match
col to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.
The string comparison is case-sensitive; If you want to make it case-insensitive, change
ILIKE. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a
SELECT DISTINCT * to get rid of those duplicates.
CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$ DECLARE shortest varchar; -- The shortest string in T1(col) so the longest possible match candidate varchar; -- Candidate string to test sz_sh integer; -- Length of "shortest" l integer := 1; -- Starting position of "candidate" in "shortest" sz integer; -- Length of "candidate" fail boolean; -- Has "candidate" been found in T1(col)? found_one boolean := false; -- Flag if we found at least one match BEGIN -- Find the shortest string and its size, don't worry about multiples, need just 1 SELECT col, char_length(col) INTO shortest, sz_sh FROM T1 ORDER BY char_length(col) ASC NULLS LAST LIMIT 1; -- Get all the candidates from the shortest string and test them from longest to single char candidate := shortest; sz := sz_sh; LOOP -- Check rows in T1 if they contain the candidate string. -- Short-circuit as soon as a record does not match the candidate <<check_T1>> BEGIN FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP EXIT check_T1 WHEN fail; END LOOP; -- Block was not exited, so the candidate is present in all rows: we have a match RETURN NEXT candidate; found_one := true; END; -- Produce the next candidate IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest" -- Exit if we already have at least one matching candidate EXIT WHEN found_one; -- .. otherwise shorthen the candidate sz := sz - 1; l := 1; ELSE -- Exit with empty result when all candidates have been examined EXIT WHEN l = sz_sh; -- .. otherwise move one position over to get the next candidate l := l + 1; END IF; candidate := substring(shortest from l for sz); END LOOP; RETURN; END; $$ LANGUAGE plpgsql IMMUTABLE;
SELECT * FROM find_longest_string_in_T1() should do the trick.
Create some test data:
INSERT INTO T1 SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).Answer2:
If you're okay with getting the most commonly appearing word among all rows (the most common word that is separated by a space), you could use:
select word, count(distinct rn) as num_rows from( select unnest(string_to_array(col, ' ')) as word, row_number() over(order by col) as rn from tbl ) x group by word order by num_rows desc
<strong>Fiddle:</strong> <a href="http://sqlfiddle.com/#!15/bc803/9/0" rel="nofollow">http://sqlfiddle.com/#!15/bc803/9/0</a>
Note that this finds the word
apple among 4 rows, not 5. This is because
APPLE123 is one word, whereas
APPLE 123 would be two words, one of which is APPLE, and would count, but it doesn't.